IT06 IVR Thesis Report(Pratap Raju)

Interactive Voice Response System

Thesis submitted for the partial fulfillment of

Master of Technology in Information Technology

by

K.Pratap Kumar Raju 2000IT06

Under the guidance of

Dr. Rajendra Sahu

Indian Institute of Information Technology and Management

Gwalior-474005

January 2002


CERTIFICATE

This is to certify that the thesis entitled “Interactive Voice Response System”, being submitted to the Indian Institute of Information Technology and Management, Gwalior, for the award of Master of Technology in Information Technology by K. Pratap Kumar Raju, is a record of bona fide work carried out by him under my supervision and guidance. It is further certified that the work presented has reached the standard of a PG thesis and that it has not been submitted to any other university or institute for the award of any degree or diploma.

Date:

Place: Dr. Rajendra Sahu


ACKNOWLEDGEMENT

This work is the result of the inspiration, support, guidance, co-operation and facilities extended to me by people at all levels. I feel really proud to say that I have worked under the guidance of a cool and helpful personality, Dr. Rajendra Sahu, Assistant Professor of the institute.

I would like to express my gratitude to our M.Tech (I.T) Head, Prof. G K Sharma, for his encouragement and for providing a special dedicated lab for the thesis work. I would also like to thank Mr. Dosapati Suresh, project leader at Krisn Information Technologies Limited, Hyderabad, for his kind patience in clearing my technical queries.

I would like to thank our Director, Prof. D.P. Agarwal, for providing all the facilities and the working environment in the institute. I would also like to thank the entire institute faculty, who helped me directly or indirectly to complete my thesis work.

I also thank my colleague Mr. Sanjeev Manglani for helping me with queries on coding aspects of the Java language.

K. Pratap Kumar Raju


Abstract

Interactive voice response (IVR) systems have been around for some time to help guide customers to the appropriate business units or information. However, with the use of Internet technologies and wireless phones on the rise, coupled with rapid development in speech recognition and speech synthesis technologies, new doors are opening for voice technology to test demand in the marketplace. What is more convenient than picking up a phone? One can have instant access to the information needed to make a business operate more efficiently. Many businesses are betting that consumers will embrace any technology that provides real-time access to information piped through their regular telephone, wireless phone or voice-connected handheld device.

A system in which the input and/or output is through a spoken, rather than a graphical, user interface is called an interactive voice response system, or simply an IVR system. The web has made it possible to access information at the click of a mouse. In recent years the notion of a client has grown from the desktop computer to include other clients such as phones and mobile devices. This is where voice control comes in.

Having analyzed the need for voice systems, my dissertation work concentrates on how to develop an interactive voice response website. Voice Web technology makes use of open Internet standards (Web infrastructure) such as Hypertext Transfer Protocol (HTTP), Secure Sockets Layer (SSL), cookies and the Extensible Markup Language (XML) based VoiceXML to implement voice services over the telephone. The system proposes a three-tier architecture. The client side consists of a telephone or cell phone connected to the Public Switched Telephone Network (PSTN). The middle tier consists of a voice server equipped with a VoIP gateway, which enables PSTN users to connect to the voice application running in the IP network. This voice server identifies the call made by a user of the telephone network, initiates the voice application, presents the user with the required information and terminates the call when the user wants to exit from the application.

The application uses VoiceXML (VXML) to provide an efficient speech interface, the Java Speech Grammar Format (JSGF), a companion to the Java Speech Markup Language (JSML) in the Java Speech API, to define grammar files, and servlets to supply the information requested by the voice browser. The front end uses VXML, which consists of tags that recognize the user’s spoken input and record it for future use. VXML tags take input from the user in small phrases and send it as parameters to the back-end servlets. The servlets are written in Java; they accept the parameters from the front end and use them to get the necessary information from the database server. The database server stores the information of an enterprise or institute in tables, from which it can be presented to users. The system can be used with any phone, anywhere. One does not have to put up with entering data on a tiny keypad, but can instead interact with the service in a very natural manner.

The dissertation work aims at developing an IVR system for IIITM. It promises a good speech interface that makes the user feel comfortable interacting with the system, and an email reader that reads emails aloud so that one can listen to one’s emails rather than browse through them.


Contents

Chapter 1 Introduction
1.1 Introduction to IVR system
1.2 Typical voice applications
1.3 How to create and deploy IVR applications
1.4 How users can access an IVR application

Chapter 2 Problem Formulation
2.1 Problem Definition
2.2 Existing Flaws in the web architecture
2.2.1 Network deficiencies
2.2.2 Web architecture deficiencies
2.3 Examples of IVR applications
2.4 Literature Survey
2.5 Objectives of Study
2.6 Conclusion
2.7 Chapterization

Chapter 3 System Components
3.1 IVR system components
3.1.1 Telephony Network
3.1.2 Voice XML gateway
3.1.3 Voice Server
3.1.4 Interaction between Web Server and Voice Server
3.1.5 Web Server
3.1.6 TCP/IP Network

Chapter 4 Methodology
4.1 Implementation details
4.2 Application design and development
4.3 Development tools
4.3.1 Voice XML
4.3.2 Servlets
4.3.3 JSGF
4.3.4 Oracle database
4.4 Speech interface design
4.4.1 Methodology
4.5 IVR development aspects
4.6 Deployment Procedure
4.6.1 Working of the system
4.6.2 Practical issues in deploying IVR
4.6.3 Security issues

Chapter 5 Conclusion & Future Scope
5.1 Conclusion
5.1.1 Minimized fetch delays
5.1.2 New way of grammar development
5.1.3 Email reader
5.2 Future scope

References


Chapter 1

1.1 Introduction

IVR systems, also called Voice Response Units (VRUs), automate the handling of calls by interacting with the user. An IVR system takes input from the user by voice and provides enterprise information by connecting to one or more online databases. Popular IVR applications include bank-by-phone, flight schedule retrieval, and automated order entry and tracking. The common feature of these examples is that a caller's touch-tone or spoken requests are answered with verbal information derived from a "live" database. A significant percentage of installed IVR systems are used in front-end call centers to route calls away from costly live agents. Over time, IVR systems have evolved from simple systems accepting touch-tone input to advanced voice systems accepting near-natural-language voice input. IVR systems are mostly used in applications that require little information from the user and supply more information from the system, so that the speech recognition engine is not burdened with long inputs, which are difficult for it to understand.

1.2 Typical Voice Applications

Voice applications typically fall into one of two categories: queries and transactions.

Queries: In this scenario, a customer calls into a system to retrieve information from a

Web-based infrastructure. The system guides the customer through a series of menus

and forms by playing instructions, prompts, and menu choices using prerecorded audio

files or synthesized speech. The customer uses spoken commands or DTMF input to

make menu selections and fill in form fields. Based on the customer’s input, the

system locates the appropriate records in a back-end enterprise database. The system

presents the desired information to the customer, either by playing back prerecorded

audio files or by synthesizing speech based on the data retrieved from the database.


Examples of this type of self-service interaction include applications or voice portals

providing weather reports, movie listings, stock quotes, health-care-provider listings,

and customer service information (Web call centers).

Transactions: In this scenario, a customer calls into a system to execute specific

transactions with a Web-based back-end database. The system guides the customer to

provide the data required for the transaction by playing instructions, prompts, and

menu choices using prerecorded audio files or synthesized speech. The customer

responds using spoken commands or DTMF input. Based on the customer’s input, the

system conducts the transaction and updates the appropriate records in a back-end

enterprise database. Typically the system also reports back to the customer, either by

playing prerecorded audio files or by synthesizing speech based on the information in

the database records. Examples of this type of self-service interaction include

applications or voice portals for employee benefits, employee timecard submission,

financial transactions, travel reservations, calendar appointments, electronic

relationship management (ERM), sales automation, and order management.

1.3 Create and deploy voice applications

(i). An application developer can use a Voice Server to create a voice application

written in VoiceXML. The VoiceXML pages can be static, or they can be dynamically

generated using server-side logic such as CGI scripts, Java Beans, ASPs, JSPs, Java

servlets, etc.

(ii). Although VoiceXML applications can be written using any text editor, one may find it more convenient to use a graphical development environment that helps to create and manage VoiceXML files. WebSphere Studio and WebSphere Voice Toolkit support the development of VoiceXML-based applications. Optionally, a system administrator uses a Web application server program to configure and manage a Web server.

(iii). The developer publishes the VoiceXML application (including VoiceXML pages,

grammar files, any prerecorded audio files, and any server-side logic) to the Web

server.


(iv). The developer uses a desktop workstation and the Voice Server SDK to test the

VoiceXML application running on the Web server or local disk, pointing the

VoiceXML browser to the appropriate starting VoiceXML page.

(v). A telephony expert configures the telephony infrastructure, as described in the

product documentation for the applicable deployment platform.

(vi). The system administrator uses one of the deployment platforms to configure,

deploy, monitor, and manage a dedicated Voice Server.

(vii). The developer uses a real telephone to test the VoiceXML application running

on the Voice Server.
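Step (i) above notes that VoiceXML pages can be generated dynamically by server-side logic. The following is a minimal sketch of that idea in plain Java, not the implementation used in this thesis. The class name `VxmlMenuPage`, the method names, the prompt text and the target file names are all hypothetical; in a real deployment a servlet would write the resulting string to its HTTP response instead of returning it from a static method.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds a small VoiceXML menu page as a string. */
class VxmlMenuPage {

    /** Escapes characters that must not appear raw in XML text or attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Renders a VoiceXML menu with one prompt and one choice per map entry.
     * Map keys are the phrases the caller may speak; values are the target URLs.
     */
    static String render(String prompt, Map<String, String> choices) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\"?>\n");
        sb.append("<vxml version=\"1.0\">\n");
        sb.append("  <menu>\n");
        sb.append("    <prompt>").append(escape(prompt)).append("</prompt>\n");
        for (Map.Entry<String, String> e : choices.entrySet()) {
            sb.append("    <choice next=\"").append(escape(e.getValue())).append("\">")
              .append(escape(e.getKey())).append("</choice>\n");
        }
        sb.append("  </menu>\n");
        sb.append("</vxml>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical menu entries, kept in insertion order.
        Map<String, String> choices = new LinkedHashMap<>();
        choices.put("admissions", "admissions.vxml");
        choices.put("email", "email.vxml");
        System.out.print(render("Say admissions or email.", choices));
    }
}
```

Building the page from a map keeps the spoken choices and their targets in one place, so the same code can serve menus whose entries come from a database.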

1.4 Accessing the deployed voice application

Once the voice applications are deployed, users simply dial the telephone number provided for the application and are connected to the corresponding voice application. A typical call involves five steps: answer the telephone call, play a prompt, wait for the caller’s response, take action as directed by the caller, and complete the interaction.

(i). A user dials the telephone number provided for the application. The Voice Server answers the call and executes the application referenced by the dialed phone number.

(ii). The Voice Server plays a greeting to the caller and prompts the caller to indicate what information he or she wants. The application can use prerecorded greetings and prompts or synthesize them from text using the text-to-speech engine. If the application supports barge-in, the caller can interrupt the prompt if he or she already knows what to do.

(iii). The application waits for the caller’s response for a set period of time. The caller can respond either by speaking or by pressing one or more keys on a DTMF telephone keypad, depending on the types of responses expected by the application. If the response does not match the criteria defined by the application (such as the specific word, phrase, or digits), the voice application can prompt the caller to enter the response again, using the same or different wording. If the waiting period has elapsed and the caller has not responded, the application can prompt the caller again, using the same or different wording.

(iv). The application takes whatever action is appropriate to the caller’s response. For example, the application might update information in a database, retrieve information from a database and speak it to the caller, store or retrieve a voice message, launch another application, or play a help message. After taking action, the application prompts the caller with what to do next.

(v). The caller or the application can terminate the call. For example, the caller can

terminate the interaction at any time, simply by hanging up; the Voice Server can

detect if the caller hangs up and can disconnect itself. If the application permits, the

caller can use a command to explicitly indicate that the interaction is over (for

example, by saying “Exit”). If the application has finished running, it can play a

closing message and then disconnect.
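The five steps above can be read as a small state machine. The sketch below simulates one call under stated assumptions: caller input is a list of strings in which `null` stands for a timeout with no response and `"exit"` for the exit command, running out of inputs stands for the caller hanging up, and the retry limit `maxRetries` is a hypothetical parameter that the text does not specify. The class name `CallFlow` and the log labels are likewise illustrative only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical simulation of the answer / prompt / wait / act / complete cycle. */
class CallFlow {

    /** Runs one simulated call and returns a log of what the application did. */
    static List<String> run(List<String> callerInputs, Set<String> validChoices, int maxRetries) {
        List<String> log = new ArrayList<>();
        Iterator<String> inputs = callerInputs.iterator();
        log.add("ANSWER");                       // step (i): answer the call
        int failures = 0;                        // consecutive timeouts or mismatches
        while (true) {
            log.add("PROMPT");                   // step (ii): play a prompt
            if (!inputs.hasNext()) {             // no more input: caller hung up
                log.add("HANGUP");
                break;
            }
            String input = inputs.next();        // step (iii): wait for a response
            if (input == null) {
                log.add("NO_INPUT");             // waiting period elapsed
                failures++;
            } else if (input.equals("exit")) {
                log.add("GOODBYE");              // step (v): caller ends the interaction
                break;
            } else if (validChoices.contains(input)) {
                log.add("ACTION:" + input);      // step (iv): act on the response
                failures = 0;
            } else {
                log.add("NO_MATCH");             // response did not match the criteria
                failures++;
            }
            if (failures > maxRetries) {         // assumed policy: give up after too many retries
                log.add("GOODBYE");
                break;
            }
        }
        log.add("DISCONNECT");
        return log;
    }
}
```

For example, with valid choices `weather` and `email`, the inputs `weather`, a timeout, `pizza`, `exit` yield one action, one no-input reprompt, one no-match reprompt and then a closing message before disconnecting.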


Chapter 2

2.1 Problem Definition

Until recently, the World Wide Web has relied exclusively on visual interfaces to

deliver information and services to users via computers equipped with a monitor,

keyboard, and pointing device. In doing so, a huge potential customer base has been

ignored: people who (due to time, location, and/or cost constraints) do not have access

to a computer.

Many of these people do, however, have access to a telephone. Providing “conversational access” (that is, spoken input and audio output over a telephone) to Web-based data will permit companies to reach this untapped market. Users benefit from the convenience of using the mobile Internet for self-service transactions, while companies enjoy the Web’s relatively low transaction costs. And, unlike applications that rely on dual-tone multi-frequency (DTMF) (telephone key press) input, voice applications can be used in a hands-free or eyes-free environment, as well as by customers with rotary pulse telephone service or telephones in which the keypad is on the handset.

Even when a website has been made fully dynamic using many technologies, users may not find it comfortable, because it requires them to sit in a fixed place before a terminal to access the required information. Mobile users cannot perform a transaction or get the desired information through a desktop PC. What they want is to be able to do so from anywhere, through any network such as the PSTN, the Internet or a mobile network.

2.2 Flaws in the existing infrastructure to implement the new technology

2.2.1 Network Deficiencies:

The existing IP architecture provides poor quality of service for voice transfer, for the following reasons.

(i). The existing network uses the connectionless, unreliable Internet Protocol, so there is no guarantee that a packet will arrive at its destination. When packets are lost in transit because of network congestion, retransmitting them does not help, since a retransmitted voice packet arrives too late to be played.

(ii). Long propagation delays in an unreliable, congested network make listening to voice ineffective.

(iii). Packets may arrive out of order because they take different routes through the network. Out-of-sequence packets are not acceptable in the transfer of voice signals.

(iv). Devices in the network introduce a varying amount of delay between packets, which is called jitter. Large jitter makes the arrival time of packets unpredictable, which leads to poor voice quality.

(v). Voice transfer needs a mechanism that cancels the echo created as the voice travels through the network.

Taking these deficiencies into consideration, one should develop a VoIP gateway enabled network. A VoIP gateway is a must in order to adapt the PSTN to IP.
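Items (i), (iii) and (iv) above are commonly handled at the receiving end with a playout (jitter) buffer, which holds packets briefly and releases them in sequence-number order. This technique is not described in this thesis; the sketch below is only an illustration of the general idea. The class name `JitterBuffer`, the `depth` parameter and the skip-ahead policy for lost packets are assumptions, and sequence-number wraparound is ignored for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical reordering buffer for voice packets identified by sequence number. */
class JitterBuffer {
    private final TreeMap<Integer, String> pending = new TreeMap<>(); // held packets, sorted by sequence
    private final int depth;  // how many packets may wait before a gap is treated as a loss
    private int nextSeq;      // sequence number the player expects next

    JitterBuffer(int depth, int firstSeq) {
        this.depth = depth;
        this.nextSeq = firstSeq;
    }

    /** Accepts one packet and returns the payloads that can now be played, in order. */
    List<String> offer(int seq, String payload) {
        List<String> out = new ArrayList<>();
        if (seq < nextSeq) {
            return out;                          // arrived too late to be played: drop it
        }
        pending.put(seq, payload);
        while (!pending.isEmpty()) {
            int first = pending.firstKey();
            if (first == nextSeq) {
                out.add(pending.remove(first));  // in sequence: release for playout
                nextSeq++;
            } else if (pending.size() > depth) {
                nextSeq = first;                 // buffer full: treat the gap as lost, no retransmission
            } else {
                break;                           // keep waiting for the missing packet
            }
        }
        return out;
    }
}
```

A deeper buffer tolerates more jitter and reordering but adds delay before playout, which is the trade-off behind items (ii) and (iv).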

2.2.2 Deficiencies in the existing Client-Server model

The classical client-server model, with its three-tier architecture, does not support the recognition of voice data. Hence changes should be made to the existing model so that it can understand speech input. At present no browser supports voice recognition and no server can understand voice requests made by the client. Hence one needs a voice browser to run voice applications. A speech recognition engine is used to understand the user’s input in voice or DTMF, and a text-to-speech engine to speak the synthesized voice out loud. A voice server can be developed to implement these functions, together with an interface between the voice server and the web server.

Interactive voice response websites make the information available on the World Wide Web (WWW) reachable from public telephones and cell phones:

(i). IVR-enabled websites make information reachable even from telephones and cell phones. The user can get the information easily, at any time round the clock, just by dialing the particular server from a handset.

(ii). IVR systems enable users to carry out different types of transactions easily, e.g. checking bank balances and making money transactions.

(iii). IVR systems let you check your email using just a telephone. The system takes the necessary information from you and reads out the messages intended for you.

(iv). IVR systems are especially useful in call centers, to respond to customers by voice and to transfer calls to other information systems.

The flaw in existing websites is that they are not voice interactive. Making a website voice interactive lets it present information by voice, which has many advantages, some of which are mentioned above. A voice interactive system is especially helpful for information queries in which the client sends little information in the request and gets more information back from the server, since the information can then be obtained with ease. All you have to do is speak small query phrases and listen to the required information.

Taking the flaws of the existing system into consideration, one would develop a system that interacts effectively with the user by voice and provides the information in a form the user finds more comfortable. If the information can be made available to telephones and cell phones, it will be more advantageous and will help the substantial growth of the organization.


Interactive Voice Response (IVR) applications enable callers to query and modify database information over the telephone, using their own speech or by dialing digits on the telephone. Callers can use the touch-tone pad to input requests or just say what they want to do, such as ordering a product, obtaining a work schedule, or requesting account balance information, and the system speaks the database information back to the caller using text-to-speech. IVR offers customers and businesses a new level of freedom by enabling them to conduct transactions 24 hours a day, seven days a week.

Businesses of all sizes are realizing the tremendous benefits of IVR applications for their call processing and information delivery needs. IVR functionality links a phone system to a database to provide customers with 24-hour immediate access to account information via telephone. For example, a bank could make up to 10 data fields available for a caller’s checking account, 10 data fields for his or her savings account, and so on. To ensure security, IVR can be set up to allow the caller access to account information only if the caller enters a valid account number and the corresponding personal identification number.
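The access rule in the last sentence can be sketched as follows. This is an illustration under assumptions rather than a recommended design: the class name `AccountGate` and its in-memory table are hypothetical, a real system would look the account up in the enterprise database, and PINs would not be stored or compared in plain text.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical gate that admits a caller only with a valid account number and matching PIN. */
class AccountGate {
    // Illustrative in-memory table; stands in for a database lookup.
    private final Map<String, String> pinByAccount = new HashMap<>();

    void register(String account, String pin) {
        pinByAccount.put(account, pin);
    }

    /** Returns true only if both values are digit strings and the PIN matches the account. */
    boolean allowAccess(String account, String pin) {
        if (account == null || pin == null) {
            return false;
        }
        // Input arrives as DTMF digits or recognized digits, so reject anything else.
        if (!account.matches("\\d+") || !pin.matches("\\d+")) {
            return false;
        }
        String expected = pinByAccount.get(account);
        return expected != null && expected.equals(pin);
    }
}
```

Rejecting non-digit input before the lookup means a misrecognized phrase is treated as a failed attempt and leads to a reprompt rather than a query against the account data.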

IVR allows full connectivity to the most popular databases, including Microsoft Access, Microsoft Excel, Microsoft FoxPro and dBase. One can read information from and write information to databases, as well as query them and return the results. The application files can reside on the local system, an intranet, or the Internet. Users can access the deployed applications anytime, anywhere, from any telephony-capable device, and you can design the applications to restrict access to those who are authorized.

“Voice-enabling the World Wide Web” does not simply mean any of the following: using spoken commands to tell a visual browser to look up a specific Web address or go to a particular bookmark; having a visual browser throw away the graphics on a traditional visual Web page and read the rest of the information aloud; or converting the bold or italics on a visual Web page to some kind of emphasized speech. Voice applications provide an easy and novel way for users to surf or shop on the Internet: “browsing by voice.” Users can interact with Web-based data (that is, data available via Web-style architecture such as servlets, ASPs, JSPs, Java Beans, CGI scripts, etc.) using speech rather than a keyboard and mouse. The form that this spoken data takes is often not identical to the form it takes in a visual interface, due to the inherent differences between the interfaces. For this reason, transcoding (that is, using a tool to automatically convert HTML files to VoiceXML) may not be the most effective way to create voice applications. The voice server executes any created application when a caller dials in and allows callers to interact with the system using both speech and DTMF. Advanced database technology permits reading, writing, appending, searching and seeking database information.

It must be noted that voice-based navigation can get complex. When implementing information services on a web server, one can include a glut of information on the page and provide multiple overlapping paths to resources, to make sure users reach their required destination whatever their approach to searching for it. In voice applications, it becomes more important to define the information clearly. Voice data is transient; it depends on the user’s memory and ties in much more closely with preconceptions and experience. Finally, our ability to focus on any one voice source among many is limited. The need to avoid ambiguity in the question/answer pattern of voice interaction can lead to very complex systems, and it is very difficult to maintain location information, that is, to keep users aware of where they are in the application and where they are in relation to other parts of the application, such as the home page. It is characteristic of unpopular applications that the user feels lost and out of control.

The growing awareness of the need to cater for a variety of needs and devices has highlighted the importance of voice control services and also the importance of making them usable. Voice entry of textual data is much more convenient than using a phone keypad. Current developments in wireless technology and the increase in processor speed have made speech applications a reality. With powerful servers for speech processing and wireless thin clients such as mobile phones and PDAs, it is now possible to interact with the user using audio input and output.


In addition to all this, VoiceXML has made the dream of developing voice-web applications come true. VoiceXML is a relatively new XML-based specification designed for developing voice applications over the web. It has its roots in a language designed by Motorola by the name VoxML, another specification for presenting services and data in the voice medium.

VoiceXML is a member of the XML family; XML is a W3C specification for organizing data in a document using a set of elements. The rules for a document can be specified as either a document type definition (DTD) or a schema. VoiceXML is one such schema. It consists of a set of rules that detail how to describe a voice interaction using a markup language. VoiceXML is covered in more detail in the later chapters.

2.3 Some of the Interactive Voice Response applications

(i) Weather applications

In the US, most weather information applications are automated using IVR systems. User queries about the weather are answered automatically by the synthesized voice generated by the TTS engine.

(ii) Online Shopping applications

In online purchasing, queries about items are answered by an automated voice.

(iii) Online enquiry in railways and Airways

Information on the arrival and departure of trains and on reservation availability can be obtained from the automated response system. The caller speaks the train number and the source and destination stations. The IVR system automatically generates a query on the database, gets the information from the database and speaks the availability of the reservation out loud.


(iv) Telemedicine

IVR systems have now even entered the field of medicine. An electrocardiogram monitor acquires the patient’s ECG data and transmits it over a regular telephone line.

(v) Tele-education

Education from distant places (tele-education) has become possible now that IVR systems have come into the picture.

The Distance Education Centre at Monash University has introduced an Interactive Voice Response (IVR) system to provide distance education students with information about the dispatch of study materials and with enrolment information. By using your telephone you can dial the 1901 number and obtain up-to-date information about the dispatch of materials for the subjects you are undertaking. If you have not received your materials, or if materials are missing, you may leave a message on the system.

This service is available to callers using touch-tone telephones. If your phone makes a

different tone each time you press a number, then your phone is a touch-tone phone. If

you hear no tone or a number of clicks with each press of the numbers, then your

phone is operating in pulse mode and cannot access the system. You can purchase a

special adaptor from Telstra, which will enable you to use the system. Some

telephones have a switch or button, which allows you to change the mode to touch-

tone mode.

Calls are charged at the minimum rate of 35 cents per minute regardless of where you

call from, plus an initial charge of 15 cents. A higher charge will be incurred from

mobile telephones and public telephones. Students living overseas can access the

system by dialing 0055 31706 (preceded by the Australian International Code). The

rate is 75 cents per minute, plus the International access rate, which varies from

country to country.


(vi) Automatic Call answering in call centers

Before the voice web came into the picture, call centers used to spend a lot of money on operators to answer queries from users. Now call centers that operate fully on the voice web have the advantage of operating at low cost, since TTS and voice recognition engines have come into the picture.
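The railway enquiry in item (iii) above, where recognized phrases become a database query whose result is spoken back, can be sketched as follows. This is a minimal illustration under assumptions: the class name `AvailabilityQuery`, the table `reservation_status` and its column names are hypothetical, and the query is only built here, not executed. In a servlet the `?` placeholders would be bound through a JDBC `PreparedStatement`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical builder for a seat-availability query from recognized caller input. */
class AvailabilityQuery {
    final String sql;           // parameterized SQL text with ? placeholders
    final List<String> params;  // values to bind, in placeholder order

    private AvailabilityQuery(String sql, List<String> params) {
        this.sql = sql;
        this.params = params;
    }

    /** Normalizes a recognized station name so that it matches the stored form (assumed upper case). */
    private static String normalize(String spoken) {
        return spoken.trim().toUpperCase(Locale.ROOT);
    }

    static AvailabilityQuery forTrain(String trainNumber, String source, String destination) {
        String number = trainNumber.trim();
        if (!number.matches("\\d+")) {
            throw new IllegalArgumentException("train number must be digits: " + trainNumber);
        }
        // Placeholders keep recognized text out of the SQL string itself.
        String sql = "SELECT seats_available FROM reservation_status "
                   + "WHERE train_no = ? AND source_station = ? AND destination_station = ?";
        return new AvailabilityQuery(sql, Arrays.asList(number, normalize(source), normalize(destination)));
    }

    /** Builds the sentence that the text-to-speech engine would read to the caller. */
    static String spokenReply(String trainNumber, int seats) {
        if (seats <= 0) {
            return "Train " + trainNumber + " has no seats available.";
        }
        return "Train " + trainNumber + " has " + seats + (seats == 1 ? " seat" : " seats") + " available.";
    }
}
```

Keeping the reply as a plain sentence lets the same result be passed either to a text-to-speech engine or into a VoiceXML prompt.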

2.4 Literature Survey

Though voice has been used for communication since the beginning of the human race, only recent developments in technology have advanced research in IVR systems. Current research on IVR systems is mainly in areas related to VoiceXML, speech technologies and VoIP gateways.

VoiceGenie Technologies empowers every PC as a voice access point by making it a personal voice-enabled gateway. With the VoiceGenie voice-enabled gateway, PCs become more than personal computing devices; they become powerful telecommunications servers that allow you to control your office, home, and more. MyVoiceGenie aims to revolutionize how one communicates.

Emerging Digital Concepts (EDC) is developing solutions for clients using a number of leading speech recognition technologies, including SpeechWorks and Nuance. These technologies are applied to some of the state-of-the-art hardware available today, including Natural MicroSystems and Dialogic.

Computer Telephony Integration (CTI) has been provided as a service to various clients for over 3 years, now with a focus on helping clients maximize the capabilities of their existing CTI platforms. This can extend the lifecycle and the revenue-generating life of legacy CTI.

TigerJet Network provides Integrated software and silicon solutions for network

communication applications.


TigerJet's Gateway Manager application

Implement your own private VoIP gateway for Internet to regular phone calls. You can place a call using your own regular phone line from anywhere in the world.

Fig 2.3 Tjnet voice network

IP Phone integrates all popular choices for making Internet phone calls in a single easy

to use application with a central "one stop" interface.

The key features of IP Phone Center are:

PC to phone calls using Dynamic VoIP gateways

PC to phone calls using Static VoIP gateways (fixed IP)

PC to phone calls using Web call and your choice of provider

PC to PC calls over the Internet

Buddy List to make placing a call "one-click close"

One easy to use interface for all types of call

Support for handsets and regular phones

Nuance: Nuance, the leader in Voice Web software, delivers speech recognition, voice authentication, text-to-speech and voice-browsing products that make the information and services of enterprises, telecommunications networks and the Internet accessible from any telephone. SRI International, one of the leading voice technology research entities throughout the 1980s and 1990s, established Nuance as an independent company. Nuance offers its products through industry partners, platform providers, and value-added resellers around the world.

Cisco IP-powered Interactive Voice Response Solution: Cisco IP IVR is an IP-powered interactive voice response (IVR) solution that provides an open, extensible, and feature-rich foundation for the creation and delivery of IVR solutions via Internet technology. Cisco IP IVR automates the handling of calls by autonomously interacting with users. The IP IVR processes user commands to facilitate command-response features such as access to checking account information or user-directed call routing. The IP IVR also performs “prompt and collect” functions to obtain user data such as passwords or account identification.

Cisco IP IVR is the first application product in a suite of application products

completely written in Java and completely designed and constructed by Cisco to

facilitate concurrent multimedia communication processing.

SRC Telecom: On 4 June 2001, SRC Telecom, the telephony-based speech recognition arm of SRC (The Speech Recognition Company), announced that it is offering a VXML (VoiceXML) applications hosting service. SRC has installed a VXML platform that will provide third parties with the first Europe-based applications hosting environment.

“VXML is developed as a leading standard for the implementation of telephony based

speech applications” said Chris Hart, Managing Director SRC Telecom. “Our

decision to embrace this technology and offer a secure, high quality hosting service to

third parties signifies SRC Telecom’s leadership in delivering the latest telephony

speech solutions.”

By developing applications in VXML, organizations can benefit from the many advantages associated with an open, standards-based development environment. Most notably, VXML provides significant efficiencies during the application design process, ensures ease of software maintenance and allows greater portability of applications. However, the development of speech applications that achieve high end-user acceptance still requires substantial expertise in human factors engineering, dialogue design and speech systems integration.

Nortel and SpeechWorks- Nortel Networks and SpeechWorks will combine Nortel's speech processing platform OSCAR (Open Signal Computing and Analysis Resource) and its IVR (interactive voice response) system with Open Speech Recognizer, the speech recognition engine. Combined with voice recognition technology, this provides interactive access to the Web via a phone or a voice-capable browser, so that a user can pick up a phone and access the Web by voice.

2.5 Objective of Study

From the survey of the work done by many organizations, IVR systems can be developed in the web architecture using the following tools.

VoiceXML is found to be a powerful language for developing voice applications. It consists of tags that recognize the user's voice and also record it. It is reasonable to work with, as it provides a strong platform on which to run voice applications.

JSGF provides a way to define the grammar files that help the system check whether the user input is valid. Short phrases or words can be declared as options in the grammar, and the user is required to speak one of these options in order to select it. Voice servers are being developed by many companies to make it possible to deploy applications that are reachable from both the PSTN and IP networks. One of the popular ones is the voice server developed by IBM. It has several versions, which run on Windows 2000, Windows NT 4.0 and even Linux. Taking into consideration the efficiency of Windows NT 4.0 and the simplicity of its user interface, the voice server for Windows NT 4.0 is the best option. The latest versions support new tags that help generate synthesized speech comparable to the human voice. Java is one of the tools for developing server applications at the back end. Servlets, which are written in Java, and the special APIs designed for various tasks work satisfactorily to process the requests from the voice browser.

Using the above-mentioned tools, namely the voice server (which includes the voice browser, the voice recognition engine and the TTS engine) and the Java Speech Grammar Format, the dissertation work creates a voice application that promises an efficient speech user interface and a user-friendly environment.

2.6 Conclusion

This chapter analyzes the tools and technologies for developing voice applications. It reviews the various tools and finds that VXML alone has the potential to serve as the front end of future IVR systems. Using the Windows-supported Voice Server development kit from IBM, Java Speech Markup Language, and Java Servlets at the back end, one can develop a full-fledged voice application that can be deployed on the web architecture.

2.7 Chapterization

The subsequent chapters deal with the system architecture, problem formulation, and

designing and development aspects of the system.

Chapter 3

3.1 IVR System components

The IVR system makes use of a three-tier architecture, which is shown in Figure 3.1. In the three-tier architecture, the request from the user is handled by the voice server, which sits separately at the front end of the application. The requests from the voice server are sent to the web server located in the middle tier. The web server, with the support of the database server, processes the request and sends back the information requested by the client. Distributing the load across different tiers is an added advantage of the three-tier architecture of IVR systems.

Fig 3.1 Voice enabled web architecture (phone connected to a PSTN, VOIP gateway, voice server, web server, Oracle database server)

The system consists of the following components at different levels.

(i) Telephone Network.

(ii) VXML gateway.

(iii) Voice Server.

(iv) Web application Server.

(v) Database Server.

3.1.1 Telephony Network is a PSTN (Public Switched Telephone Network), a regular analog line or lines coming through a PBX (Private Branch Exchange) system, ISDN (Integrated Services Digital Network) lines, or a VoIP (Voice over IP) network. The telephony network is connected to the VoiceXML gateway. The telephones can be regular phones, or IP (Internet Protocol) phones if connected to the VoIP network.

3.1.2 VoiceXML Gateway A gateway is used to transfer data between two networks that adopt different protocols and different data formats. A VoIP gateway is used to connect the PSTN and an IP network. The IP network makes use of the TCP/IP protocol suite, which transfers data in packets. The PSTN transmits raw data bits through the network, in a format completely different from that of a TCP/IP network, and it uses separate signaling and switching processes (a control plane and a data plane). The VoIP gateway emulates the telephone network in the IP network.

A VoIP gateway consists of a series of digital signal processors, which perform the following functions.

(i) Voice Compression

As voice requires large bandwidth, the voice signals need to be compressed to the desired level with as little loss as possible of the information carried by the signals. Compressing the signal and converting it back into the original signal is done by the codec (a combination of a coder and a decoder). Voice is digitized and compressed using digital modulation techniques such as pulse code modulation.

(ii) Tone Detection/Generation

When the user lifts the handset, it is the function of the gateway to generate the tone and to detect the tones of the DTMF input, that is, the destination number. A routing server maps this number to an Internet address to identify the destination node.

(iii) Echo Cancellation

Echo generated when voice travels through the medium is removed by an echo canceller.

(iv) Silence Suppression

Silence is usually observed between sentences when a person speaks. Transmitting silence wastes bandwidth. A silence detector (voice activity detector) is employed to detect silence and remove it, improving bandwidth utilization.
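The idea behind silence suppression can be illustrated with a minimal sketch. It assumes 16-bit PCM samples, a fixed frame size and a simple mean-amplitude threshold; the class name, the method name and the threshold value are hypothetical, and a real gateway DSP would use a more elaborate voice activity detector.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of energy-based silence suppression; not actual gateway DSP code. */
final class SilenceSuppressor {

    /**
     * Splits the samples into fixed-size frames and keeps only the frames whose
     * mean absolute amplitude reaches the threshold (an assumed tuning value).
     * A trailing partial frame is ignored.
     */
    static List<short[]> keepVoicedFrames(short[] samples, int frameSize, double threshold) {
        List<short[]> voiced = new ArrayList<>();
        for (int start = 0; start + frameSize <= samples.length; start += frameSize) {
            long sum = 0;
            for (int i = start; i < start + frameSize; i++) {
                sum += Math.abs(samples[i]);
            }
            double meanAmplitude = (double) sum / frameSize;
            if (meanAmplitude >= threshold) {
                short[] frame = new short[frameSize];
                System.arraycopy(samples, start, frame, 0, frameSize);
                voiced.add(frame);
            }
        }
        return voiced;
    }

    public static void main(String[] args) {
        // Quiet frame, loud frame, quiet frame: only the loud one should survive.
        short[] samples = {0, 1, -1, 0, 900, -800, 1000, -950, 0, 0, 1, 0};
        System.out.println("voiced frames: " + keepVoicedFrames(samples, 4, 100.0).size());
    }
}
```

Only the frames that are kept would be packetized and sent over the IP network, which is where the bandwidth saving comes from.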

Fig 3.1.2 VOIP Gateway (a microprocessor and a bank of DSPs between the PSTN and the IP network)

3.1.3 Voice Server The voice server mainly consists of a speech recognition engine, a text-to-speech engine and a DTMF simulator, as shown in Figure 4.1.

Speech recognition is the ability of a computer to decode human speech and convert it to text. To convert spoken input to text, the computer first parses the input audio stream and then converts that information to text output. Recognition proceeds as follows. The developer creates a series of speech recognition grammars defining the words and phrases that can be spoken by the user, and specifies where each grammar should be active within the application. When the application runs, the speech recognition engine processes the incoming audio signal and compares the sound patterns to the patterns of basic spoken sounds, trying to determine the most probable combination that represents the audio input. Finally, the speech recognition engine compares the sounds to the list of words and phrases in the active grammar(s). Only words and phrases in the active grammars are considered as possible speech recognition candidates. Any word for which the speech recognizer does not have a pronunciation is given one and is flagged as an unknown word.

The key determinants of speech recognition accuracy are audio input quality, interface design, and grammar design. Audio quality is influenced by such factors as the choice of input device, the speaking environment, and certain user characteristics. The input device may be a microphone connected to a desktop workstation; for applications deployed using the WebSphere Voice Server, it could be a regular telephone, cordless telephone, speakerphone, or cellular telephone. The speaking environment could be a car, the outdoors, a crowded room, or a quiet office. Relevant user characteristics include accent, fluency in the input language, and any atypical pronunciations. While many of these factors may be beyond the developer's control, one should nevertheless consider their implications when designing applications.

Users will achieve the best possible speech recognition with a high-quality input device that gives a good signal-to-noise ratio. For desktop testing, use one of the microphones listed at http://www.ibm.com/viavoice. Speech clarity is a significant contributor to audio quality. Adult native speakers who speak clearly (without over-enunciating or hesitating) and position the microphone or telephone properly achieve the best recognition; other demographic groups may see somewhat variable performance.

The design of the application interface has a major influence on speech recognition accuracy. Because only words, phrases, and DTMF key sequences from active grammars are considered as possible speech recognition candidates, what one chooses to put in a grammar and when one chooses to make that grammar active have a major impact on speech recognition accuracy.

Text-to-speech conversion is the ability of a computer to “read out loud” (that is, to

generate spoken output from text input). Text-to-speech is often referred to as TTS or

speech synthesis. To generate synthesized speech, the computer must first parse the

input text to determine its structure and then convert that text to spoken output. One

can improve the quality of TTS output by using the speech markup elements provided

by the VoiceXML language, which is described in the subsequent chapters. TTS

prompts are easier to maintain and modify than recorded audio prompts. For this

reason, TTS is typically used during application development.

VoiceXML browser is the implementation of the interpreter context as defined in the

VoiceXML 1.0 specification. One of the primary functions of the VoiceXML browser

is to fetch documents to process. The request to fetch a document can be generated

either by the interpretation of a VoiceXML document, or in response to an external

event. The VoiceXML browser manages the dialog between the application and the

user by playing audio prompts, accepting user inputs, and acting on those inputs. The

action might involve jumping to a new dialog, fetching a new document, or submitting

user input to the Web server for processing. The VoiceXML browser is a Java

application. The Java console provides trace information on the prompts played,

resource files fetched, and user input recognized; other than this and the DTMF

Simulator GUI, there is no visual interface. For more information, see “Using the

Trace Mechanism” and “Interactions with the DTMF Simulator”, respectively.

When a voice application is deployed in a telephony environment, users are allowed to provide DTMF (telephone keypress) input in addition to spoken input. The DTMF Simulator is a GUI tool that makes it possible to simulate DTMF tones on a desktop workstation. The VoiceXML browser starts the DTMF Simulator automatically, unless one specifies the -Dvxml.gui=false Java system property when starting the VoiceXML browser. If the DTMF Simulator GUI window is closed, the only way to restart it is to stop and restart the VoiceXML browser. The DTMF Simulator plus a desktop microphone and speakers take the place of a telephone during desktop testing, making it possible to debug VoiceXML applications without having to connect to telephony hardware and the PSTN (Public Switched Telephone Network) or a cellular GSM (Global System for Mobile Communications) network.

Using the DTMF Simulator, one can simulate a telephone keypress event by pressing the corresponding key on the computer keyboard or clicking the corresponding button on the DTMF Simulator GUI, shown in Figure 4.1. For example, if the application prompt is “Press 5 on your telephone keypad,” one can simulate a user response during desktop testing by clicking the 5 button on the DTMF Simulator GUI or pressing the 5 key on the computer keyboard while the cursor focus is in the DTMF Simulator GUI window. The VoiceXML browser will interpret the input as a 5 pressed on a DTMF telephone keypad. If the length of valid DTMF input strings is variable, use the # key to terminate DTMF input.

Figure 2. DTMF Simulator GUI
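For background, on a real telephone each keypress is sent as a pair of tones, one from a low-frequency (row) group and one from a high-frequency (column) group. The following sketch generates such a tone pair and recovers the key with the Goertzel algorithm. It is an illustration only: the class and method names are hypothetical, the 8000 Hz sample rate and the 400-sample window are assumptions, and it is not the code used by the DTMF Simulator or the gateway.

```java
/** Hypothetical sketch: DTMF tone generation and Goertzel-based detection. */
final class DtmfTones {
    static final int SAMPLE_RATE = 8000;               // assumed telephony sampling rate in Hz
    static final double[] ROWS = {697, 770, 852, 941}; // low-group frequencies in Hz
    static final double[] COLS = {1209, 1336, 1477};   // high-group frequencies in Hz
    static final String[] KEYS = {"123", "456", "789", "*0#"};

    /** Generates n samples of the two-tone signal for one keypad character. */
    static double[] generate(char key, int n) {
        for (int r = 0; r < KEYS.length; r++) {
            int c = KEYS[r].indexOf(key);
            if (c >= 0) {
                double[] out = new double[n];
                for (int i = 0; i < n; i++) {
                    double t = (double) i / SAMPLE_RATE;
                    out[i] = 0.5 * Math.sin(2 * Math.PI * ROWS[r] * t)
                           + 0.5 * Math.sin(2 * Math.PI * COLS[c] * t);
                }
                return out;
            }
        }
        throw new IllegalArgumentException("Not a DTMF key: " + key);
    }

    /** Goertzel algorithm: signal power near a single target frequency. */
    static double power(double[] x, double freq) {
        double coeff = 2 * Math.cos(2 * Math.PI * freq / SAMPLE_RATE);
        double s1 = 0, s2 = 0;
        for (double sample : x) {
            double s0 = sample + coeff * s1 - s2;
            s2 = s1;
            s1 = s0;
        }
        return s1 * s1 + s2 * s2 - coeff * s1 * s2;
    }

    /** Picks the strongest row and column tone; no silence threshold is applied here. */
    static char detect(double[] x) {
        return KEYS[strongest(x, ROWS)].charAt(strongest(x, COLS));
    }

    private static int strongest(double[] x, double[] freqs) {
        int best = 0;
        for (int i = 1; i < freqs.length; i++) {
            if (power(x, freqs[i]) > power(x, freqs[best])) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect(generate('5', 400)));
    }
}
```

A production detector would also check that the tone power exceeds a minimum level before reporting a key, so that silence or speech is not mistaken for a keypress.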

Interactions with Text-to-Speech and Speech Recognition Engines

During initialization, the VoiceXML browser starts the TTS and speech recognition

engines. The VoiceXML browser uses telephony acoustic models in order to simulate

the behavior of the final deployed telephony application as closely as possible in a

desktop environment. As the VoiceXML browser processes a VoiceXML document, it

plays audio prompts using text-to-speech or recorded audio; for text-to-speech output,

it interacts with the TTS engine to convert the text into audio. Based on the current

dialog state, the VoiceXML browser enables and disables speech recognition

grammars. When the VoiceXML browser receives user audio input, the speech

recognition engine decodes the input stream, checks for valid user utterances as

defined by the currently active speech recognition grammar(s), and returns the results

to the VoiceXML browser. The VoiceXML browser uses the recognition results to fill

in form items or select menu options in the VoiceXML application. If the input is

associated with a <record> element in the VoiceXML document, the VoiceXML

browser stores the recorded audio. As the VoiceXML browser makes transitions to

new dialogs or new documents, it enables and disables different speech recognition

grammars, as specified by the VoiceXML application. As a result, the list of valid user

utterances changes. If the VoiceXML browser encounters an <ibmlexicon> element in

a VoiceXML document, it interacts with the speech recognition and TTS engines to

add or change the pronunciation of a word for the duration of the current VoiceXML

browser session.

3.1.4 Interactions with the Web Server and Enterprise Data Server

VoiceXML applications can be stored on any Web server running on any platform. In this work, however, Java Web Server is used to reply to the requests generated by the VXML documents. When the VoiceXML browser starts, it sends an HTTP request over the LAN or Internet to request an initial VoiceXML document from the Web server. The requested VoiceXML document can contain static information, or it can be generated dynamically from data stored in an enterprise database using the same type of server-side logic (CGI scripts, Java Beans, ASP, JSP, Java Servlets, etc.) that is used to generate dynamic HTML documents.

The VoiceXML browser interprets and renders the document. Based on the user’s

input, the VoiceXML browser may request a new VoiceXML document from the Web

server, or may send data back to the Web server to update information in the back-end

database. The important thing is that the mechanism for accessing your back-end

enterprise data does not need to change.
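As a minimal sketch of dynamic generation, the helper below renders a piece of text (for example, a value read from the database) as a small VoiceXML 1.0 document. The class and method names are hypothetical; in the architecture described here, a servlet would write the returned string to the HTTP response.

```java
/** Hypothetical helper that renders text as a one-form VoiceXML 1.0 document. */
final class VxmlPage {

    /** Escapes the characters that are special in XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a document whose single block is spoken to the caller by the TTS engine. */
    static String speak(String message) {
        return "<?xml version=\"1.0\"?>\n"
             + "<vxml version=\"1.0\">\n"
             + "  <form>\n"
             + "    <block>" + escape(message) + "</block>\n"
             + "  </form>\n"
             + "</vxml>\n";
    }

    public static void main(String[] args) {
        System.out.print(speak("Results & placements"));
    }
}
```

Escaping matters because database text may contain characters such as & or <, which would otherwise make the generated document ill-formed and cause the voice browser to reject it.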

3.1.5 Web server runs the application logic, and may contain a database or an interface to an external database or transaction server.

3.1.6 TCP/IP (Transmission Control Protocol/Internet Protocol) is the packet-based network that connects the application server and the voice server via HTTP.

Chapter 4 Methodology

4.1 Implementation Details

After identifying the system components and their details, the present chapter discusses the implementation of the IVR application. The application makes use of VoiceXML at the front end. The VoiceXML documents run on a speech (voice) browser. The voice browser executes the tags one by one in the order specified by the form interpretation algorithm. The form interpretation algorithm identifies the form elements and calls the functions of the speech recognition engine or the TTS engine to execute the tags. If the application requires dynamic data to be extracted from the database, it sends a request to the Servlet program. Servlets make use of database connectivity to supply the necessary data to the voice browser. The voice browser uses the TTS engine to convert this data into voice form, which is spoken out loud. Servlets run on a web application server; the application uses Java Web Server for this purpose. Database information is stored in the form of tables in the database server. The Oracle 8i database server is found to be efficient and easy to use for storing the data.

Fig 4.1 Application components and data flow

Components: Voice Server (speech recognition engine, text-to-speech engine, VoiceXML speech browser, DTMF simulator), Web Server (VXML applications), Database Server (enterprise database, tables).

Data flows: 1. Voice in, 2. Audio or synthesized speech output, 3. VoiceXML via HTTP over LAN or Internet, 4. DTMF in, 5. Database connectivity.

4.2 Application Design and Development

The interactive voice response system is designed to provide the following information.

(i). Information regarding the institute establishment and the institute profile.

(ii). Information regarding the students of IIITM.

(iii). Information regarding the IIITM faculty.

(iv). Email reader facility.

(v). Eligibility and selection criteria for various courses of IIITM for the students and faculty.

(vi). Special announcements regarding the declaration of results for selected students, and achievements of the institute regarding the summer and final placements of the students.

(vii). Exit from the site, if the user wants to come out of the system at any stage.

For development, VXML and JSGF are used at the front end, and Servlets and Java at the back end, in order to form a robust and flexible system. Grammar files are written in the Java Speech Grammar Format (JSGF), and the back end relies on threads and JDBC concepts.

4.3 Development tools include

VXML (tag language used to interact with the user), Java (for server-side scripting), Servlets (server-side programming), JSGF (grammar to recognize the user's voice input), Java Speech API (to validate the user input and support VXML tags), Oracle database server (to store the data in the form of tables), Java Web Server (to run the Servlets), and voice server (provides tools to recognize the voice and generate synthesized voice output).

4.3.1 VoiceXML

VXML is an XML-based markup language for creating distributed voice applications, much as HTML is a markup language for creating distributed visual applications. VoiceXML supports dialogs that feature spoken input, DTMF (telephone key) input, recording of spoken input, synthesized speech output ("text-to-speech"), and pre-recorded audio output. VoiceXML makes building speech applications easier, in the same way that HTML simplifies building visual applications.

A voice application is made up of three kinds of files. VoiceXML files define the voice user interaction and dialog flow control. Grammar files define the valid commands that are allowed during the voice interaction; a grammar can be defined at the development stage or generated dynamically at run time. Audio files are prerecorded audio files that are played back, or recordings of the user’s input.

The VoiceXML language provides features for four major components of the Voice Web: voice dialogs, platform control, telephony, and performance. Each VoiceXML document consists of one or more dialogs. The dialog features cover the collection of input, generation of audio output, handling of asynchronous events, performance of client-side scripting, and dialog continuation. Telephony features include simple connection control (call transfer, add third party, call disconnect) and telephony information such as Automatic Number Identification (ANI) and Dialed Number Identification Service (DNIS).

VoiceXML Concepts

An application is a set of VoiceXML documents sharing the same application root

document. The application root document remains loaded while the user is

transitioning between other documents in the same application, and it is unloaded

when the user transitions to a document that is not in the application. While it is

loaded, the application root document’s variables are available to the other documents

as application variables, and its grammars can also be set to remain active for the

duration of the application.

The user is always in one conversational state, or dialog, at a time. Each dialog determines which dialog will be transitioned to next. Transitions are specified using URIs (Uniform Resource Identifiers), which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a specific dialog, the first dialog in the document is assumed. The dialog execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
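The two defaulting rules for transition URIs can be sketched as follows. This is an illustration of the rule only, not the browser's implementation; the class name, the example host and the document names are hypothetical.

```java
import java.net.URI;

/** Hypothetical sketch of how a transition URI selects the next document and dialog. */
final class DialogTarget {
    final String document; // absolute URI of the document to run next
    final String dialog;   // dialog id, or null meaning "first dialog in the document"

    DialogTarget(String document, String dialog) {
        this.document = document;
        this.dialog = dialog;
    }

    static DialogTarget resolve(String currentDocument, String reference) {
        int hash = reference.indexOf('#');
        String docPart = hash < 0 ? reference : reference.substring(0, hash);
        String fragment = hash < 0 ? "" : reference.substring(hash + 1);
        // No document part: the current document is assumed.
        String document = docPart.isEmpty()
                ? currentDocument
                : URI.create(currentDocument).resolve(docPart).toString();
        // No fragment: the first dialog of the document is assumed.
        return new DialogTarget(document, fragment.isEmpty() ? null : fragment);
    }

    public static void main(String[] args) {
        DialogTarget t = resolve("http://example.org/ivr/main.vxml", "faculty.vxml#list");
        System.out.println(t.document + " -> " + t.dialog);
    }
}
```

With this sketch, a reference such as "#menu" stays in the current document and selects the dialog named menu, while "faculty.vxml" loads a new document and starts at its first dialog.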

Dialogs are of two kinds: forms and menus. Forms define an interaction that collects

values for a set of field-item variables. Each field may specify a grammar that defines

the allowable inputs for that field. If a form-level grammar is present, it can be used to

fill several fields from one utterance. A menu presents the user with a choice of

options and then transitions to another dialog based on that choice.

Subdialogs are function-like reusable components that can be used for standard

reusable dialog interfaces, like collecting credit card numbers. At the end of execution

of a subdialog, the control returns to the dialog from where it was invoked and returns

the fields that were collected.

Grammars: Each dialog has one or more speech and/or DTMF grammars (valid

commands) associated with it. Each dialog’s grammars are active only when the user

is in that dialog. Some of the dialogs can be flagged to make their grammars active

(i.e., listened for) even when the user is in another dialog in the same document, or on

another loaded document in the same application. In this situation, if the user says

something matching another dialog’s active grammars, the application transitions to a

new dialog, and treats the user’s utterance as if it were said in the new dialog.
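The scoping rule described above can be modelled in a few lines. The sketch below is a simplified, hypothetical model (class, method and dialog names are all assumptions): a dialog's grammars are active when the user is in that dialog, or when the dialog has been flagged to stay active throughout the document.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of which dialogs' grammars are being listened for. */
final class GrammarScopes {
    // dialog id -> true if its grammars stay active throughout the document
    private final Map<String, Boolean> dialogs = new LinkedHashMap<>();

    void addDialog(String id, boolean documentScope) {
        dialogs.put(id, documentScope);
    }

    /** Returns the dialogs whose grammars are active while the user is in currentDialog. */
    Set<String> activeFor(String currentDialog) {
        Set<String> active = new LinkedHashSet<>();
        for (Map.Entry<String, Boolean> e : dialogs.entrySet()) {
            if (e.getKey().equals(currentDialog) || e.getValue()) {
                active.add(e.getKey());
            }
        }
        return active;
    }

    public static void main(String[] args) {
        GrammarScopes scopes = new GrammarScopes();
        scopes.addDialog("main_menu", true);  // flagged: listened for from any dialog
        scopes.addDialog("students", false);
        scopes.addDialog("faculty", false);
        System.out.println(scopes.activeFor("students"));
    }
}
```

In this model, a user who is in the students dialog can still say a main-menu phrase and be taken to the main menu, but cannot trigger the faculty dialog directly.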

Events: VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism. Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element may specify catch elements. Catch elements are also inherited from enclosing elements "as if by copy." In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.
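The effect of inheriting catch elements from enclosing elements can be sketched as a lookup that starts at the innermost scope and walks outward. This is a simplified, hypothetical model (the class name and the handler strings are assumptions); noinput, nomatch and help are used as example event names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of catch-handler inheritance across nested scopes. */
final class CatchScope {
    private final CatchScope parent; // enclosing element, or null at the top level
    private final Map<String, String> handlers = new HashMap<>();

    CatchScope(CatchScope parent) {
        this.parent = parent;
    }

    /** Declares a handler for an event in this scope; returns this for chaining. */
    CatchScope on(String event, String action) {
        handlers.put(event, action);
        return this;
    }

    /** Finds the handler for an event, searching this scope first and then its ancestors. */
    String handlerFor(String event) {
        for (CatchScope s = this; s != null; s = s.parent) {
            String action = s.handlers.get(event);
            if (action != null) return action;
        }
        return null; // no handler declared at any level
    }

    public static void main(String[] args) {
        CatchScope document = new CatchScope(null)
                .on("noinput", "repeat the prompt")
                .on("help", "general help");
        CatchScope field = new CatchScope(document).on("help", "help for this field");
        System.out.println(field.handlerFor("help"));    // handler declared on the field
        System.out.println(field.handlerFor("noinput")); // handler inherited from the document
    }
}
```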

VoiceXML implements a client-server paradigm, where a web server provides

VoiceXML documents that contain dialogs to be interpreted and presented to the user;

the user’s responses are submitted to the web server, which responds by providing

additional VoiceXML documents, as appropriate. VoiceXML allows documents to be requested

and data to be submitted to server scripts using Uniform Resource Identifiers

(URIs). It provides an open application development environment that generates

portable applications. This makes VoiceXML a cost effective alternative for providing

voice access services. It directly supports networked and web-based applications,

meaning that a user at one location can access information or an application provided

by a server at another geographically or organizationally distant location. This

capitalizes on the connectivity and commerce potential of the World Wide Web.

4.3.2 Servlets

Java Servlets are the key component of server-side programming. A servlet is a small pluggable extension to the server that enhances the server's functionality. Servlets are server-side programs that run on web servers to provide the information requested by users. Servlets make use of JDBC to connect to the database where the actual information of the enterprise is stored.

Advantage of Servlets Over CGI

Java servlets are more efficient, easier to use, more powerful, more portable, and

cheaper than traditional CGI and than many alternative CGI-like technologies. (More

importantly, servlet developers get paid more than Perl programmers :-).

(i) Efficient: With traditional CGI, a new process is started for each HTTP request. If

the CGI program does a relatively fast operation, the overhead of starting the process

can dominate the execution time. With servlets, the Java Virtual Machine stays up,

and each request is handled by a lightweight Java thread, not a heavyweight operating

system process.

(ii) Powerful: Java servlets make it easy to do several things that are difficult or impossible with regular CGI. For one thing, servlets can talk directly to the Web server (regular CGI programs can't). Servlets can also share data among each other, making useful things like database connection pools easy to implement. They can also maintain information from request to request, simplifying things like session tracking and caching of previous computations. Servlets are written in Java and follow a well-standardized API. Servlets are supported directly or via a plugin on almost every major Web server.

4.3.3 JSGF (Java speech grammar format)

JSGF provides speech recognition systems with the ability to listen to user speech and determine what is said. The VoiceXML browser requires all grammars to be specified using the Java Speech Grammar Format. The Java™ Speech Grammar Format (JSGF) defines a platform-independent, vendor-independent way of describing one type of grammar, a rule grammar. It uses a textual representation that is readable and editable by both developers and computers, and can be included in Java source code.

A grammar has two components: the grammar header and the grammar body. The grammar header declares the grammar name and lists the imported rules and grammars. The grammar body defines the rules of this grammar as combinations of speakable text and references to other rules.

A simple grammar header might be:

#JSGF V1.0;

grammar citystate;

Here citystate is the “grammar name”

The grammar body consists of one or more rules that define the valid set of utterances.

The syntax for grammar rules is: public <rulename> = options;

where:

public is an optional declaration indicating that the rule can be used as an active rule by the speech recognition engine.

rulename is a unique name identifying the grammar rule.

options can be any combination of text that the user can speak, another rule, and delimiters such as:

| to separate alternatives

[] to enclose optional words, phrases, or rules

() to group words, phrases, or rules

* to indicate that the previous item may occur zero or more times

+ to indicate that the previous item may occur one or more times

For example:

#JSGF V1.0;

grammar employees;

public <name>= Jonathan | Larry | Susan | Melissa;
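To make the role of such a rule concrete, the sketch below checks a recognized utterance against the simplest kind of rule body, a plain list of alternatives separated by |. It is a hypothetical illustration (class and method names are assumptions) and does not handle optional items, grouping, repetition or rule references; a real recognizer matches audio against the grammar rather than comparing text.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical matcher for a JSGF rule body made only of '|'-separated alternatives. */
final class AlternativesRule {
    private final Set<String> phrases = new LinkedHashSet<>();

    /** Accepts a rule body such as "Jonathan | Larry | Susan | Melissa". */
    AlternativesRule(String ruleBody) {
        for (String alternative : ruleBody.split("\\|")) {
            String phrase = normalize(alternative);
            if (!phrase.isEmpty()) phrases.add(phrase);
        }
    }

    /** True if the utterance is one of the declared alternatives (case-insensitive). */
    boolean accepts(String utterance) {
        return phrases.contains(normalize(utterance));
    }

    private static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        AlternativesRule name = new AlternativesRule("Jonathan | Larry | Susan | Melissa");
        System.out.println(name.accepts("larry"));
        System.out.println(name.accepts("Robert"));
    }
}
```

An utterance that is not in the rule, such as "Robert" here, is rejected, which in a VoiceXML application would typically lead to a nomatch event and a reprompt.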

An inline grammar is specified directly in the VXML document, for example:

<grammar>

request | path | query | server | remote user | backup | exit

</grammar>

The VoiceXML browser also uses JSGF as the DTMF grammar format. For example, the following code snippet defines an inline DTMF grammar that allows the user to make a selection by pressing the numbers 1 through 4, the asterisk, or the pound sign on a telephone:

<dtmf type="text/x-jsgf">

1 | 2 | 3 | 4 | "*" | "#"

</dtmf>

4.3.4 Oracle database

The overall information about the institute, the students and the faculty is stored in the Oracle database. Separate tables are created for students, faculty and institute information. Oracle was selected as the data source for its simplicity: it is easy to create, update and delete the data tables using SQL.

4.4 Speech interface design makes use of Prototype software model

The speech user interface should be presented such that it is easy for the user to hear the required information clearly. Considerable effort is put into the different aspects of the design.

4.4.1 Design Methodology

Developing speech user interfaces, like most development activities, involves an

iterative 4-phase process: “Design Phase”, “Prototype Phase”, “Test Phase”,

“Refinement Phase”.

Design Phase: In this phase, the goal is to define the proposed functionality and create an initial design. This involves the following tasks: “Analyzing Your Users”, “Analyzing User Tasks”, “Making High-Level Decisions”, “Making Low-Level Decisions”, “Defining Information Flow”, “Identifying Application Interactions”, and “Planning for Expert Users”.

Prototype Phase: The goal of this phase is to create a prototype of the application,

leaving the design flexible enough to accommodate changes in prompts and dialog

flow in subsequent phases of the design.

For the first iteration, use the technique known as “Wizard of Oz” testing. This

technique can be used before beginning the coding, as it requires only a prototype

paper script and two people: one to play the role of the user, and a human “wizard” to

play the role of the computer system.

Test Phase: After incorporating the results of the “Wizard of Oz” testing, code and test

a working prototype of the application. During this phase, be sure to analyze the

behavior of both new and expert users.

Identifying Recognition Problems: After the Test phase, note any consistent recognition problems. The most common cause of recognition problems is acoustic confusability among the currently active phrases. Sometimes there is nothing one can do when this happens. At other times one can try to correct the problem by using a synonym for one of the terms. For example, if the system is confusing ‘no’ and ‘new’, one may be able to replace ‘new’ with ‘recent’, depending on the application’s context.

Refinement Phase: During this phase, update the user interface based on the results of testing the prototype. For example, revise prototype scripts, add tapered prompts and customizable expertise levels, create dialogs for inter- and intra-application interactions, and prune out dialogs that were identified as potential sources of user interface breakdowns. Finally, iterate the Design, Prototype, Test and Refine process, including the Test phase.

4.5 IVR Development Aspects

The following are to be developed in order to build any IVR application.

(i) Create the necessary VXML files to understand user input. Using JSGF, create a series of speech recognition grammars defining the words and phrases that can be spoken by the user, and specify where each grammar should be active within the application.

(ii) Pass the parameters collected from the user to servlets by specifying the URI. Uniform Resource Identifiers are used to specify the path where the Servlet is located. These URIs are specified in the <submit> tag, which submits the parameters to the Servlets.

(iii) Create servlets using the Servlet API to receive the parameters from the <submit> tags of the VXML files.

(iv) Use JDBC in the Servlets to connect to the database tables.

(v) Collect the information from the database and pass it through VXML tags like <prompt> or <block>, which can read the text out loud.

The same procedure is applied to develop the code that processes the different options
chosen by the user. Special call-recognition tags are used to deploy the application in a
real-time environment. In an ordinary PSTN network, the central office is responsible for
generating the dial tone and establishing a connection between the source and destination
devices.
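Steps (ii) and (v) of the development list in this section amount to producing two small pieces of VoiceXML: a <submit> element that sends the collected fields to a servlet, and a document whose <prompt> speaks the text read from the database. A minimal sketch of both follows. It is not the thesis code: the class name VxmlResponse and its methods are hypothetical, the document layout assumes VoiceXML 1.0, and the Servlet API and JDBC calls are left out so that the example needs only the standard library. In a servlet, the string returned by speak would be written to the HTTP response.

```java
/** Builds the VoiceXML fragments exchanged between the VXML pages and the servlets. */
final class VxmlResponse {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Builds a <submit> element that sends the named fields to a servlet (step ii). */
    static String submitTag(String servletUri, String... fieldNames) {
        return "<submit next=\"" + escape(servletUri) + "\" namelist=\""
                + escape(String.join(" ", fieldNames)) + "\"/>";
    }

    /** Wraps text fetched from the database in a one-prompt VoiceXML document (step v). */
    static String speak(String text) {
        return "<?xml version=\"1.0\"?>\n"
             + "<vxml version=\"1.0\">\n"
             + "  <form>\n"
             + "    <block>\n"
             + "      <prompt>" + escape(text) + "</prompt>\n"
             + "    </block>\n"
             + "  </form>\n"
             + "</vxml>\n";
    }
}
```

Escaping matters here because database text may contain characters such as & or <, which would otherwise make the generated document invalid XML and stop the voice browser from parsing it.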

A gateway emulates a central office by providing signaling (dial tone, call set-up and so
on, using H.323, MGCP or SS7), conversion to IP (often over Ethernet), compression (G.711,
G.723.1 etc.), echo cancellation and Quality of Service (QoS).

When a user places a call to the voice server from a telephone or cell phone, the voice
server automatically recognizes the call with the help of the VoIP gateway and starts
executing the application root document. The user makes a choice after hearing the options
offered by the application. The speech recognition engine processes the user's incoming
audio signal and compares its sound patterns to the patterns of basic spoken sounds,
trying to determine the most probable combination that represents the audio input.
Finally, the speech recognition engine compares the sounds to the list of words and
phrases in the active grammar(s). Only words and phrases in the active grammars are
considered as possible results. With present technologies, understanding long sentences
is quite difficult compared to short phrases.
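The rule that only phrases in the active grammars can be returned may be pictured as a filter over the recognizer's candidate results. The sketch below is a simplification made for illustration: a real engine constrains its search with the grammar rather than filtering afterwards, and the Hypothesis type, its scores and the method names are all hypothetical.

```java
import java.util.List;
import java.util.Set;

/** Picks the most likely recognition result that the active grammar allows. */
final class GrammarMatcher {

    /** One recognizer hypothesis: candidate text and a likelihood score (higher is better). */
    static final class Hypothesis {
        final String text;
        final double score;

        Hypothesis(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /**
     * Returns the best-scoring hypothesis whose text is in the active grammar,
     * or null when none is, which corresponds to a "no match" outcome.
     * The phrases in activePhrases are assumed to be stored in lower case.
     */
    static String bestMatch(List<Hypothesis> hypotheses, Set<String> activePhrases) {
        Hypothesis best = null;
        for (Hypothesis h : hypotheses) {
            boolean allowed = activePhrases.contains(h.text.toLowerCase());
            if (allowed && (best == null || h.score > best.score)) best = h;
        }
        return best == null ? null : best.text.toLowerCase();
    }
}
```

A higher-scoring candidate that is outside the grammar is ignored, which is why a small, well-separated grammar gives more reliable results than an attempt to recognize free-form sentences.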

4.6 Deployment Procedure

The following procedure is to be followed to run the application. The prerequisites below
should be satisfied before deploying it.

- Operating system: Windows NT 4.0.
- A properly configured sound card.
- A headphone with a microphone, used to simulate the telephone in the desktop environment.
- JRE 1.2.2 provided by IBM, to configure the JVM.
- The voice server, with the audio setup configured as directed by the system.
- Java Web Server, or any web server of your choice that can run servlets.
- JDK 1.3.1. Set the path for the package in the system variable “path” and the classpath
  for the package in the environment variable “classpath”. Also copy the jar files
  mail.jar, pop.jar, activation.jar and jsdk.jar into the lib directory of the JDK 1.3.1
  directory.

After the prerequisites are met, copy all the class files into the servlets directory of
the Java Web Server. Copy all the VXML files into a folder named “Thesiscode” on the C:
partition of the hard disk.

To run the application root document in the simulated desktop environment, open the
command prompt and go to the directory where the VXML documents are stored. From there,
run the voice browser on the root document by typing:

    “path for voice browser i.e. vsaudio” root_iiitm.vxml

E.g.: if the voice server is installed on the C: partition, the command is:

    “c:\voices~1\bin\vsaudio” root_iiitm.vxml

To run the application in text mode, replace “vsaudio” with “vstext”. This is mostly
useful when debugging the application. Once the command is executed, the application runs
in a user-friendly manner, so that the user can easily identify where he is in the
application.

4.6.1 How the system works to provide information

A user who enters the system first hears a welcome message and is then prompted with the
options from a menu. The menu has the 7 options mentioned above, and the user has to
choose one of them. The system determines whether what the user said is one of the options
it has offered. This is done by defining a grammar for the options; how to develop the
grammar is described in the coming chapters. If the utterance matches the grammar, the
system moves to the specified document or dialog. If not, the application generates an
event (a no-match event, a silence event or a help event) and re-prompts the user to
select a proper option. A user who is unable to choose an option can ask for help, and the
menu selection is then explained in a more understandable manner. The system disconnects
the call if the user fails to provide input. After control is transferred to the new
document or dialog, the user is given the requested information and is then offered the
same set of options as before. To move on to another document, the user selects an option
again and obtains the information on that topic.
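The menu behaviour described above (move on for a valid option, re-prompt on a no-match or silence event, offer help, and finally disconnect) can be summarized as a small state machine. The sketch below is an illustration under assumptions that the text does not state: the number of failures allowed before disconnecting is a parameter, no-match and silence are counted together, and the class MenuDialog and its Outcome values are hypothetical names. In the actual application this logic is expressed with VXML event handlers rather than Java.

```java
import java.util.HashSet;
import java.util.Set;

/** Decides what the menu should do with each user utterance. */
final class MenuDialog {

    enum Outcome { GOTO_OPTION, REPROMPT, HELP, DISCONNECT }

    private final Set<String> options = new HashSet<String>();
    private final int maxFailures;  // assumed limit; the thesis does not give a number
    private int failures = 0;

    MenuDialog(Iterable<String> menuOptions, int maxFailures) {
        for (String option : menuOptions) options.add(option.toLowerCase());
        this.maxFailures = maxFailures;
    }

    /**
     * Handles one turn. A null or empty utterance stands for silence,
     * and any text outside the menu stands for a no-match event.
     */
    Outcome handle(String utterance) {
        String said = utterance == null ? "" : utterance.trim().toLowerCase();
        if (said.equals("help")) return Outcome.HELP;
        if (options.contains(said)) {
            failures = 0;  // a valid choice resets the failure count
            return Outcome.GOTO_OPTION;
        }
        failures++;
        return failures >= maxFailures ? Outcome.DISCONNECT : Outcome.REPROMPT;
    }
}
```

With a limit of three, two consecutive failures lead to re-prompts and the third ends the call, while any valid selection in between starts the count again.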

E.g.: Suppose one wants information about the establishment of the institute. It can be
obtained by first selecting the institute information option and then the establishment
option. After hearing the establishment information, the user is again given a set of
choices. To then get information about the selection of MTech students, select the
students selection criteria option and choose MTech students.

Selecting the first option provides information about the institute. The user is asked
which information he would like: the institute's establishment, facilities, profile of the
institute, students database or faculty database. By providing details such as the student
name and group, the user can hear the complete details of that particular student.

Selecting the second option provides information about the recruitment process for
students and faculty at IIITM. The user selects the group of interest, such as MTech, MBA
or IPG, and hears the recruitment procedure for that branch.

Selecting the third option announces the achievements of the institute, including summer
placement information, final placement information and the cultural events held at the
institute every year.

Selecting the fourth option tells which student was selected by which company for summer
and final placements. For this the user has to supply the student name and group.

To use the fifth option, the user must first register with the system as a member by
opening http://127.0.0.1:8080/registration.html in Internet Explorer or Netscape. After
filling in the required information and submitting it to the server, the user receives a
congratulatory message along with a PIN, which is to be supplied to the system later. This
PIN is used to check mail through the email reader. Every member should have a POP mail
account on a POP mail server such as those of Yahoo or Hotmail. The POP account
information (user id and password) must be given to the system so that it can later
connect to the account, fetch new mail information and read out the mails intended for the
user. When asked, the user supplies the PIN, hears how many new mails have arrived and can
have the mails of interest read out.

4.6.2 Practical issues faced for deployment of the IVR system

(i) VoIP gateway: Developing a VoIP gateway requires a lot of infrastructure, such as DSP
modules, and the implementation of protocols such as SIP and MGCP, which is practically
impossible to complete within the short period of dissertation work.

(ii) Voice server: A voice server is equipped with a voice browser, a TTS engine and a
speech recognition engine. Because time was short, I had to use the voice server developed
by IBM. It was not flexible in its functioning, as it was still under development. I made
use of some of the functions incorporated in the voice server to develop the voice
application. As a VoIP gateway is not available at present, it is difficult to adapt this
application to the PSTN. Hence I simulated it in the desktop environment.

(iii) Dealing with speech recognition errors: There are three basic types of recognition
errors.

- The speech recognition engine returns a result that does not match what the user
  actually said. This can have many causes, including:
    - The audio quality is poor.
    - Multiple choices in the active grammars sound similar, such as “Newark” and
      “New York” in a grammar of United States airports.
    - The user utterance was not in any of the active grammars, but something from an
      active grammar sounded similar.
    - The user has a strong or unusual accent.
    - The user paused before finishing the intended utterance.
- The speech recognition engine did not understand what the user said well enough to
  return anything at all. This type of error can occur in situations similar to those
  described above.

All these practical issues were taken into consideration in developing the application.

4.6.3 Security issues taken into consideration in deploying the IVR

Security in voice applications can be implemented at two different levels. One is at the

infrastructure level, involving the telephony network and Internet infrastructure. Most

VoiceXML browsers support the existing Web security infrastructure. They support

SSL and cookies to help manage security between the voice server and the Web

server. Communications may be secured with authentication, encryption, and data

integrity measures using existing telephony security technologies. Second is at the

application level, which can be implemented in any of the following three ways:

(i) The user id/password approach, in which the application prompts for a user id and PIN
code. In most cases the user is asked to key in the entries instead of speaking them (to
avoid being overheard).

(ii) The telephone number identifies the user. In this approach the user simply enters his
PIN code, which reduces the complexity. This is the approach implemented in this
application. Most VoiceXML interpreters can identify the incoming phone number.

(iii) Speech verification (voice biometrics) authenticates the user and removes the need
for PIN-based verification. Here, voiceprint samples are stored in the database when the
account is set up and are compared against the caller's voice at the time of
authentication.
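Approach (ii), in which the caller's telephone number serves as the user id and only the PIN is entered, can be sketched as a lookup keyed by the incoming number. The code below is an illustration rather than the thesis implementation: the class PinAuthenticator and its in-memory map are hypothetical stand-ins for the database table, and the unsalted SHA-256 hash is a simplification (a deployed system would use a salted, deliberately slow hash, since PINs are short).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Authenticates a caller by incoming phone number plus keyed-in PIN. */
final class PinAuthenticator {

    // Phone number -> hash of the PIN; stands in for a database table.
    private final Map<String, byte[]> pinHashByCaller = new HashMap<String, byte[]>();

    private static byte[] hash(String pin) {
        try {
            return MessageDigest.getInstance("SHA-256")
                                .digest(pin.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Stores the PIN hash for a caller at registration time. */
    void register(String callerNumber, String pin) {
        pinHashByCaller.put(callerNumber, hash(pin));
    }

    /** Returns true only if the caller is known and the PIN matches. */
    boolean authenticate(String callerNumber, String pin) {
        byte[] expected = pinHashByCaller.get(callerNumber);
        return expected != null && pin != null
                && MessageDigest.isEqual(expected, hash(pin));
    }
}
```

Because the phone number is supplied by the network rather than by the user, the dialog needs only one prompt, which is the reduction in complexity mentioned in (ii).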

Chapter 5

5. Conclusion and Future scope of the work

5.1 Conclusion

The interactive voice response system developed here makes use of the latest speech
recognition engines so that the speech user interface recognizes the human voice
efficiently. It promises a friendly user interface, as every stage of the interaction was
designed carefully using VXML, a powerful voice markup language.

IVR gives users more options regarding when, where and how they use Internet services. It
uses speech, the most natural form of communication, and the existing, familiar global
telephone network, the most pervasive communications network, and it enables eyes-free
and hands-free operation. This new mode of access promises to further accelerate the
growth and maturity of Internet services.

Improvements were made in the following aspects.

5.1.1 Minimised fetching delays

The code is written so that the caller hears a minimum of 'dead air' while the system
fetches resources. VoiceXML provides several facilities to either eliminate or hide the
delays associated with retrieving Web resources. To minimize delays, the system maintains
a cache for VoiceXML documents, audio files and other files used by applications.
Normally, once the system has fetched a file over the Internet, it keeps a copy in the
cache. If the application requests the file again, the system uses the cached copy. This
is known as fast caching. Sometimes, even when a file is in the cache, the system should
check for a newer version of the file on the server from which it was originally fetched.
This is known as safe caching.
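The difference between fast and safe caching can be made concrete with a small cache that either trusts its copy or first asks the origin server whether the file has changed. The sketch below is illustrative only: DocumentCache, its Origin interface and the use of a last-modified timestamp are assumptions made for the example, and the voice browser's real cache is internal to the voice server.

```java
import java.util.HashMap;
import java.util.Map;

/** A document cache that supports the fast and safe policies described above. */
final class DocumentCache {

    enum Mode { FAST, SAFE }

    /** Hypothetical view of the server that a file was originally fetched from. */
    interface Origin {
        long lastModified(String url);
        String fetch(String url);
    }

    private static final class Entry {
        final String body;
        final long lastModified;

        Entry(String body, long lastModified) {
            this.body = body;
            this.lastModified = lastModified;
        }
    }

    private final Origin origin;
    private final Map<String, Entry> entries = new HashMap<String, Entry>();

    DocumentCache(Origin origin) {
        this.origin = origin;
    }

    String get(String url, Mode mode) {
        Entry cached = entries.get(url);
        if (cached != null) {
            // Fast caching: use the copy without contacting the server.
            if (mode == Mode.FAST) return cached.body;
            // Safe caching: use the copy only if the server has nothing newer.
            if (origin.lastModified(url) <= cached.lastModified) return cached.body;
        }
        long stamp = origin.lastModified(url);
        Entry fresh = new Entry(origin.fetch(url), stamp);
        entries.put(url, fresh);
        return fresh.body;
    }
}
```

Fast mode gives the shortest delay for prompts and documents that rarely change, while safe mode costs one extra check per request and suits content that is regenerated, such as grammar files built from the database.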

5.1.2 New way of Grammar development

Grammars are generated by servlets from data collected from the database. This is a more
efficient way of developing grammar rules than the ordinary way of implementing grammars
with reusable components. Reusable components are files that try to specify every probable
input from the user by listing all the combinations of alphanumeric characters. Instead,
servlets are written to collect the required data from the database and form a grammar
file from it. A thread checks for new data entered into the data table. It runs every 10
seconds and regenerates the grammar files every time the table is updated.
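The grammar generation described above can be sketched in two parts: a function that turns a list of phrases into a JSGF grammar, and a task that re-runs it every 10 seconds and rewrites the grammar only when it has changed. This is a sketch under assumptions, not the thesis code: the PhraseSource and GrammarSink interfaces are hypothetical stand-ins for the JDBC query and the file writer, the grammar and rule names are invented, and phrases are assumed to be plain words that need no quoting in JSGF.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Builds a JSGF grammar from database phrases and refreshes it periodically. */
final class GrammarGenerator {

    /** Hypothetical source of phrases, e.g. a JDBC query over the student table. */
    interface PhraseSource { List<String> currentPhrases(); }

    /** Hypothetical destination, e.g. the grammar file read by the VXML pages. */
    interface GrammarSink { void write(String jsgf); }

    /** Renders the phrases as alternatives of a single public JSGF rule. */
    static String toJsgf(String grammarName, String ruleName, List<String> phrases) {
        StringBuilder sb = new StringBuilder("#JSGF V1.0;\n");
        sb.append("grammar ").append(grammarName).append(";\n");
        sb.append("public <").append(ruleName).append("> = ");
        if (phrases.isEmpty()) {
            sb.append("<VOID>");  // JSGF special rule that can never be spoken
        } else {
            for (int i = 0; i < phrases.size(); i++) {
                if (i > 0) sb.append(" | ");
                sb.append(phrases.get(i));
            }
        }
        return sb.append(";\n").toString();
    }

    /** Polls the source every 10 seconds and writes the grammar when it changes. */
    static ScheduledExecutorService startRefreshing(final PhraseSource source,
                                                    final GrammarSink sink) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            private String last = null;

            public void run() {
                String jsgf = toJsgf("students", "name", source.currentPhrases());
                if (!jsgf.equals(last)) {
                    sink.write(jsgf);
                    last = jsgf;
                }
            }
        }, 0, 10, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

The caller keeps the returned scheduler and calls shutdown() on it when the web application stops. Comparing against the last generated text avoids rewriting the file on every tick when the table has not changed.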

5.1.3 Email reader

The email reader lets the user simply dial a number from a telephone or cell phone and
listen to his emails. This is a cost-effective and efficient way of checking email,
especially for users who are always on the move.

5.2 Future Scope of the work

Interactive voice response websites require a lot of infrastructure, such as speech
recognition engines and voice servers. IVR applications mostly serve the untapped market
of mobile and telephone users, for whom they are a cost-effective way of carrying out
transactions. To make this possible a VoIP gateway is required. As a gateway takes a lot
of time to develop, I simulated the application in the desktop environment. Improvements
can be made at various stages of the application, as mentioned below.

(i) One can develop a more efficient and user-friendly interface than the existing one.

(ii) One can develop a VoIP gateway and so make my dream of deploying the application in a
real-time environment come true.

(iii) One can introduce more sophisticated speech recognition technologies and make
speech recognition more accurate than it is now.

Abbreviations

1. VXML - Voice Extensible Markup Language.
2. VSDK - Voice Server Development Kit.
3. PSTN - Public Switched Telephone Network.
4. IP - Internet Protocol.
5. DTMF - Dual-Tone Multi-Frequency.
6. JSGF - Java Speech Grammar Format.
7. JSML - Java Speech Markup Language.
8. URI - Uniform Resource Identifier.
9. ‘Wizard of Oz’ - A prototype model of IVR development.

List of Figures

1. Fig 4.1 - System components and data flow - 24
2. Fig 2.1 - IVR network architecture of TJNET - 13
3. Fig 3.1 - Voice Web architecture - 17
4. Fig 3.1.2 - VoIP gateway - 19
