
PROJECT REPORT 2015-2016

NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY

(An Autonomous Institute, Affiliated to Visvesvaraya Technological University,

Belgaum, Approved by AICTE & Govt. of Karnataka)

ACADEMIC YEAR 2015-16

Final Year Project Report on

“Providing voice enabled gadget assistance to inmates of old age homes (vriddhashrama), including physically disabled people.”

Submitted in partial fulfillment of the requirement for the award of the degree of

BACHELOR OF ENGINEERING

Submitted by

ABHIRAMI BALARAMAN (1NT12EC004), AKSHATHA P (1NT12EC013)

INTERNAL GUIDE:

Ms. Kushalatha M R

(Assistant Professor)

Department of Electronics and Communication Engineering NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY

Yelahanka, Bangalore-560064


NITTE MEENAKSHI INSTITUTE OF TECHNOLOGY

(An Autonomous Institute, Affiliated to VTU, Belgaum, Approved by AICTE &

State Govt. of Karnataka), Yelahanka, Bangalore-560064

Department Of Electronics And Communication Engineering

CERTIFICATE

Certified that the Project Work entitled “Providing voice enabled gadget assistance to inmates of old age homes (vriddhashrama), including physically disabled people”, guided by IISc, was carried out by Abhirami Balaraman (1NT12EC004) and Akshatha P (1NT12EC013), bonafide students of Nitte Meenakshi Institute of Technology, in partial fulfillment for the award of Bachelor of Engineering in Electronics and Communication of the Visvesvaraya Technological University, Belgaum, during the academic year 2015-2016. The project report has been approved as it satisfies the academic requirement in respect of project work for completion of the autonomous scheme of Nitte Meenakshi Institute of Technology for the above said degree.

Signature of the Guide Signature of the HOD

(Ms.Kushalatha M R) (Dr. S. Sandya)

External Viva

Name of the Examiners Signature with Date

……………………………… …............................................


ACKNOWLEDGEMENT

We express our deepest thanks to our Principal, Dr. H. C. Nagaraj, and to Dr. N. R. Shetty, Director, Nitte Meenakshi Institute of Technology, Bangalore, for allowing us to carry out the industrial training and supporting us throughout.

We also thank the Indian Institute of Science for giving us the opportunity to carry out our internship project in their esteemed institution and for giving us all the support we needed to carry the idea forward as our final year project.

We express our deepest thanks to Dr. Rathna G N for her guidance, her part in useful decisions, the equipment she provided for the project, and for helping us progress it into our final year project. We take this moment to acknowledge her contribution gratefully.

We also express our deepest thanks to our HOD, Dr. S. Sandya, for allowing us to carry out our industrial training and for helping us in every way so that we could gain practical experience of the industry. We also take this opportunity to thank Ms. Kushalatha M R [Asst. Prof., ECE Dept.] for guiding us on the right path and being of immense help to us.

Finally, we thank all the others, unnamed, who helped us in various ways to gain knowledge and have a good training.


ABSTRACT

Speech recognition is one of the most rapidly developing fields of research at both the industrial and scientific levels. Until recently, the idea of holding a conversation with a computer seemed pure science fiction. If you asked a computer to “open the pod bay doors”, well, that happened only in movies. But things are changing, and quickly. A growing number of people now talk to their mobile smart phones, asking them to send e-mail and text messages, search for directions, or find information on the Web. Our project aims at one such application. The project was designed keeping in mind the various categories of people who suffer from loneliness due to the absence of others to care for them, especially those who are under cancer treatment and old aged people. The system will provide interaction and entertainment and control appliances such as the television on voice commands.


LITERATURE SURVEY

Books are available to read and learn about speech recognition; these enabled us to see what happens beyond the code.

Claudio Becchetti and Lucio Prina Ricotti, “Speech Recognition: Theory and C++ Implementation”, 2008 edition. From this book we learned how to write and implement the C++ code.

A recent comprehensive textbook, “Fundamentals of Speaker Recognition” by Homayoon Beigi, is an in-depth source for up-to-date details on the theory and practice.

A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).

“Automatic Speech Recognition: A Deep Learning Approach” (Publisher: Springer), written by D. Yu and L. Deng and published near the end of 2014, provides highly mathematically-oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods. This gave us an insight into the conversion algorithm used by Google.

Here are some IEEE and other articles we referred to:

Waibel, Hanazawa, Hinton, Shikano, Lang (1989). "Phoneme recognition using time-delay neural networks". IEEE Transactions on Acoustics, Speech and Signal Processing.

Reynolds, Douglas; Rose, Richard (January 1995). "Robust text-independent speaker identification using Gaussian mixture speaker models" (PDF). IEEE Transactions on Speech and Audio Processing (IEEE) 3 (1): 72-83. doi:10.1109/89.365379. ISSN 1063-6676. OCLC 26108901. Retrieved 21 February 2014.


SURVEY QUESTIONNAIRE CONDUCTED IN AN OLD AGE HOME:

1. What is the total number of people in this old age home?
Ans. There are 22 people of age above 70.

2. What are the facilities available to you?
Ans. All basic needs are provided.

3. Is 24/7 medical assistance available for someone who is bedridden?
Ans. No, there is no 24/7 nursing.

4. What are the technological facilities provided at your organization for entertainment purposes?
Ans. There was only a television in each room.

5. Do you have access to computers, internet and mobile phones at your organization?
Ans. No, we are not aware of how to use all of it. Also, it is expensive.

6. What are the changes you would like to have in your daily routine?
Ans. The routine is monotonous, so we would like means to pass time, like playing games or learning anything based on our interest.

7. Do you think our project is helpful to you?
Ans. Yes. It provides us entertainment and keeps us engaged so we do not feel bored. Speech activation is very helpful for us as it is easy to use, especially for the TV.

8. Any suggestions?
Ans. Add books and scriptures, since our eyes get weak with age. Add quiz games so that we can improve our knowledge. We need something that can train us in learning new languages or anything based on our interest, without using the internet.


CONTENTS

1. INTRODUCTION
2. OUR OBJECTIVE
3. SYSTEM REQUIREMENTS
   3.1 HARDWARE COMPONENTS
   3.2 SOFTWARE REQUIRED
4. IMPLEMENTATION
   4.1 ALGORITHMS
   4.2 SETTING UP RASPBERRY PI
   4.3 DOWNLOADING OTHER SOFTWARE
   4.4 SETTING UP LIRC
   4.5 WORKING OF IR LED
   4.6 FLOWCHART
   4.7 BLOCK DIAGRAM
5. FURTHER ENHANCEMENTS
6. APPLICATIONS
7. REFERENCES


LIST OF FIGURES

1. Block diagram of WATSON recognition system
2. Raspberry Pi Model B
3. Sound Card (Quantum)
4. Collar Mic
5. IR LED
6. IR Receiver
7. PN2222
8. The Raspbian Desktop
9. Jasper client
10. Schematic
11. Flowchart
12. Block Diagram of System
13. GSM Quadband 800A
14. Home automation possibilities
15. Car automation


CHAPTER 1

INTRODUCTION TO SPEECH RECOGNITION

In computer science and electrical engineering, speech recognition (SR) is the

translation of spoken words into text. It is also known as "automatic speech recognition"

(ASR), "computer speech recognition", or just "speech to text" (STT).

Some SR systems use "training" (also called "enrolment") where an individual speaker

reads text or isolated vocabulary into the system. The system analyzes the person's

specific voice and uses it to fine-tune the recognition of that person's speech, resulting

in increased accuracy. Systems that do not use training are called "speaker

independent"[1] systems. Systems that use training are called "speaker dependent".

Speech recognition applications include voice user interfaces such as voice dialling

(e.g. "Call home"), call routing (e.g. "I would like to make a collect call"), domotic

appliance control, search (e.g. find a podcast where particular words were spoken),

simple data entry (e.g., entering a credit card number), preparation of structured

documents (e.g. a radiology report), speech-to-text processing (e.g., word processors or

emails), and aircraft (usually termed Direct Voice Input).

The term voice recognition[2][3][4] or speaker identification[5][6] refers to identifying the

speaker, rather than what they are saying. Recognizing the speaker can simplify the task of

translating speech in systems that have been trained on a specific person's voice or it can

be used to authenticate or verify the identity of a speaker as part of a security process.

From the technology perspective, speech recognition has a long history with several waves of


major innovations. Most recently, the field has benefited from advances in deep learning

and big data. The advances are evidenced not only by the surge of academic papers

published in the field, but more importantly by the world-wide industry adoption of a

variety of deep learning methods in designing and deploying speech recognition

systems. These speech industry players include Microsoft, Google, IBM, Baidu (China),

Apple, Amazon, Nuance and iFlyTek (China), many of which have publicized that the core technology in their speech recognition systems is based on deep learning.

Fig 1. WATSON block diagram

Now the rapid rise of powerful mobile devices is making voice interfaces even more

useful and pervasive.

Jim Glass, a senior research scientist at MIT who has been working on speech interfaces

since the 1980s, says today’s smart phones pack as much processing power as the

laboratory machines he worked with in the ’90s. Smart phones also have high-bandwidth

data connections to the cloud, where servers can do the heavy lifting involved with both

voice recognition and understanding spoken queries. “The combination of more data and

more computing power means you can do things today that you just couldn’t do before,”

says Glass. “You can use more sophisticated statistical models.”

The most prominent example of a mobile voice interface is, of course, Siri, the voice-

activated personal assistant that comes built into the latest iPhone. But voice functionality is

built into Android, the Windows Phone platform, and most other mobile systems, as well as

many apps. While these interfaces still have considerable limitations (see Social

Intelligence), we are inching closer to machine interfaces we can actually talk to.


In 1971, DARPA funded five years of speech recognition research through its Speech

Understanding Research program with ambitious end goals including a minimum

vocabulary size of 1,000 words. BBN, IBM, Carnegie Mellon and Stanford Research

Institute all participated in the program.[11] The government funding revived speech

recognition research that had been largely abandoned in the United States after John

Pierce's letter. Despite the fact that CMU's Harpy system met the goals established at the

outset of the program, many of the predictions turned out to be nothing more than hype, disappointing DARPA administrators. This disappointment led to DARPA not continuing the

funding.[12] Several innovations happened during this time, such as the invention of beam

search for use in CMU's Harpy system.[13] The field also benefited from the discovery of

several algorithms in other fields such as linear predictive coding and cepstral analysis.


CHAPTER 2

OUR OBJECTIVE

The system provides information and entertainment to otherwise solitary people, and hence acts as a personal assistant.

People with disabilities can benefit from speech recognition programs. For individuals

that are Deaf or Hard of Hearing, speech recognition software is used to automatically

generate a closed-captioning of conversations such as discussions in conference

rooms, classroom lectures, and/or religious services.[4]

Speech recognition is also very useful for people who have difficulty using their hands, ranging

from mild repetitive stress injuries to involved disabilities that preclude using conventional

computer input devices. In fact, people who used the keyboard a lot and developed RSI became

an urgent early market for speech recognition.[6] Speech recognition is used in deaf telephony,

such as voicemail to text, relay services, and captioned telephone. Individuals with learning

disabilities who have problems with thought-to-paper communication (essentially they think of

an idea but it is processed incorrectly causing it to end up differently on paper) can possibly

benefit from the software but the technology is not bug proof.[7] Also the whole idea of speak to

text can be hard for intellectually disabled person's due to the fact that it is rare that anyone tries

to learn the technology to teach the person with the disability.[8]

Being bedridden can be very difficult for many patients to adjust to and it can also cause other

health problems as well. It is important for family caregivers to know what to expect so that they

can manage or avoid the health risks that bedridden patients are prone to. In this article we

would like to offer some information about common health risks of the bedridden patient and

some tips for family caregivers to follow in order to try and prevent those health risks.

Depression is also a very common health risk for those that are bedridden because they are

unable to care for themselves and maintain the social life that they used to have. Many seniors

begin to feel hopeless when they become bedridden but this can be prevented with proper care.

Family caregivers should make sure that they are caring for their loved one’s social and

emotional needs as well as their physical needs. Many family caregivers focus only on the

physical needs of their loved ones and forget that they have emotional and social needs as well.

Family caregivers can help their loved ones by providing them with regular social activities and

arranging times for friends and other family members to come over so that they will not feel lonely and forgotten. Family caregivers can also remind their loved ones that being bedridden

does not necessarily mean that they have to give up everything they used to enjoy.[10]

But since family members won't always be available at home, the above-mentioned problems are still prevalent in these patients; hence our interactive system will provide them with entertainment (music, movies) and voice responses to general questions. Therefore it behaves as an electronic companion.


CHAPTER 3

SYSTEM REQUIREMENTS

The project needs both hardware and software components. The hardware components include the Raspberry Pi Model B, keyboard, mouse, earphones, microphone with sound card, Ethernet cable, HDMI screen and HDMI cable. The software components are the Raspbian OS on an SD card, a C++ compiler and the online resources Google Speech API and Wolfram Alpha. They are described in detail below.

3.1 HARDWARE COMPONENTS

1. RASPBERRY PI MODEL B

The Raspberry Pi is a series of credit card–sized single-board computers developed in

the UK by the Raspberry Pi Foundation with the intention of promoting the teaching of

basic computer science in schools.[5][6][7]

The system is developed around an ARM microprocessor (ARM is a registered trademark of ARM Limited). Linux now provides support for the ARM11 family of processors; it gives consumer device manufacturers a commercial-quality Linux implementation along with tools to reduce time-to-market and development costs. The Raspberry Pi is a credit card sized computer development platform based on a BCM2835 system on chip sporting an ARM11 processor, developed in the UK by the Raspberry Pi Foundation. The Raspberry Pi functions as a regular desktop computer when it is connected to a keyboard and monitor. The Raspberry Pi is very cheap and reliable, and has even been used to build Raspberry Pi supercomputer clusters. The Raspberry Pi runs a Linux kernel-based operating system.

The Foundation provides Debian and Arch Linux ARM distributions for download. Tools are available for Python as the main programming language, with support for BBC BASIC (via the RISC OS image or the Brandy Basic clone for Linux), C, C++, Java, Perl and Ruby.

Fig 2.Raspberry Pi Model B


Specifications include:

SoC: Broadcom BCM2835 (CPU, GPU, DSP, SDRAM, one USB port)
CPU: 700 MHz single-core ARM1176JZF-S
GPU: Broadcom VideoCore IV @ 250 MHz; OpenGL ES 2.0 (24 GFLOPS); MPEG-2 and VC-1 (with license); 1080p30 H.264/MPEG-4 AVC high-profile decoder and encoder
Memory (SDRAM): 512 MB (shared with GPU) as of 15 October 2012
USB 2.0 ports: 2 (via the on-board 3-port USB hub)
Video outputs: HDMI (rev 1.3 & 1.4), 14 HDMI resolutions from 640×350 to 1920×1200 plus various PAL and NTSC standards; composite video (PAL and NTSC) via RCA jack
Audio outputs: analog via 3.5 mm phone jack; digital via HDMI and, as of revision 2 boards, I²S
On-board storage: SD / MMC / SDIO card slot
On-board network: 10/100 Mbit/s Ethernet (8P8C) USB adapter on the third/fifth port of the USB hub (SMSC LAN9514-JZX)
Low-level peripherals: 8× GPIO plus the following, which can also be used as GPIO: UART, I²C bus, SPI bus with two chip selects, I²S audio, +3.3 V, +5 V, ground
Power rating: 700 mA (3.5 W)
Power source: 5 V via MicroUSB or GPIO header
Size: 85.60 mm × 56.5 mm (3.370 in × 2.224 in), not including protruding connectors
Weight: 45 g (1.6 oz)

The main differences between the two flavours of Pi are the RAM, the number of USB 2.0 ports and the fact that the Model A doesn't have an Ethernet port (meaning a USB Wi-Fi adapter is required to access the internet). While that results in a lower price for the Model A, it means that a user will have to buy a powered USB hub in order to get it to work for many projects. The Model A is aimed more at those creating electronics projects that require programming and control directly from the command line interface. Both Pi models use the Broadcom BCM2835 CPU, which is an ARM11-based processor running at 700 MHz. There are overclocking modes built in for users to increase the speed as long as the core doesn't get too hot, at which point it is throttled back. Also included is the Broadcom VideoCore IV GPU with support for OpenGL ES 2.0, which can perform 24 GFLOPS and decode and play H.264 video at 1080p resolution. Originally the Model A was due to use 128MB RAM, but this was upgraded to 256MB RAM, with the Model B going from 256MB to 512MB. The power supply to the Pi is via the 5V microUSB socket. As the Model A has fewer powered interfaces it only requires 300mA, compared to the 700mA that the Model B needs. The standard way of connecting the Pi to a display is to use the HDMI port, connected to an HDMI socket on a TV or a DVI port on a monitor. Both HDMI-HDMI and HDMI-DVI cables work well, delivering 1080p (1920x1080) video. Sound is also sent through the HDMI connection, but if using a monitor without speakers there is the standard 3.5mm jack socket for audio. The RCA composite video connection was designed for use in countries where the level of technology is lower and more basic displays such as older TVs are used.

2. SOUND CARD WITH MICROPHONE

A sound card is used since the Raspberry Pi has no on-board ADC.

A sound card (also known as an audio card) is an internal computer expansion card

that facilitates economical input and output of audio signals to and from a computer

under control of computer programs. The term sound card is also applied to external

audio interfaces that use software to generate sound, as opposed to using hardware

inside the PC. Typical uses of sound cards include providing the audio component for

multimedia applications such as music composition, editing video or audio,

presentation, education and entertainment (games) and video projection.

Sound functionality can also be integrated onto the motherboard, using components similar

to plug-in cards. The best plug-in cards, which use better and more expensive components,

can achieve higher quality than integrated sound. The integrated sound system is often still

referred to as a "sound card". Sound processing hardware is also present on modern video

cards with HDMI to output sound along with the video using that connector; previously they

used a SPDIF connection to the motherboard or sound card.


We are using a Quantum sound card and a Huawei collar mic.

A microphone, colloquially nicknamed mic or mike (/ˈmaɪk/),[1] is an acoustic-to-electric transducer

or sensor that converts sound into an electrical signal. Electromagnetic transducers facilitate the

conversion of acoustic signals into electrical signals.[2] Microphones are used in many applications

such as telephones, hearing aids, public address systems for concert halls and public events, motion

picture production, live and recorded audio engineering, two-way radios, megaphones, radio and

television broadcasting, and in computers for recording voice, speech recognition, VoIP, and for non-

acoustic purposes such as ultrasonic checking or knock sensors.

Most microphones today use electromagnetic induction (dynamic microphones),

capacitance change (condenser microphones) or piezoelectricity (piezoelectric

microphones) to produce an electrical signal from air pressure variations. Microphones

typically need to be connected to a preamplifier before the signal can be amplified with

an audio power amplifier and a speaker or recorded.

Fig 3.Sound Card Fig 4. Collar Mic

3. KEYBOARD, MOUSE AND HDMI SCREEN are the other peripherals.

4. 940nm IR LED - Two variants are used: a 20 degree viewing angle LED and a 40 degree viewing angle LED, both bright and tuned to the 940 nm wavelength.

Fig 5. IR Transmitter LED

5. 38 kHz IR RECEIVER - Receives IR signals at remote control frequencies. It combines a photo detector and preamplifier in one package, with high photo sensitivity, improved inner shielding against electrical field disturbance, low power consumption, a suitable burst length of ≥ 10 cycles/burst, TTL and CMOS compatibility, improved immunity against ambient light, and an internal filter for the PCM frequency (Bi-CMOS IC; ESD HBM > 4000 V; MM > 250 V). It is a miniaturized receiver for infrared remote control systems, with a high speed PIN phototransistor and a full wave band preamplifier. Some of its applications are: infrared applied systems, the light detecting portion of remote controls, AV instruments such as audio systems, TVs, VCRs, CD and MD players, CATV set top boxes, other equipment with wireless remote control, home appliances such as air-conditioners and fans, and multimedia equipment.

Fig 6. IR Receiver


6. PN2222 TRANSISTOR - The transistor is used here to help drive the IR LED. It is a general purpose amplifier, model PN2222, with a standard EBC pin-out. It can switch up to 40V at peak currents of 1A, with a DC gain of about 100. A similar transistor with the same current rating, the KSP2222, can also be used.

Fig 7. PN2222 pinout

7. 10 kΩ RESISTOR - Resistor that goes between the Raspberry Pi GPIO pin and the PN2222 transistor.

8. BREADBOARD - used to assemble the IR circuit.
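The wiring above (GPIO pin, 10 kΩ resistor into the PN2222 base, IR LED in the collector leg) can be sanity-checked with a few lines of Python before LIRC is configured. This sketch is ours, not from the report; the BCM pin number is an assumption, and the real 38 kHz carrier is generated later by LIRC.

# Minimal wiring-check sketch (illustrative, not part of the original report).
# Assumes the 10 kOhm resistor runs from BCM pin 22 (an assumption) to the
# PN2222 base, with the IR LED in the collector path. The blink can be seen
# through a phone camera; the 38 kHz modulation itself is left to LIRC.
import time
import RPi.GPIO as GPIO

IR_PIN = 22  # hypothetical BCM pin driving the transistor base

GPIO.setmode(GPIO.BCM)
GPIO.setup(IR_PIN, GPIO.OUT)

try:
    for _ in range(5):                     # blink the IR LED a few times
        GPIO.output(IR_PIN, GPIO.HIGH)
        time.sleep(0.5)
        GPIO.output(IR_PIN, GPIO.LOW)
        time.sleep(0.5)
finally:
    GPIO.cleanup()                         # release the pin on exit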


3.2 SOFTWARE REQUIRED

1. RASPBIAN OS

Although the Raspberry Pi’s operating system is closer to the Mac than

Windows, it's the latter that the desktop most closely resembles.

It might seem a little alien at first glance, but using Raspbian is hardly any different to

using Windows (barring Windows 8 of course). There’s a menu bar, a web browser,

a file manager and no shortage of desktop shortcuts of pre-installed applications.

Raspbian is an unofficial port of Debian Wheezy armhf with compilation settings

adjusted to produce optimized "hard float" code that will run on the Raspberry Pi. This

provides significantly faster performance for applications that make heavy use of floating

point arithmetic operations. All other applications will also gain some performance

through the use of advanced instructions of the ARMv6 CPU in Raspberry Pi.

Although Raspbian is primarily the efforts of Mike Thompson (mpthompson) and Peter Green

(plugwash), it has also benefited greatly from the enthusiastic support of Raspberry Pi

community members who wish to get the maximum performance from their device.

Fig 8. The Raspbian Desktop


2.JASPER CLIENT

Jasper is an open source platform for developing always-on, voice-controlled applications. Use your voice to ask for information, update social networks, control your home, and more. Jasper is always on, always listening for commands, and you can speak from meters away. Build it yourself with off-the-shelf hardware, and use its documentation to write your own modules.

Fig 9. Jasper client
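Since Jasper modules are ordinary Python files, a new voice command for the system can be sketched in a few lines. The module below is our illustration (module name, keyword and reply are assumptions) and follows Jasper's documented module convention of a WORDS list plus isValid() and handle() functions.

# tv.py - a hypothetical Jasper module sketch that reacts to the word "television".
# WORDS, isValid() and handle() follow Jasper's standard module interface.
import re

WORDS = ["TELEVISION"]          # words the speech recognizer should listen for


def isValid(text):
    """Return True when the spoken text mentions the television."""
    return bool(re.search(r"\btelevision\b", text, re.IGNORECASE))


def handle(text, mic, profile):
    """Acknowledge the command; the actual IR code is sent elsewhere (e.g. via LIRC)."""
    mic.say("Okay, switching the television on.")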

3. CMU Sphinx

CMUSphinx (http://cmusphinx.sourceforge.net) collects over 20 years of CMU research. All advantages are hard to list, but just to name a few:

State-of-the-art speech recognition algorithms for efficient speech recognition

CMUSphinx tools are designed specifically for low-resource platforms

Flexible design

Focus on practical application development and not on research


Support for several languages like US English, UK English, French, Mandarin, German, Dutch and Russian, and the ability to build models for others

BSD-like license which allows commercial distribution

Commercial support

Active development and release schedule

Active community (more than 400 users in the LinkedIn CMUSphinx group)

Wide range of tools for many speech-recognition related purposes (keyword spotting, alignment, pronunciation evaluation)

CMU Sphinx, also called Sphinx in short, is the general term to describe a group of speech

recognition systems developed at Carnegie Mellon University. These include a series of

speech recognizers (Sphinx 2 - 4) and an acoustic model trainer (SphinxTrain).

In 2000, the Sphinx group at Carnegie Mellon committed to open source several speech

recognizer components, including Sphinx 2 and later Sphinx 3 (in 2001). The speech

decoders come with acoustic models and sample applications. The available resources

include, in addition, software for acoustic model training, language model compilation and a

public-domain pronunciation dictionary, cmudict.

Here, we use the PocketSphinx tool, a version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor). PocketSphinx is under active development and incorporates features such as fixed-point arithmetic and efficient algorithms for GMM computation.


4. WinSCP

WinSCP (Windows Secure Copy) is a free and open-source SFTP, FTP, WebDAV and

SCP client for Microsoft Windows. Its main function is secure file transfer between a local

and a remote computer. Beyond this, WinSCP offers basic file manager and file

synchronization functionality. For secure transfers, it uses Secure Shell (SSH) and supports

the SCP protocol in addition to SFTP.[3]

Development of WinSCP started around March 2000 and continues. Originally it was

hosted by the University of Economics in Prague, where its author worked at the time.

Since July 16, 2003, it is licensed under the GNU GPL and hosted on SourceForge.net.[4]

WinSCP is based on the implementation of the SSH protocol from PuTTY and FTP

protocol from FileZilla.[5] It is also available as a plugin for Altap Salamander file

manager,[6] and there exists a third-party plugin for the FAR file manager.[7]

5.PUTTY

PuTTY is a free and open-source terminal emulator, serial console and network file transfer

application. It supports several network protocols, including SCP, SSH, Telnet, rlogin, and

raw socket connection. It can also connect to a serial port (since version 0.59). The name

"PuTTY" has no definitive meaning.[3]


PuTTY was originally written for Microsoft Windows, but it has been ported to various other operating systems. Official ports are available for some Unix-like platforms, with work-in-progress ports to Classic Mac OS and Mac OS X, and unofficial ports have been contributed to platforms such as Symbian,[4][5] Windows Mobile and Windows Phone.

PuTTY was written and is maintained primarily by Simon Tatham and is currently beta

software.

6. LIRC:

LIRC (Linux Infrared remote control) is an open source package that allows users to

receive and send infrared signals with a Linux-based computer system. There is a

Microsoft Windows equivalent of LIRC called WinLIRC. With LIRC and an IR receiver

the user can control their computer with almost any infrared remote control (e.g. a TV

remote control). The user may for instance control DVD or music playback with their

remote control. One GUI frontend is KDELirc, built on the KDE libraries.
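Once LIRC knows about a remote (via a lircd.conf definition), IR codes can be sent from Python by calling the irsend utility. The sketch below is a minimal illustration of ours; the remote name tv_remote and the button KEY_POWER are placeholders for whatever names the actual configuration uses.

# Minimal sketch: send an IR code through LIRC's irsend utility.
# "tv_remote" and "KEY_POWER" are placeholders for the names defined in lircd.conf.
import subprocess


def send_ir(button, remote="tv_remote"):
    """Ask the lircd daemon to transmit one IR code via the attached IR LED."""
    subprocess.check_call(["irsend", "SEND_ONCE", remote, button])


if __name__ == "__main__":
    send_ir("KEY_POWER")   # e.g. toggle the television power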

7.Python 2.7

Python is a widely used high-level, general-purpose, interpreted, dynamic

programming language.[3][4] Its design philosophy emphasizes code readability, and its

syntax allows programmers to express concepts in fewer lines of code than would be

possible in languages such as C++ or Java.[5][6] The language provides constructs

intended to enable clear programs on both a small and large scale.[7]

Python supports multiple programming paradigms, including object-oriented, imperative and

functional programming or procedural styles. It features a dynamic type system and

automatic memory management and has a large and comprehensive standard library.[8]


Python interpreters are available for installation on many operating systems, allowing Python

code execution on a wide variety of systems. Using third-party tools, such as Py2exe or

Pyinstaller,[29] Python code can be packaged into stand-alone executable programs for some

of the most popular operating systems, allowing the distribution of Python-based software for

use on those environments without requiring the installation of a Python interpreter.

CPython, the reference implementation of Python, is free and open-source software and

has a community-based development model, as do nearly all of its alternative

implementations. CPython is managed by the non-profit Python Software Foundation.

Why python 2.7?

If you can do exactly what you want with Python 3.x, great! There are a few minor downsides, such as slightly worse library support and the fact that most current Linux distributions and

Macs are still using 2.x as default, but as a language Python 3.x is definitely ready. As long as

Python 3.x is installed on your user's computers (which ought to be easy, since many people

reading this may only be developing something for themselves or an environment they control)

and you're writing things where you know none of the Python 2.x modules are needed, it is an

excellent choice. Also, most linux distributions have Python 3.x already installed, and all have it

available for end-users. Some are phasing out Python 2 as preinstalled default.


CHAPTER 4

IMPLEMENTATION

Both acoustic modeling and language modeling are important parts of modern statistically-

based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in

many systems. Language modeling is also used in many other natural language processing

applications such as document classification or statistical machine translation.

4.1 ALGORITHMS

HMM

Modern general-purpose speech recognition systems are based on Hidden Markov

Models. These are statistical models that output a sequence of symbols or quantities.

HMMs are used in speech recognition because a speech signal can be viewed as a

piecewise stationary signal or a short-time stationary signal. In a short time-scale (e.g.,

10 milliseconds), speech can be approximated as a stationary process. Speech can be

thought of as a Markov model for many stochastic purposes.

Another reason why HMMs are popular is because they can be trained automatically and

are simple and computationally feasible to use. In speech recognition, the hidden Markov

model would output a sequence of n-dimensional real-valued vectors (with n being a small

integer, such as 10), outputting one of these every 10 milliseconds. The vectors would

consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short

time window of speech and decorrelating the spectrum using a cosine transform, then

taking the first (most significant) coefficients. The hidden Markov model will tend to have in

each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which

will give a likelihood for each observed vector. Each word, or (for more general speech

recognition systems), each phoneme, will have a different output distribution; a hidden

Markov model for a sequence of words or phonemes is made by concatenating the

individual trained hidden Markov models for the separate words and phonemes.
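The cepstral front end just described (short windows, Fourier transform, log spectrum, cosine transform, keep the first coefficients) can be sketched with NumPy and SciPy. This is a simplified illustration of the idea, not the exact feature pipeline of any particular recognizer; frame length, step and coefficient count are assumptions.

# Simplified cepstral-coefficient sketch (illustrative only): window the signal,
# take an FFT, compress the magnitude spectrum with a log, then decorrelate with
# a DCT and keep the first few coefficients, as described in the text above.
import numpy as np
from scipy.fftpack import dct


def simple_cepstra(signal, rate=16000, frame_ms=25, step_ms=10, n_coeffs=13):
    frame_len = int(rate * frame_ms / 1000)
    step = int(rate * step_ms / 1000)
    feats = []
    for start in range(0, len(signal) - frame_len, step):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10    # avoid log(0)
        cep = dct(np.log(spectrum), norm='ortho')[:n_coeffs]  # keep most significant coefficients
        feats.append(cep)
    return np.array(feats)   # one row per ~10 ms step, 13 coefficients per frame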

Described above are the core elements of the most common, HMM-based approach to speech

recognition. Modern speech recognition systems use various combinations of a number of

standard techniques in order to improve results over the basic approach described above. A

typical large-vocabulary system would need context dependency for the phonemes (so

phonemes with different left and right context have different realizations as HMM states); it

would use cepstral normalization to normalize for different speaker and recording conditions; for

further speaker normalization it might use vocal tract length normalization (VTLN) for male-

female normalization and maximum likelihood linear regression (MLLR) for more general

speaker adaptation. The features would have so-called delta and delta-delta coefficients to

capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis

(HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based

projection followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied

covariance transform (also known as maximum likelihood linear transform, or MLLT). Many

systems use so-called discriminative training techniques that dispense with a purely statistical

approach to HMM parameter estimation and instead optimize some classification-related

measure of the training data. Examples are maximum mutual information (MMI), minimum

classification error (MCE) and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with

a new utterance and must compute the most likely source sentence) would probably

use the Viterbi algorithm to find the best path, and here there is a choice between

dynamically creating a combination hidden Markov model, which includes both the

acoustic and language model information, and combining it statically beforehand (the

finite state transducer, or FST, approach).
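The Viterbi search mentioned above can be made concrete with a toy implementation. The sketch below is a generic log-space dynamic program written for illustration, not code from any recognizer; in a real system the states are context-dependent phone states and the emission scores come from GMMs or a neural network.

# Toy Viterbi decoder in log space: finds the most likely state sequence
# given per-frame emission log-likelihoods and transition log-probabilities.
import numpy as np


def viterbi(log_emit, log_trans, log_init):
    """log_emit: (T, N) frame-by-state scores; log_trans: (N, N); log_init: (N,)."""
    T, N = log_emit.shape
    score = log_init + log_emit[0]          # best score ending in each state at t=0
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans   # cand[i, j]: come from state i, move to j
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # trace back the best path from the best final state
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score.max())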

A possible improvement to decoding is to keep a set of good candidates instead of just

keeping the best candidate, and to use a better scoring function (re scoring) to rate these

good candidates so that we may pick the best one according to this refined score. The set

of candidates can be kept either as a list (the N-best list approach) or as a subset of the

models (a lattice). Re scoring is usually done by trying to minimize the Bayes risk[7] (or an

approximation thereof): Instead of taking the source sentence with maximal probability, we

try to take the sentence that minimizes the expectancy of a given loss function with regards

to all possible transcriptions (i.e., we take the sentence that minimizes the average distance

to other possible sentences weighted by their estimated probability).


The loss function is usually the Levenshtein distance, though it can be different

distances for specific tasks; the set of possible transcriptions is, of course, pruned to

maintain tractability. Efficient algorithms have been devised to re score lattices

represented as weighted finite state transducers with edit distances represented

themselves as a finite state transducer verifying certain assumptions.[8]
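Because the loss used in rescoring is usually the Levenshtein (edit) distance between word sequences, a small reference implementation helps make the idea concrete. This is a standard dynamic-programming version written for illustration; it counts word substitutions, insertions and deletions.

# Word-level Levenshtein distance: the number of substitutions, insertions and
# deletions needed to turn one transcription into another.
def levenshtein(ref_words, hyp_words):
    m, n = len(ref_words), len(hyp_words)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all remaining reference words
    for j in range(n + 1):
        dist[0][j] = j                      # insert all hypothesis words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution / match
    return dist[m][n]


# Example: distance between two candidate transcriptions (prints 2)
print(levenshtein("turn on the tv".split(), "turn the tv on".split()))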

DEEP NEURAL NETWORK

A deep neural network (DNN) is an artificial neural network with multiple hidden layers of

units between the input and output layers.[6] Similar to shallow neural networks, DNNs can

model complex non-linear relationships. DNN architectures generate compositional models,

where extra layers enable composition of features from lower layers, giving a huge learning

capacity and thus the potential of modeling complex patterns of speech data.[6] The DNN is

the most popular type of deep learning architectures successfully used as an acoustic

model for speech recognition since 2010.

The success of DNNs in large vocabulary speech recognition occurred in 2010 by industrial

researchers, in collaboration with academic researchers, where large output layers of the DNN

based on context dependent HMM states constructed by decision trees were adopted.[7][8] [9]

One fundamental principle of deep learning is to do away with hand-crafted feature

engineering and to use raw features. This principle was first explored successfully in the

architecture of deep autoencoder on the "raw" spectrogram or linear filter-bank features,[2]

showing its superiority over the Mel-Cepstral features which contain a few stages of fixed

transformation from spectrograms. The true "raw" features of speech, waveforms, have

more recently been shown to produce excellent larger-scale speech recognition results.[3]

Since the initial successful debut of DNNs for speech recognition around 2009-2011, there has been huge new progress. This progress (as well as future directions) has been summarized into the following eight major areas:[8]


Scaling up/out and speedup DNN training and decoding;

Sequence discriminative training of DNNs;

Feature processing by deep models with solid understanding of the underlying mechanisms;

Adaptation of DNNs and of related deep models;

Multi-task and transfer learning by DNNs and related deep models;

Convolutional neural networks and how to design them to best exploit domain knowledge of speech;

Recurrent neural network and its rich LSTM variants;

Other types of deep models including tensor-based models and integrated deep generative/discriminative models.

Large-scale automatic speech recognition is the first and the most convincing successful case of deep learning in recent history, embraced by both industry and academia across the board.

Between 2010 and 2014, the two major conferences on signal processing and speech recognition,

IEEE-ICASSP and Interspeech, have seen near exponential growth in the numbers of accepted

papers in their respective annual conference papers on the topic of deep learning for speech

recognition. More importantly, all major commercial speech recognition systems (e.g., Microsoft

Cortana, Xbox, Skype Translator, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a

range of Nuance speech products, etc.) nowadays are based on deep learning methods.[5]


4.2 STEPS TO SETUP RASPBERRY PI

1.1. Connecting Everything Together

1. Plug the preloaded SD Card into the RPi.
2. Plug the USB keyboard and mouse into the RPi, perhaps via a USB hub. Connect the hub to power, if necessary.
3. Plug a video cable into the screen (TV or monitor) and into the RPi.
4. Plug your extras into the RPi (USB WiFi, Ethernet cable, external hard drive etc.). This is where you may really need a USB hub.
5. Ensure that your USB hub (if any) and screen are working.
6. Plug the power supply into the mains socket.
7. With your screen on, plug the power supply into the RPi microUSB socket.
8. The RPi should boot up and display messages on the screen.

It is always recommended to connect the MicroUSB power to the unit last (while most

connections can be made live, it is best practice to connect items such as displays with

the power turned off).

1.2. Operating System SD Card

As the RPi has no internal mass storage or built-in operating system it requires an SD

card preloaded with a version of the Linux Operating System.

• You can create your own preloaded card using any suitable SD card (4GBytes or above) you have to hand. We suggest you use a new blank card to avoid arguments over lost pictures.
• Preloaded SD cards will be available from the RPi Shop.

1.3. Keyboard & Mouse

Most standard USB keyboards and mice will work with the RPi. Wireless keyboard/mice should

also function, and only require a single USB port for an RF dongle. In order to use a Bluetooth

keyboard or mouse you will need a Bluetooth USB dongle, which again uses a single port.

Remember that the Model A has a single USB port and the Model B has two (typically a

keyboard and mouse will use a USB port each).


1.4. Display

There are two main connection options for the RPi display, HDMI (High Definition) and

Composite (Standard Definition).

• HD TVs and many LCD monitors can be connected using a full-size 'male' HDMI cable, and with an inexpensive adaptor if DVI is used. HDMI versions 1.3 and 1.4 are supported and a version 1.4 cable is recommended. The RPi outputs audio and video via HDMI, but does not support HDMI input.
• Older TVs can be connected using Composite video (a yellow-to-yellow RCA cable) or via SCART (using a Composite video to SCART adaptor). Both PAL and NTSC format TVs are supported.

When using a composite video connection, audio is available from the 3.5mm jack socket,

and can be sent to your TV, headphones or an amplifier. To send audio to your TV, you will

need a cable which adapts from 3.5mm to double (red and white) RCA connectors.

Note: There is no analogue VGA output available. This is the connection required by

many computer monitors, apart from the latest ones. If you have a monitor with only a

D-shaped plug containing 15 pins, then it is unsuitable.

1.5. Power Supply

The unit is powered via the microUSB connector (only the power pins are connected, so

it will not transfer data over this connection). A standard modern phone charger with a

microUSB connector will do, providing it can supply at least 700mA at +5Vdc. Check

your power supply's ratings carefully. Suitable mains adaptors will be available from the

RPi Shop and are recommended if you are unsure what to use.

Note: The individual USB ports on a powered hub or a PC are usually rated to provide

500mA maximum. If you wish to use either of these as a power source then you will need a

special cable which plugs into two ports providing a combined current capability of 1000mA.

1.6. Cables

You will need one or more cables to connect up your RPi system.

• Video cable alternatives: HDMI-A cable; HDMI-A cable + DVI adapter; Composite video cable; Composite video cable + SCART adaptor
• Audio cable (not needed if you use the HDMI video connection to a TV)
• Ethernet/LAN cable (Model B only)


1.7. Preparing your SD card for the Raspberry Pi

In order to use your Raspberry Pi, you will need to install an Operating System (OS) onto an SD card. An Operating System is the set of basic programs and utilities that allow your computer to run; examples include Windows on a PC or OSX on a Mac. These instructions will guide you through installing a recovery program on your SD card that will allow you to easily install different OS's and to recover your card if you break it.

1. Insert an SD card that is 4GB or greater in size into your computer.

2. Format the SD card so that the Pi can read it.

a. Windows
i. Download the SD Association's Formatting Tool from https://www.sdcard.org/downloads/formatter_4/eula_windows/
ii. Install and run the Formatting Tool on your machine
iii. Set the "FORMAT SIZE ADJUSTMENT" option to "ON" in the "Options" menu
iv. Check that the SD card you inserted matches the one selected by the Tool
v. Click the "Format" button

b. Mac
i. Download the SD Association's Formatting Tool from https://www.sdcard.org/downloads/formatter_4/eula_mac/
ii. Install and run the Formatting Tool on your machine
iii. Select "Overwrite Format"
iv. Check that the SD card you inserted matches the one selected by the Tool
v. Click the "Format" button

c. Linux
i. We recommend using gparted (or the command line version parted)
ii. Format the entire disk as FAT

3. Download the New Out Of Box Software (NOOBS) from: downloads.raspberrypi.org/noobs

4. Unzip the downloaded file
a. Windows: Right click on the file and choose "Extract all"
b. Mac: Double tap on the file


c. Linux: Run unzip [downloaded filename]

5. Copy the extracted files onto the SD card that you just formatted.

6. Insert the SD card into your Pi and connect the power supply.

7. You can also alternatively download the Raspbian image from https://raspberrypi.org

Your Pi will now boot into NOOBS and should display a list of operating systems that you can choose to install. If your display remains blank, you should select the correct output mode for your display by pressing one of the following number keys on your keyboard:

1. HDMI mode - this is the default display mode.
2. HDMI safe mode - select this mode if you are using the HDMI connector and cannot see anything on screen when the Pi has booted.
3. Composite PAL mode - select either this mode or composite NTSC mode if you are using the composite RCA video connector.
4. Composite NTSC mode.


4.3.DOWNLOADING OTHER SOFTWARE

1.CMU SPHINX

The CMU Sphinx toolkit has a number of packages for different tasks and applications, and it is sometimes confusing what to choose. To clear this up, here is the list:

Pocketsphinx — recognizer library written in C.

Sphinxtrain — acoustic model training tools

Sphinxbase — support library required by Pocketsphinx and Sphinxtrain

Sphinx4 — adjustable, modifiable recognizer written in Java

We have chosen pocketsphinx.

To build pocketsphinx in a unix-like environment (such as Linux, Solaris, FreeBSD etc) you

need to make sure you have the following dependencies installed: gcc, automake, autoconf,

libtool, bison, swig at least version 2.0, python development package, pulseaudio development

package. If you want to build without dependencies you can use the proper configure options like --without-swig-python, but for beginners it is recommended to install all dependencies.

You need to download both sphinxbase and pocketsphinx packages and unpack them. Please

note that you cannot use sphinxbase and pocketsphinx of different versions; please make sure

that versions are in sync. After unpack you should see the following two main folders:

sphinxbase-X.X

pocketsphinx-X.X

On step one, build and install SphinxBase. Change the current directory to the sphinxbase

folder. If you downloaded directly from the repository, you need to do this at least once

to generate the configure file:


% ./autogen.sh

if you downloaded the release version, or ran autogen.sh at least once, then compile and install:

% ./configure

% make

% make install

The last step might require root permissions, so it might be sudo make install. If you want to use fixed-point arithmetic, you must configure SphinxBase with the --enable-fixed option. You can also set the installation prefix with --prefix, and you can configure with or without SWIG python support.

The sphinxbase will be installed in /usr/local/ folder by default. Not every system loads libraries

from this folder automatically. To load them you need to configure the path to look for shared

libraries. It can be done either in the file /etc/ld.so.conf or by exporting environment variables:

export LD_LIBRARY_PATH=/usr/local/lib

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
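Once SphinxBase and PocketSphinx are built, the Python bindings can be sanity-checked by decoding a short recording. The sketch below is our own test, assuming SWIG Python support was enabled, the default model layout under /usr/local/share/pocketsphinx/model, and a 16 kHz 16-bit mono file named test.wav; adjust the paths for your installation.

# Sanity-check sketch: decode a 16 kHz mono WAV with the default US English model.
# Paths below assume the stock install layout; adjust as needed for your system.
import os
from pocketsphinx.pocketsphinx import Decoder

MODELDIR = "/usr/local/share/pocketsphinx/model"   # assumption: default install prefix

config = Decoder.default_config()
config.set_string('-hmm', os.path.join(MODELDIR, 'en-us/en-us'))
config.set_string('-lm', os.path.join(MODELDIR, 'en-us/en-us.lm.bin'))
config.set_string('-dict', os.path.join(MODELDIR, 'en-us/cmudict-en-us.dict'))

decoder = Decoder(config)
decoder.start_utt()
with open('test.wav', 'rb') as f:
    f.read(44)                            # skip the WAV header
    decoder.process_raw(f.read(), False, True)
decoder.end_utt()

hyp = decoder.hyp()
print('Recognized: ' + (hyp.hypstr if hyp else '(nothing)'))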

BUILDING A LANGUAGE MODEL

There are several types of models that describe the language to recognize: keyword lists, grammars, statistical language models and phonetic statistical language models. You can choose any decoding mode according to your needs, and you can even switch between modes at runtime.

Keyword lists

Pocketsphinx supports keyword spotting mode where you can specify the keyword list to look

for. The advantage of this mode is that you can specify a threshold for each keyword so that

keyword can be detected in continuous speech. All other modes will try to detect the words from

grammar even if you used words which are not in grammar. The keyword list looks like this:

oh mighty computer /1e-40/
hello world /1e-30/
other phrase /1e-20/

A threshold must be specified for every keyphrase. For shorter keyphrases you can use smaller thresholds like 1e-1; for longer keyphrases the threshold must be bigger. The threshold must be tuned to balance between false alarms and missed detections, and the best way to tune it is to use a prerecorded audio file. The tuning process is the following:

Take a long recording with a few occurrences of your keywords and some other sounds. You can take a movie soundtrack or something else. The length of the audio should be approximately 1 hour.

Run keyword spotting on that file with different thresholds for every keyword, using the following command:

pocketsphinx_continuous -infile <your_file.wav> -keyphrase "<your keyphrase>" \
    -kws_threshold <your_threshold> -time yes

From the keyword spotting results, count how many false alarms and missed detections you have encountered.

Select the threshold with the smallest number of false alarms and missed detections.

For the best accuracy it is better to have a keyphrase with 3-4 syllables. Too short phrases are easily confused.

Keyword lists are supported by pocketsphinx only, not by sphinx4.
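The same keyword spotting can be driven from Python instead of pocketsphinx_continuous. The snippet below is an illustrative sketch using the LiveSpeech helper from the pocketsphinx Python package; the wake phrase and threshold are placeholders to be tuned as described above.

# Keyword spotting sketch using the pocketsphinx Python package's LiveSpeech helper.
# "hello computer" and the threshold are placeholders; tune them on recorded audio first.
from pocketsphinx import LiveSpeech

speech = LiveSpeech(
    lm=False,                    # disable the language model: keyword mode only
    keyphrase='hello computer',  # hypothetical wake phrase
    kws_threshold=1e-20,         # detection threshold, tuned per keyphrase
)

for phrase in speech:            # blocks on the microphone, yielding one hit per detection
    print('Keyword detected: %s' % phrase)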

Grammars

Grammars describe a very simple type of language for command and control, and they are usually written by hand or generated automatically within the code. Grammars usually do not have probabilities for word sequences, but some elements might be weighted. Grammars are created in the JSGF format and usually have an extension like .gram or .jsgf.

Grammars allow you to specify the possible inputs very precisely, for example that a certain word may be repeated only two or three times. However, this strictness can be harmful if your user accidentally skips a word which the grammar requires; in that case the whole recognition will fail. For that reason it is better to make grammars more relaxed: instead of listing exact phrases, just list a bag of words allowing arbitrary order. Avoid very complex grammars with many rules and cases; they only slow down the recognizer, and simple rules can be used instead. In the past grammars required a lot of effort to tune, to assign variants properly and so on; the large VXML consulting industry was built around that.


Language models

Statistical language models describe more complex language. They contain probabilities of words and word combinations. Those probabilities are estimated from sample data and automatically have some flexibility. For example, every combination of words from the vocabulary is possible, though the probability of each combination varies. If you create a statistical language model from a plain list of words, it will still allow other word combinations to be decoded, even though that might not be your intent. Overall, statistical language models are recommended for free-form input where the user could say anything in natural language, and they require far less engineering effort than grammars: you just list the possible sentences. For example, you might list numbers like "twenty one" and "thirty three", and the statistical language model will allow "thirty one" with a certain probability as well.

Overall, modern speech recognition interfaces tend to be more natural and avoid the command-and-control style of the previous generation. For that reason most interface designers prefer natural language recognition with a statistical language model over old-fashioned VXML grammars.

On the topic of VUI design you might be interested in the following book: "It's Better to Be a Good Machine Than a Bad Person: Speech Recognition and Other Exotic User Interfaces at the Twilight of the Jetsonian Age" by Bruce Balentine.

There are many ways to build statistical language models. When your data set is large, it makes sense to use the CMU language modeling toolkit. When the model is small, you can use a quick online web service. When you need specific options, or you just want to use your favorite toolkit which builds ARPA models, you can use that instead.

A language model can be stored and loaded in three different formats: text ARPA format, binary BIN format and binary DMP format. The ARPA format takes more space but can be edited; ARPA files have the .lm extension. The binary format takes significantly less space and is faster to load; binary files have the .lm.bin extension. It is also possible to convert between formats. The DMP format is obsolete and not recommended.

Building a grammar

Grammars are usually written manually in JSGF format:


#JSGF V1.0;

/**
 * JSGF Grammar for Hello World example
 */

grammar hello;

public <greet> = (good morning | hello) ( bhiksha | evandro | paul | philip | rita | will );

Building a Statistical Language Model

Text preparation

First of all you need to prepare a large collection of clean texts: expand abbreviations, convert numbers to words and remove non-word items. For example, to clean a Wikipedia XML dump you can use special Python scripts like https://github.com/attardi/wikiextractor. To clean HTML pages you can try http://code.google.com/p/boilerpipe, a nice package specifically created to extract text from HTML.
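As an illustration of this clean-up step, the following minimal Python sketch lowercases the text and strips punctuation and other non-word items; the file names are placeholders, and a real pipeline would also expand numbers and abbreviations into words instead of simply removing them:

# normalize_corpus.py - illustrative corpus clean-up sketch;
# raw_corpus.txt and clean_corpus.txt are placeholder file names.
import re

def normalize(line):
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)    # drop punctuation, digits, markup leftovers
    return " ".join(line.split())             # collapse repeated whitespace

with open("raw_corpus.txt", encoding="utf-8") as src, \
     open("clean_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = normalize(line)
        if cleaned:                            # skip lines that became empty
            dst.write(cleaned + "\n")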

For an example of how to create a language model from Wikipedia texts, please see http://trulymadlywordly.blogspot.ru/2011/03/creating-text-corpus-from-wikipedia.html

Once you have gone through the language model process, please consider submitting your language model to the CMUSphinx project so that it can be shared. Movie subtitles are a good source of spoken language.

Language modeling for languages like Mandarin is largely the same as for English, with one additional consideration: the input text must be word-segmented. A segmentation tool and an associated word list are provided to accomplish this.

Using other Language Model Toolkits

There are many toolkits that create an ARPA n-gram language model from text files.


Some toolkits you can try:

IRSTLM

MITLM

SRILM

If you are training a large-vocabulary speech recognition system, the language model training is outlined on a separate page, "Building a large scale language model for domain-specific transcription".

Once you have created the ARPA file you can convert the model to binary format if needed.

ARPA model training with SRILM

Training with SRILM is easy, which is why we recommend it. Moreover, SRILM is the most advanced toolkit to date. To train a model you can use the following command:

ngram-count -kndiscount -interpolate -text train-text.txt -lm your.lm

You can prune the model afterwards to reduce its size:

ngram -lm your.lm -prune 1e-8 -write-lm your-pruned.lm

After training it is worth testing the perplexity of the model on held-out test data:

ngram -lm your.lm -ppl test-text.txt

ARPA model training with CMUCLMTK

You need to download and install cmuclmtk. See CMU Sphinx Downloads for details.

The process for creating a language model is as follows:

1) Prepare a reference text that will be used to generate the language model. The language model

toolkit expects its input to be in the form of normalized text files, with utterances delimited by


<s> and </s> tags. A number of input filters are available for specific corpora such as

Switchboard, ISL and NIST meetings, and HUB5 transcripts. The result should be the

set of sentences that are bounded by the start and end sentence markers: <s> and

</s>. Here's an example:

<s> generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times </s>
<s> some dry intervals also with hazy sunshine especially in eastern parts in the morning </s>
<s> highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze </s>
<s> cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later </s>

More data will generate better language models. The weather.txt file from sphinx4 (used

to generate the weather language model) contains nearly 100,000 sentences.

2) Generate the vocabulary file. This is a list of all the words in the file:

text2wfreq < weather.txt | wfreq2vocab > weather.tmp.vocab

3) You may want to edit the vocabulary file to remove words (numbers, misspellings, names). If you find misspellings, it is a good idea to fix them in the input transcript.

4) If you want a closed-vocabulary language model (a language model that has no provisions for unknown words), then you should remove sentences from your input transcript that contain words that are not in your vocabulary file.

5) Generate the ARPA format language model with the commands:

% text2idngram -vocab weather.vocab -idngram weather.idngram < weather.closed.txt
% idngram2lm -vocab_type 0 -idngram weather.idngram -vocab weather.vocab -arpa weather.lm

6) Generate the CMU binary form (BIN):

sphinx_lm_convert -i weather.lm -o weather.lm.bin


The CMUCLMTK tools and commands are documented on the CMU-Cambridge Language Modeling Toolkit page.

Building a simple language model using web service

If your language is English and the text is small, it is sometimes more convenient to use a web service to build the model. Language models built in this way are quite functional for simple command and control tasks. First of all you need to create a corpus.

The “corpus” is just a list of sentences that you will use to train the language model. As an

example, we will use a hypothetical voice control task for a mobile Internet device. We'd like to

tell it things like “open browser”, “new e-mail”, “forward”, “backward”, “next window”, “last

window”, “open music player”, and so forth. So, we'll start by creating a file called corpus.txt:

open browser
new e-mail
forward
backward
next window
last window
open music player

Then go to the page http://www.speech.cs.cmu.edu/tools/lmtool-new.html. Simply click

on the “Browse…” button, select the corpus.txt file you created, then click “COMPILE

KNOWLEDGE BASE”.

The legacy version is also still available online:

http://www.speech.cs.cmu.edu/tools/lmtool.html

You should see a page with some status messages, followed by a page entitled “Sphinx

knowledge base”. This page will contain links entitled “Dictionary” and “Language

Model”. Download these files and make a note of their names (they should consist of a 4-digit number followed by the extensions .dic and .lm). You can now test your newly

created language model with PocketSphinx.

Converting model into binary format


To load large models quickly you will probably want to convert them to binary format, which saves decoder initialization time. This is not necessary for small models. Pocketsphinx and Sphinx3 can handle both formats with the -lm option. Sphinx4 automatically detects the format from the extension of the LM file.

The ARPA format and the binary format are mutually convertible. You can produce one from the other with the sphinx_lm_convert command from sphinxbase:

sphinx_lm_convert -i model.lm -o model.lm.bin
sphinx_lm_convert -i model.lm.bin -ifmt bin -o model.lm -ofmt arpa

You can also convert old DMP models to the bin format this way.

Using your language model

This section will show you how to use, test, and improve the language model you created.

Using your language model with PocketSphinx

If you have installed PocketSphinx, you will have a program called

pocketsphinx_continuous which can be run from the command-line to recognize speech.

Assuming it is installed under /usr/local, and your language model and dictionary are called

8521.dic and 8521.lm and placed in the current folder, try running the following command:

pocketsphinx_continuous -inmic yes -lm 8521.lm -dict 8521.dic

This will use your new language model, the dictionary and the default acoustic model. On Windows you also have to specify the acoustic model folder with the -hmm option:

bin/Release/pocketsphinx_continuous.exe -inmic yes -lm 8521.lm -dict 8521.dic -hmm model/en-us/en-us

You will see a lot of diagnostic messages, followed by a pause, then “READY…”. Now

you can try speaking some of the commands. It should be able to recognize them with

complete accuracy. If not, you may have problems with your microphone or sound card.
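The same model can also be loaded from a Python program instead of the command line. The following is a hedged sketch using the pocketsphinx Decoder bindings with the 8521.lm and 8521.dic files; the acoustic model path is an assumption for a default /usr/local installation and command.raw is a placeholder audio file (16 kHz, 16-bit mono PCM):

# decode_with_lm.py - illustrative sketch of decoding an audio file with
# the language model and dictionary generated above.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', '/usr/local/share/pocketsphinx/model/en-us/en-us')
config.set_string('-lm', '8521.lm')
config.set_string('-dict', '8521.dic')

decoder = Decoder(config)
decoder.start_utt()
with open('command.raw', 'rb') as audio:
    while True:
        buf = audio.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

if decoder.hyp() is not None:
    print('recognized:', decoder.hyp().hypstr)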


Using your language model with Sphinx4

In the Sphinx4 high-level API you need to specify the location of the language model in the Configuration:

configuration.setLanguageModelPath("file:8754.lm");

If the model is in the resources you can reference it with a resource: URL:

configuration.setLanguageModelPath("resource:/com/example/8754.lm");

GENERATING THE DICTIONARY

There are various tools to help you extend an existing dictionary with new words or to build a new dictionary from scratch. If your language already has a dictionary it is recommended to use it, since it is carefully tuned for best performance. If you are starting a new language you need to account for various reduction and coarticulation effects, which make it very hard to create accurate rules to convert text to sounds. However, practice shows that even naive conversion can produce good results for speech recognition. For example, many developers have successfully created ASR systems with a simple grapheme-based dictionary where each letter is just mapped to itself rather than to the corresponding phone.

For most languages you need to use specialized grapheme-to-phoneme (g2p) code to do the conversion using machine learning methods and an existing small database. Nowadays the most accurate g2p tools are Phonetisaurus:

http://code.google.com/p/phonetisaurus

and sequitur-g2p:

http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html


Also note that almost every TTS package has g2p code included. For example, you can use the g2p code from FreeTTS, written in Java (see the FreeTTS example in Sphinx4):

http://cmusphinx.sourceforge.net/projects/freetts

OpenMary Java TTS:

http://mary.dfki.de/

or espeak for C:

http://espeak.sourceforge.net

Please note that if you use TTS you often need to do phoneset conversion, since TTS phonesets are usually more extensive than required for ASR. However, there is a great advantage in TTS tools because they usually contain more of the required functionality than simple g2p; for example, they do tokenization by converting numbers and abbreviations to spoken form.

For English you can use simpler capabilities by using the online web service:

http://www.speech.cs.cmu.edu/tools/lmtool.html

The online LM Tool produces a dictionary which matches its language model. It uses the latest CMU dictionary as a base, and is programmed to guess at pronunciations of words not in the existing dictionary. You can look at the log file to find which words were guesses and make your own corrections if necessary. With the advanced option, LM Tool can use a hand-made dictionary that you specify for your specialized vocabulary, or for your own pronunciations as corrections. The hand dictionary must be in the same format as the main dictionary.

If you want to run lmtool offline you can check it out from Subversion:


http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/trunk/logios

2.TEXT TO SPEECH

eSpeak is a compact open-source speech synthesizer for many platforms. Speech

synthesis is done offline, but most voices can sound very “robotic”.

Festival uses the Festival Speech Synthesis System, an open source speech

synthesizer developed by the Centre for Speech Technology Research at the University

of Edinburgh. Like eSpeak, it also synthesizes speech offline.

The initial voice used was eSpeak; it was later changed to Festival:

sudo apt-get update
sudo apt-get install festival festvox-don
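The synthesizer can then be driven from the Python program through the command-line tools installed above. The sketch below is only illustrative and the speak() helper name is our own; festival --tts reads the text to speak from standard input, while espeak takes it as an argument:

# speak.py - minimal sketch of calling the offline synthesizers from Python.
import subprocess

def speak(text, engine='festival'):
    if engine == 'festival':
        # festival --tts reads the text to speak from standard input
        subprocess.run(['festival', '--tts'], input=text.encode('utf-8'), check=True)
    else:
        # espeak takes the text directly as an argument
        subprocess.run(['espeak', text], check=True)

if __name__ == '__main__':
    speak('Hello, how can I help you?')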


4.4.SETTING UP LIRC

First, we need to install and configure LIRC to run on the Raspberry Pi:

sudo apt-get install lirc

Second, you have to modify two files before you can start testing the receiver and IR LED. Add this to your /etc/modules file:

lirc_dev

lirc_rpi gpio_in_pin=23 gpio_out_pin=22

Change your /etc/lirc/hardware.conf file to:

########################################################
# /etc/lirc/hardware.conf
#
# Arguments which will be used when launching lircd
LIRCD_ARGS="--uinput"

# Don't start lircmd even if there seems to be a good config file
# START_LIRCMD=false

# Don't start irexec, even if a good config file seems to exist.
# START_IREXEC=false

# Try to load appropriate kernel modules
LOAD_MODULES=true

# Run "lircd --driver=help" for a list of supported drivers.
DRIVER="default"

# usually /dev/lirc0 is the correct setting for systems using udev
DEVICE="/dev/lirc0"
MODULES="lirc_rpi"

# Default configuration files for your hardware if any
LIRCD_CONF=""
LIRCMD_CONF=""
########################################################

Now restart lircd so it picks up these changes:

sudo /etc/init.d/lirc stop

sudo /etc/init.d/lirc start

Edit your /boot/config.txt file and add:

dtoverlay=lirc-rpi,gpio_in_pin=23,gpio_out_pin=22


Now, connect the circuit.

Fig 11. Schematic

Testing the IR receiver

Testing the IR receiver is relatively straightforward.

Run these two commands to stop lircd and start outputting raw data from the IR receiver:

sudo /etc/init.d/lirc stop

mode2 -d /dev/lirc0

Point a remote control at your IR receiver and press some buttons. You should see

something like this:

space 16300

pulse 95

space 28794

pulse 80

space 19395

Next, record your remote control's buttons into a configuration file using irrecord. irrecord will ask you to name the buttons you're programming as you program them. Be sure to run irrecord --list-namespace to see the valid names before you begin.

Here are the commands we ran to generate a remote configuration file:

# Stop lirc to free up /dev/lirc0

sudo /etc/init.d/lirc stop

# Create a new remote control configuration file (using /dev/lirc0) and save the output to

~/lircd.conf

irrecord -d /dev/lirc0 ~/lircd.conf

# Make a backup of the original lircd.conf file

sudo mv /etc/lirc/lircd.conf /etc/lirc/lircd_original.conf

# Copy over your new configuration file

sudo cp ~/lircd.conf /etc/lirc/lircd.conf

# Start up lirc again

sudo /etc/init.d/lirc start

Once you’ve completed a remote configuration file and saved/added it to

/etc/lirc/lircd.conf you can try testing the IR LED. We’ll be using the irsend application

that comes with LIRC to facilitate sending commands. You’ll definitely want to check out

the documentation to learn more about the options irsend has.

Here are the commands we ran to test the IR LED (using the "tatasky" remote configuration file we created):

# List all of the commands that LIRC knows for 'tatasky'
irsend LIST tatasky ""

# Send the KEY_POWER command once

irsend SEND_ONCE tatasky KEY_POWER

# Send the KEY_VOLUMEDOWN command once

irsend SEND_ONCE tatasky KEY_VOLUMEDOWN

The last step is to connect this module to the Python program.
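A minimal sketch of that connection is shown below: the irsend commands used above are wrapped in Python helper functions that the voice-command handler can call. The remote name 'tatasky' and the key names come from the configuration file recorded earlier:

# ir_control.py - sketch of wrapping the irsend commands so the
# voice-command handler can call them as ordinary Python functions.
import subprocess

REMOTE = 'tatasky'   # name used when the remote configuration was recorded

def send_key(key):
    # Send a single IR command, e.g. send_key('KEY_POWER').
    subprocess.run(['irsend', 'SEND_ONCE', REMOTE, key], check=True)

def tv_power():
    send_key('KEY_POWER')

def volume_down():
    send_key('KEY_VOLUMEDOWN')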

4.5 WORKING OF IR RECEIVER AND TRANSMITTER

An IR LED, also known as IR transmitter, is a special purpose LED that transmits infrared rays

in the range of 760 nm wavelength. Such LEDs are usually made of gallium arsenide or

aluminium gallium arsenide. They, along with IR receivers, are commonly used as sensors.

The emitter is simply an IR LED (Light Emitting Diode) and the detector is simply an IR photodiode which is sensitive to IR light of the same wavelength as that emitted by the IR LED. When IR light falls on the photodiode, its resistance, and correspondingly its output voltage, changes in proportion to the magnitude of the IR light received. This is the underlying working principle of the IR sensor.


4.6 FLOW CHART OF PROGRAM

Fig 12 Flowchart

The flowchart of the Python script is shown in Fig 12. The voice input is first checked to see whether it is the keyword. The system then sends a high beep through the audio out to indicate that the microphone is actively listening. The next voice input is compared with the configured commands and the corresponding function is called.
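A simplified Python sketch of this control flow is given below; listen(), beep() and the command handlers are placeholders rather than the project's actual module code, and the wake word is an assumption:

# main_loop.py - simplified sketch of the flow in Fig 12.
KEYWORD = 'jasper'          # the configured wake word (an assumption)

COMMANDS = {
    'play music': lambda: print('starting the music player'),
    'read book': lambda: print('starting the book reader'),
    'change channel': lambda: print('sending the IR channel command'),
}

def listen():
    # Stand-in for the speech recognizer: returns one recognized phrase.
    return input('(simulated speech input) ')

def beep():
    print('*high beep* - microphone is actively listening')

while True:
    phrase = listen()
    if KEYWORD not in phrase.lower():
        continue                     # ignore everything until the keyword
    beep()                           # acknowledge, then wait for the command
    command = listen().lower()
    for name, action in COMMANDS.items():
        if name in command:
            action()                 # call the corresponding function
            break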


4.7 BLOCK DIAGRAM

Here we are using CMU Sphinx together with the jasper-client "brain", which acts as the decision-making layer that maps decoded speech to actions. Python modules are written for the various functions. First the configured keyword is spoken and we hear a high beep, which means Jasper is listening. The command is then given, which is decoded against the pocketsphinx dictionary using HMM computation. The decoded text is matched against the words declared in the modules and the appropriate function is executed, which can be playing a song or video, reading a book, changing the TV channel or playing a quiz game. The song and video database can contain songs in any regional language as well. The output of the system is then heard through the speakers or earphones.

Fig 13.Block Diagram of System
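As an illustration of the Python modules mentioned above, the following is a hedged sketch of a jasper-client style module. The WORDS list, isValid() and handle() entry points follow the standard Jasper module interface, while the body and the play_song() helper are placeholders only:

# Song.py - hedged sketch of a jasper-client style module.
import re
import random

WORDS = ["SONG", "MUSIC"]

def play_song():
    # Placeholder: the real module would pick a file from the song
    # database (including regional language songs) and play it.
    return random.choice(["song_one.mp3", "song_two.mp3"])

def handle(text, mic, profile):
    # Called by the brain when isValid() matched the decoded text.
    chosen = play_song()
    mic.say("Playing " + chosen)

def isValid(text):
    # Return True if this module should handle the decoded text.
    return bool(re.search(r'\b(song|music)\b', text, re.IGNORECASE))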


CHAPTER 5

FURTHER ENHANCEMENTS

1.RECOGNITION WITHOUT INTERNET ACCESS

We are well aware that internet access is not available throughout our country. Currently, India is nowhere near meeting the target for a service which is considered almost a basic necessity in many developed countries.

In such cases this project may not function; therefore we are enhancing the project to work even without internet access, using offline recognition toolkits such as CMU Sphinx.

2. GSM Module for voice activated calling

The Raspberry Pi SIM800 GSM/GPRS Add-on V2.0 is customized for the Raspberry Pi interface and is based on the SIM800 quad-band GSM/GPRS/BT module. AT commands can be sent via the serial port on the Raspberry Pi, so functions such as dialing and answering calls, sending and receiving messages and surfing online can be realized. Moreover, the module supports powering on and resetting via software.

Fig.14 GSM Quadband 800A
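A minimal sketch of voice-activated dialing through this add-on is shown below, assuming the pyserial package; the serial device name, baud rate and phone number are assumptions that depend on how the board is wired and configured, while ATD<number>; and ATH are standard AT commands for starting and ending a voice call:

# gsm_call.py - sketch of dialing through the SIM800 add-on with AT
# commands over the Pi serial port; port, baud rate and number are
# placeholders.
import time
import serial

def dial(number, port='/dev/ttyAMA0', baud=115200):
    with serial.Serial(port, baud, timeout=1) as ser:
        ser.write(b'AT\r')                             # check the module responds
        time.sleep(0.5)
        ser.write(('ATD' + number + ';\r').encode())   # start a voice call
        time.sleep(20)                                 # keep the call up briefly
        ser.write(b'ATH\r')                            # hang up

if __name__ == '__main__':
    dial('09876543210')   # placeholder number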


3. HOME AUTOMATION

With the right level of ingenuity, the sky is the limit on what you can automate in your home, but here are a few basic categories of tasks you can pursue: automate your lights to turn on and off on a schedule, remotely, or when certain conditions are triggered; and set your air conditioner to keep the house temperate when you're home and to save energy while you're away.

Fig.15 Home automation possibilities


CHAPTER 6

APPLICATIONS

Usage in education and daily life

Speech recognition can be useful for learning a second language. It can teach proper pronunciation, in addition to helping a person develop fluency in their speaking skills.[6]

Students who are blind or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as to use a computer by commanding it with their voice instead of having to look at the screen and keyboard.[6]

Aerospace (e.g. space exploration, spacecraft, etc.): NASA's Mars Polar Lander used speech recognition technology from Sensory, Inc. in the Mars Microphone on the lander.[7]

Automatic subtitling with speech recognition[7]

Automatic translation

Court reporting (real-time speech writing)

Telephony and other domains

ASR in the field of telephony is now commonplace, and in the field of computer gaming and simulation it is becoming more widespread. However, despite the high level of integration with word processing in general personal computing, ASR in the field of document production has not seen the expected increase in use.

The improvement of mobile processor speeds made speech-enabled Symbian and Windows Mobile smartphones feasible. Speech is used mostly as part of the user interface, for creating


predefined or custom speech commands. Leading software vendors in this field are:

Google, Microsoft Corporation (Microsoft Voice Command), Digital Syphon (Sonic

Extractor), LumenVox, Nuance Communications (Nuance Voice Control), VoiceBox

Technology, Speech Technology Center, Vito Technologies (VITO Voice2Go), Speereo

Software (Speereo Voice Translator), Verbyx VRX and SVOX.

In-car systems

Typically a manual control input, for example by means of a finger control on the

steering-wheel, enables the speech recognition system and this is signalled to the driver

by an audio prompt. Following the audio prompt, the system has a "listening window"

during which it may accept a speech input for recognition.

Simple voice commands may be used to initiate phone calls, select radio stations or play

music from a compatible smartphone, MP3 player or music-loaded flash drive. Voice

recognition capabilities vary between car make and model. Some of the most recent car

models offer natural-language speech recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is,

therefore, no need for the user to memorize a set of fixed command words.

Fig 16.Car automation


Helicopters

The problems of achieving high recognition accuracy under stress and noise pertain strongly to

the helicopter environment as well as to the jet fighter environment. The acoustic noise problem

is actually more severe in the helicopter environment, not only because of the high noise levels

but also because the helicopter pilot, in general, does not wear a facemask, which would reduce

acoustic noise in the microphone. Substantial test and evaluation programs have been carried

out in the past decade in speech recognition systems applications in helicopters, notably by the

U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal

Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in

the Puma helicopter. There has also been much useful work in Canada. Results have been

encouraging, and voice applications have included: control of communication radios, setting of

navigation systems, and control of an automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on

pilot effectiveness. Encouraging results are reported for the AVRADA tests, although

these represent only a feasibility demonstration in a test environment. Much remains to

be done both in speech recognition and in overall speech technology in order to

consistently achieve performance improvements in operational settings.

High-performance fighter aircraft

Substantial efforts have been devoted in the last decade to the test and evaluation of

speech recognition in fighter aircraft. Of particular note is the U.S. program in speech

recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16

VISTA), and a program in France installing speech recognition systems on Mirage aircraft,

and also programs in the UK dealing with a variety of aircraft platforms. In these programs,

speech recognizers have been operated successfully in fighter aircraft, with applications

including: setting radio frequencies, commanding an autopilot system, setting steer-point

coordinates and weapons release parameters, and controlling flight display.


REFERENCES

[1] D. Yu and L. Deng, "Automatic Speech Recognition: A Deep Learning Approach", Springer, 2014.

[2] C. Becchetti and L. Prina Ricotti, "Speech Recognition: Theory and C++ Implementation", 2008 edition.

[3] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995. doi:10.1109/89.365379.

[4] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang, "Phoneme recognition using time-delay neural networks", IEEE Transactions on Acoustics, Speech and Signal Processing, 1989.

[5] Microsoft Research, "Speaker Identification (WhisperID)", Microsoft. Retrieved 21 February 2014.

[6] "Low Cost Home Automation Using Offline Speech Recognition", International Journal of Signal Processing Systems, vol. 2, no. 2, pp. 96-101, 2014.

[7] B. H. Juang and L. R. Rabiner, "Automatic speech recognition - a brief history of the technology development", p. 6. Retrieved 17 January 2015.

[8] L. Deng and X. Li, "Machine Learning Paradigms for Speech Recognition: An Overview", IEEE Transactions on Audio, Speech, and Language Processing, 2013.

[9] P. V. Hajar and A. Andurkar, "Review Paper on System for Voice and Facial Recognition using Raspberry Pi", International Journal of Advanced Research in Computer and Communication Engineering, vol. 4, no. 4, pp. 232-234, 2015.

[10] "Common Health Risks of the Bedridden Patient", Carefect Blog Team, October 24, 2013.
