Design of a Limited Speech Recognition System for Use in a Braill


McMaster University

DigitalCommons@McMaster

EE 4BI6 Electrical Engineering Biomedical Capstones

Department of Electrical and Computer Engineering

4-23-2010

Design of a Limited Speech Recognition System for use in a Braille Teaching Device

Brett Lindsay McMaster University

This Capstone is brought to you for free and open access by the Department of Electrical and Computer Engineering at DigitalCommons@McMaster.

It has been accepted for inclusion in EE 4BI6 Electrical Engineering Biomedical Capstones by an authorized administrator of

DigitalCommons@McMaster. For more information, please contact [email protected].

Recommended Citation

Lindsay, Brett, "Design of a Limited Speech Recognition System for use in a Braille Teaching Device" (2010). EE 4BI6 Electrical Engineering Biomedical Capstones. Paper 34. http://digitalcommons.mcmaster.ca/ee4bi6/34


Design of a Limited Speech Recognition System
for use in a
Braille Teaching Device

by
Brett Lindsay

Electrical and Biomedical Engineering
Faculty Advisor: Dr. Thomas E. Doyle

Electrical and Biomedical Engineering Project Report
submitted in partial fulfillment of the degree of
Bachelor of Engineering

McMaster University
Hamilton, Ontario, Canada

April 23, 2010

Copyright © April 2010 by Brett Lindsay



Abstract

This report defines the scope and content of the Electrical and Biomedical Engineering Capstone Project submitted by Brett Lindsay. The project involved the creation of a limited Speech Recognition system for use in a Braille Teaching Device. The greater project (the Braille Teaching Device itself) was completed in tandem with Messrs. Chris Agam and Jonathon Hernandez. It was felt that the Speech Recognition component would be a valuable addition to the project because a teaching device for the visually impaired would otherwise require an assistant to operate. The Speech Recognition system was created by breaking the problem into four subsections: the collection of data upon call by the teaching program, the manipulation of data, the recognition algorithms to categorize said data, and the passing of results back to the teaching program. For the recognition block, the relatively simple method of Dynamic Time Warping was chosen over more complex options such as Hidden Markov Models or Neural Networks. This method presented some problems as documented, specifically a tendency to favour letters with larger file sizes (such as 'w'). The Speech Recognition system created during the course of this project failed to deliver the desired efficiency of 60% with as few false positives as possible. While the Speech Recognition system presented is viable, its effectiveness is below that of systems available on the market for a comparable price.



Acknowledgements

Chris Agam was a student at McMaster University in the Electrical and Biomedical Engineering program and was a member of the group creating a Braille Teaching Device. His project was the physical device itself. He provided the idea for the project.

Jon Hernandez was a student at McMaster University in the Electrical and Biomedical Engineering program and was a member of the group creating a Braille Teaching Device. His project was the programming of the microcontroller as well as software for use with the device. He took part in the creation of the communication between his software, Mr. Agam's device, and Mr. Lindsay's speech recognition system.

Billy Taj was a student at McMaster University in the Mechatronics Engineering program and provided additional (basic) feedback in the testing of the device.

Dr. Thomas Doyle was a professor at McMaster University and functioned as the faculty adviser for the duration of the project.



Contents

ABSTRACT
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
NOMENCLATURE

1 Introduction
1.1 Background
1.2 Objectives
1.3 General Approach to the Problem
1.4 Scope of the Project

2 Literature Review
2.1 Speech Recognition Basics
2.2 Common Methods of Implementing Speech Recognition
2.3 Spectrograms
2.4 Comparable Project Results

3 Problem and Methodology of Solutions
3.1 Statement of Problem
3.2 Methodology of Solutions
3.3 Data Collection
3.4 Data Manipulation
3.4.1 Normalization
3.4.2 Windowing
3.4.3 Cepstral Filtering
3.5 Recognition Algorithm: Dynamic Time Warping
3.5.1 Spectrogram
3.5.2 Match Matrix
3.5.3 DTW
3.5.4 Library
3.6 Returning Results

4 Design Procedures
4.1 Speech Recognition Program
4.2 Data Collection
4.3 Data Manipulation
4.3.1 Normalization
4.3.2 Windowing
4.3.3 Cepstral Filtering
4.4 Recognition Algorithm: Dynamic Time Warping
4.4.1 Spectrogram
4.4.2 Match Matrix
4.4.3 DTW
4.4.4 Library
4.5 Returning Results

5 Testing Results and Discussion
5.1 Data Collection
5.2 Data Manipulation
5.2.1 Normalization
5.2.2 Windowing
5.2.3 Cepstral Filtering
5.3 Recognition Algorithm: Dynamic Time Warping
5.3.1 Spectrogram
5.3.2 Match Matrix
5.3.3 DTW
5.3.4 Library
5.4 Returning Results

6 Conclusions and Recommendations
6.1 Conclusions on Project Objectives
6.2 Recommendations

Appendix A: Computer Software Design Tools
Appendix B: Additional Testing Notes
Appendix C: Code of Software Elements
References
Vitae

List of Tables

2.1 Results from [8]
5.1 Time Difference DTW/DTWTHREE
5.2 Average Time and Size of wav/txt
5.3 Results of Speech Recognition



List of Figures

1.1 Braille Representations
2.1 Neural Network and HMM
2.2 DTW Simplified
2.3 DTW Simplified Part Two
2.4 Mathematical Equations of Spectrogram
2.5 Spectrogram of an Audio Signal
3.1 Speech Recognition Flow Diagram
3.2 Data Acquisition Tutorial Code
3.3 Hamming Window in MatLab
3.4 Visualization of Cepstral Filtering
3.5 Visualization of Match Matrix
3.6 Visualization of Distortion Matrix Creation
4.1 Speech Recognition Program
4.2 Select Code From Recorder.m
4.3 Select Code From normalizer.m
4.4 Select Code From hamWindow.m & usefullSig.m
4.5 Select Code From cepAnal.m
4.6 Select Code From Comparison Loops
4.7 Select Code From specCreate.m
4.8 Select Code From matchMat.m
4.9 Select Code From DTW.m
4.10 Visualization of Faster Trace
4.11 Select Code From DTWTHREE.m
4.12 Select Code From .libCreat.m
4.13 Select Code From speechRec.m
5.1 Phases of Data Manipulations
5.2 Timing Measurements of Data Manipulations
5.3 Spectrograms
5.4 Timing Measurements of Pattern Recognition
5.5 Results and c Values DTW Original
5.6 C Values for DTW Original
5.7 C Values Plotted Against File Size
5.8 Workings of matchMat and DTW
5.9 C Values for DTW w Trace
5.10 Results and c Values DTW w Trace
5.11 Visualization of DTW vs DTW w Breaking Code
5.12 Speed of Pattern Recognition for Library Sample Size
5.13 Time and Size of wav/txt



Nomenclature

Delimiter: a character used to separate independent pieces of data in text files or data streams.

DTW: Dynamic Time Warping

HMM: Hidden Markov Models

m-file: the format of MatLab files.

NN: Neural Networks

Phoneme: the smallest unit of sound, e.g. the sound 'ahh' or 'eee'.

Quefrency: a pseudo time domain resulting from Cepstral analysis.

Spectrogram: a representation of the frequencies native to a small portion of time in a signal.

SR: Speech Recognition



1 Introduction

1.1 Background

The greater group project, a collaboration between Chris Agam, Jon Hernandez, and myself, is a Braille teaching device to be used by the visually impaired. In the USA and across the world, Braille literacy numbers are staggeringly low; as of 2009, only ten percent of blind children in America were Braille literate [12].

Braille itself is a form of writing for the blind, consisting of six "cells" arranged two by three. By raising dots in various combinations of cells, one creates letters. For example, Figure 1.1 shows the letters 'a' and 'p', where the black dots represent the raised bumps.

The scarcity of those fluent in the language makes it a prime candidate for an electronic teaching device, allowing people to simply plug in and learn. This interactive nature vastly improves upon the teaching capabilities of a book and assistant, as the assistant will most likely not be fluent in Braille themselves. A teaching program will therefore give the assistant a great deal more ability to assist the learner.

Further, an important aspect of teaching methods and/or devices is the testing of the pupil on the subjects being learned. The fact that the pupil will be blind poses a challenge. While it is possible for the teaching program to allow an assistant to act as a supervisor of testing, a better solution would be direct interaction between user and program.

Figure 1.1: The Braille representations of the letters 'a' and 'p'.

To this end, this project focuses on the development of efficient and lean speech recognition software that will allow the user to test themselves as they learn. By independently creating our own software we cut down on cost as well as on the superfluous abilities of commercially available software (which focuses on continuously reconstructing large, complex sentences independent of the speaker).

1.2 Objectives

The objective of the project is the creation of speech recognition software for the Braille Teaching Device. The method will roughly follow the steps outlined by Jawed et al. [8] in their creation of a similar system. Their reported efficiency of 68% gave me confidence that a minimum efficiency of 60% is deliverable.

It is also necessary for the program to run as fast as possible in order to be useful. It should take, on average, no more than two seconds (plus recording time) to run.

1.3 General Approach to the Problem

The problem was to create a limited speech recognition program. This was to be achieved using the MathWorks MATLAB programming environment. After researching the various methods of speech recognition in use today, it was decided to use the simpler Dynamic Time Warping method. This gave me confidence in my ability to complete the project, as opposed to other methods which could have proven too difficult to implement.

1.4 Scope of the Project

The scope of the project was necessarily limited. As this project was being undertaken individually, there was some worry as to the complexity, since speech recognition projects are rather difficult for even professional entities to create. Therefore, it was decided that the software would only recognize thirty entries: the letters a-z of the alphabet as well as the commands 'enter', 'yes', 'no', and 'back'.



2 Literature Review

2.1 Speech Recognition Basics

Six diverse articles were consulted that cover the breadth of speech recognition theory and implementation. Articles like [7] provided general information about the basics, while those like [2] and [10] provided background on areas of speech recognition that will not necessarily be used in the project but help build a full understanding of the available options. References [1] and [11] provided background on the implementation of DTW with respect to speech recognition. Reference [8] is the most useful piece, as it outlines the general steps used to create a speech recognition software package similar to this project.

Three books were also consulted. [5] focused on the human creation and recognition of speech and, while providing background, was less useful from a practical standpoint. [4] was helpful in understanding the concept of cepstral analysis as well as the need for a non-rectangular window in the data-manipulation phase. [6] provided information on a broad range of topics in a more practical sense than the other books (though still largely theoretical).

The field of speech recognition can be broken down into discrete or continuous recognition, as well as speaker independent or dependent. Discrete systems require the user to pause between sounds, while continuous systems operate without breaks [7]. Speaker-dependent systems require the user to have done some training with the system to allow it to recognize the user, whereas independent systems will work regardless of user speech patterns, tones, et cetera (e.g. automated phone services) [7]. Discrete, dependent systems are the easiest to create.

2.2 Common Methods of Implementing Speech Recognition

For the actual speech recognition component of the system, there are three main methods found in the literature. The first is the Hidden Markov Model (HMM). It is a mathematical model in which the likelihood of the next state depends on the current state, and the states themselves are unobserved [2]. It is complex and very good at identifying speech that is slurred or accented (as in reality, where the computer will be unable to identify most of the information passed in and has to construct sentences out of what little it did understand). It also far exceeds the complexity of the project's goals, and so will not be used.

Also common to speech recognition systems are Neural Networks (NN) [2]. Features in a context window are run through a system of weighted nodes, the output of which is a classification of each input frame, measured in terms of the probabilities of phoneme-based categories.

Dynamic Time Warping (DTW), the final method found, will be used instead of the previously mentioned methods. This involves the modification of the input data's temporal characteristics to fit within the realms of a standard template, followed by (relatively) simple matching techniques [11]. In practice, this is achieved by taking the entered and manipulated data and creating a spectrogram. One then takes the data one wishes to compare the input with, and creates a spectrogram of it as well. Next, a local match matrix is created, defined as the cosine difference between the points in the two spectrogram matrices. From this local match matrix, one can trace the "path of least resistance" to get the quickest path through. It is then a simple matter to use the value of this path in a comparison structure to match an input signal against a variety of template signals and achieve a "best fit".
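The DTW steps just described can be sketched in a few lines. The sketch below is in Python (standard library only) rather than the MatLab used for the project, and it is an illustration under stated assumptions, not the project's code: the local match is taken as 1 minus the cosine similarity of two spectrogram frames, the path may step right, up, or diagonally, and all function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length frames (assumed local match)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 1.0  # assumption: an all-zero (silent) frame counts as a non-match
    return 1.0 - dot / norm

def local_match_matrix(frames_a, frames_b):
    """Compare every frame of one spectrogram with every frame of the other."""
    return [[cosine_distance(fa, fb) for fb in frames_b] for fa in frames_a]

def dtw_cost(local):
    """Accumulate the cheapest path from the first cell to the last cell."""
    rows, cols = len(local), len(local[0])
    inf = float("inf")
    acc = [[inf] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                best_prev = 0.0
            else:
                best_prev = min(
                    acc[i - 1][j] if i > 0 else inf,
                    acc[i][j - 1] if j > 0 else inf,
                    acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            acc[i][j] = local[i][j] + best_prev
    return acc[-1][-1]

def best_template(input_frames, templates):
    """Score the input against every template and return the cheapest one."""
    scores = {
        name: dtw_cost(local_match_matrix(input_frames, frames))
        for name, frames in templates.items()
    }
    return min(scores, key=scores.get), scores

# Toy "spectrograms": each frame has two frequency bins (made-up data).
templates = {
    "a": [[1, 0], [1, 0], [0, 1]],
    "b": [[0, 1], [0, 1], [1, 0]],
}
spoken = [[1, 0], [0, 1]]  # a shorter, time-compressed version of template "a"
name, scores = best_template(spoken, templates)
```

Because the path is allowed to stay on one input frame while advancing through several template frames, the shorter input still lines up with template "a" even though the two have different lengths.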

The concepts behind all three methods can be a little difficult to wrap one's head around. HMMs and NNs were not used, so the project does not concern itself further with their in-depth workings. DTW is a simple enough concept once one takes the time to simplify the example. For example, look at Figure 2.2. Here, for the purpose of demonstration, letters are matched rather than audio signals. On the left is an example of "CHRIS" matched against "CHRIS", while on the right we find "CHRIS" matched against "JON".

Figure 2.1: Diagrams of Neural Networks (left) and Hidden Markov Models (right).



The local match here is also simplified, and is created with the "distance" away from a letter being equal to 1. In the left matrix, one can see in the bottom left that it starts at 0, as both axes have the same value (blank). As one goes up the column, the value of the match gets further away from the wanted value (blank) and so continuously increases. At the points where both axes have the same value (along the centre diagonal) the match matrix continues to have values of zero because these are matches, while the values moving away from the diagonal continually increase due to increased mismatch.

On the right portion of Figure 2.2, where "CHRIS" is matched against "JON", one can see by comparison what happens when there are no matches between the letters in the words being tested. The further into the comparison, the more mismatch there is.

In Figure 2.3 below, there is a comparison of two signals which are closer to being similar than "CHRIS" and "JON". Both "CHRIS" and "KRIIS" share the feature of ending in "IS". Note how, while the values of the match matrix increase along the first three mismatched letters, the final two are matched and so carry the current value along.

Figure 2.2: Example of function of DTW using names Chris and Jon.

Figure 2.3: Example of function of DTW using names Chris and a misspelling of Chris, "Kriis".



The mathematics of DTW are discussed in more depth in section 3.5.

After the local match matrix is created, there are two methods for creating a comparison between signals. The first is to use the "final" value in the local match matrix as the definition of the best match. (In Figure 2.3, this would be the value in the top right corner. In the actual code created, this final value is represented in the bottom right corner; see sections 4 and 5.) In Figure 2.3, the two comparisons work out to have a best value of '0' for "CHRIS" vs "CHRIS" and a best value of '3' for "CHRIS" vs "KRIIS". One would therefore deem the left comparison the best match, and predict that it was the word said.
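The letter example can be reproduced with a small table-filling routine. This is a Python sketch of one reading of the simplified figures, not the project's MatLab code: it assumes a blank first row and column, a cost of 1 for every step away from a match, and a match carrying the diagonal value along unchanged (which is the classic edit-distance recurrence). The function names are hypothetical.

```python
def letter_match_table(word_a, word_b):
    """Build the simplified match table, including the blank row and column."""
    rows, cols = len(word_a) + 1, len(word_b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # moving away from the blank costs 1 per letter
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            step = 0 if word_a[i - 1] == word_b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j - 1] + step,  # diagonal: match or mismatch
                table[i - 1][j] + 1,         # skip a letter of word_a
                table[i][j - 1] + 1,         # skip a letter of word_b
            )
    return table

def final_value(word_a, word_b):
    """The "final" value is the last cell of the table."""
    return letter_match_table(word_a, word_b)[-1][-1]

def best_match(spoken, library):
    """Pick the library word with the smallest final value."""
    return min(library, key=lambda word: final_value(spoken, word))
```

Under these assumptions, `final_value("CHRIS", "CHRIS")` is 0 and `final_value("CHRIS", "KRIIS")` is 3, which agrees with the values quoted for Figure 2.3.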

Another method is to trace through the local match matrix and use the length of this trace as the definition of the best match. This has some practical advantages over the other method, as demonstrated in the results section of this report (section 5). However, there are some large drawbacks to this method. As one can see in the example presented in Figure 2.3, the traces of least resistance for matching "CHRIS" against both "CHRIS" and "KRIIS" are the same, despite the fact that one is a much better match than the other.
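The weakness of the trace-length measure can be seen with the same letter example. The Python sketch below rests on assumptions of my own: the table is the simplified edit-distance style table with a blank row and column, and the trace walks back from the final cell preferring diagonal steps whenever they are optimal. The names are hypothetical and this is not the project's MatLab trace code.

```python
def letter_match_table(word_a, word_b):
    """Simplified match table with a blank row and column (assumed construction)."""
    rows, cols = len(word_a) + 1, len(word_b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            step = 0 if word_a[i - 1] == word_b[j - 1] else 1
            table[i][j] = min(table[i - 1][j - 1] + step,
                              table[i - 1][j] + 1,
                              table[i][j - 1] + 1)
    return table

def trace_length(word_a, word_b):
    """Count the steps on a cheapest path from the final cell back to the origin."""
    table = letter_match_table(word_a, word_b)
    i, j = len(word_a), len(word_b)
    steps = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = 0 if word_a[i - 1] == word_b[j - 1] else 1
            if table[i][j] == table[i - 1][j - 1] + step:
                i, j = i - 1, j - 1  # diagonal step is optimal, take it
                steps += 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + 1:
            i -= 1  # step along word_a only
        else:
            j -= 1  # step along word_b only
        steps += 1
    return steps
```

With these choices both `trace_length("CHRIS", "CHRIS")` and `trace_length("CHRIS", "KRIIS")` come out as 5 steps, even though the final values of the two tables are 0 and 3, which is the ambiguity described above.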

These values need to be normalized so that they are comparable between matrices of various sizes, as there is the obvious problem of larger signals taking more steps and therefore producing larger final values.
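The size problem can be made concrete with a short Python sketch. Two assumptions are mine and are not taken from the report: the local match is the absolute difference between scalar samples, and the final value is normalized by dividing by the sum of the two signal lengths (one common choice among several). The function names are hypothetical.

```python
def dtw_final_value(a, b):
    """Cheapest accumulated mismatch between two scalar sequences."""
    inf = float("inf")
    acc = [[inf] * len(b) for _ in a]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            local = abs(x - y)
            if i == 0 and j == 0:
                prev = 0.0
            else:
                prev = min(
                    acc[i - 1][j] if i > 0 else inf,
                    acc[i][j - 1] if j > 0 else inf,
                    acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            acc[i][j] = local + prev
    return acc[-1][-1]

def normalized_value(a, b):
    """Assumed normalization: divide by the combined length of both signals."""
    return dtw_final_value(a, b) / (len(a) + len(b))

# The same per-sample mismatch (0.1) in a short pair and in a long pair.
raw_short = dtw_final_value([0.1] * 4, [0.0] * 4)
raw_long = dtw_final_value([0.1] * 8, [0.0] * 8)
norm_short = normalized_value([0.1] * 4, [0.0] * 4)
norm_long = normalized_value([0.1] * 8, [0.0] * 8)
```

Here the raw final value of the longer pair is about twice that of the shorter pair, although both pairs are equally good matches per sample; after normalization the two values agree.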

2.3 Spectrograms

Understanding spectrograms is necessary to understanding the project. In the explanation of DTW, the input signals ("CHRIS" etc.) were somewhat glossed over. In actual practice, the match matrix is created by comparing the spectrograms of two audio signals. A spectrogram is a representation of the power spectral density inherent to an audio signal over time. That is to say, it is the magnitude of the frequencies native to a point in time of a signal. The exact mathematical formulas involved in its creation are seen in Figure 2.4.

Figure 2.4: Mathematical equations for creation of a Spectrogram



STFT stands for Short Time Fourier Transform. It works by taking the Fourier Transform of the signal x(t) for only one short area at a time. This area is determined by the windowing function w(t-tau). The windowing function w slides along the signal so as to zero everything in the signal except for the very small part in which one wishes to find the frequency components. By repeatedly sliding the window and taking the Fourier Transform, one builds up a series of frequency values for each specific small amount of time.
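The slide-and-transform procedure can be sketched directly. The following Python example (standard library only, so it uses a slow direct DFT rather than an FFT) is a minimal illustration and not the project's MatLab code. The Hamming window, the frame length, the hop size, and the function names are all assumptions chosen for the example, and the sketch returns magnitudes rather than power.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n (assumed choice of window w)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def dft_magnitudes(frame):
    """Magnitude of the DFT for the non-negative frequency bins of one frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(total))
    return mags

def spectrogram(signal, frame_len, hop):
    """Slide the window along the signal and transform each windowed frame."""
    window = hamming(frame_len)
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        columns.append(dft_magnitudes(frame))
    return columns  # columns[time_index][frequency_bin]
```

As a usage note, with a sample rate of 8000 Hz and a frame length of 64 samples each frequency bin is 125 Hz wide, so a pure 1000 Hz tone should produce its largest magnitude in bin 8 of every column.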

This can hopefully be understood via Figure 2.5. This figure shows an audio signal (y-magnitude vs x-time) and its resulting spectrogram (magnitude of y-frequency at x-time). One can see that for the first pixel in x-time, a time with very little signal, there are only smaller y-frequencies (less than 1 kHz). But if one takes a pixel from x-time closer to 0.2 s, one can see that that pixel's corresponding frequencies are much larger (up to 4 kHz).

Figure 2.5: Spectrogram of an audio signal.



2.4 Comparable Project Results

Reference [8] includes testing that made it possible to gauge in advance the efficiency that is achievable. Their results are reproduced in Table 2.1 below.

Table 2.1: Results for project [8].



3 Problem and Methodology of Solutions

3.1 Statement of Problem

The basic problem is the identification of speech. The goal of the project as stated in the Proposal was for the software to recognize a combination of discrete speaker-dependent commands (e.g. 'Enter') and discrete speaker-independent characters (e.g. 'a') for testing.

The speech recognition system was created by breaking the problem into four subsections: the collection of data when signaled by the teaching program, the manipulation of data, the recognition algorithms to categorize said data, and the passing of results back to the teaching program.

3.2 Methodology of Solutions

There are four basic steps in the speech recognition system. The initial phase is data acquisition: the entering of data into the computer system from the user. Following this, the data must be manipulated into a usable form. After this, robust recognition algorithms must be used to match the input data with data saved in a library to correctly identify the sound. Finally, the identified sound must be passed out to the teaching program.

Figure 3.1: Flow diagram of the Speech Recognition blocks.



3.3 Data Collection

Data collection was achieved through the MatLab Data Acquisition Toolbox. This toolbox allows one

to interact with Microsoft windsound and take in audio signals directly from a microphone installed on

the computer in use. The toolbox will automatically bring in this data and store it as a workable matrix

in the MatLab environment.

While it is possible to do continuous recognition with the Data Acquisition Toolbox, it had already been

decided upon to build a discrete system. This would involve the use of triggers and set samples. It was

decided that a good length of time to allow the user to input would be 3 seconds. This was chosen as it

would allow the user enough time to say the letter even if they were somewhat unprepared.

The beginning and end of the data collection were deemed to necessitate a noise, to inform the user that

it had begun/stopped recording.

The Data Acquisition Toolbox came with a tutorial in it's use. The sample code provided was a good

place to start in learning how to use said device. Below is said code.

This is relatively simple and makes recording audio very simple. The first line is to set the type of

analog input being used - in this case winsound. One could modify this to be viable with a number of

comparable softwares for use on other systems (such as Mac Python).

The Sample Rate sets the sampling frequency (in Hz) and the Samples Per Trigger can be used to set

the length of the signal to be recorded (in this case, 3/8s). Once all of the wanted parameters are set,

one starts the analog input and the recording is done for you. This is then put into a matrix via the

getdata function, in a form that one can easily manipulate.
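The relation between the Sample Rate and Samples Per Trigger settings is simple arithmetic. The following is a minimal sketch using the tutorial values and the three-second window chosen for this project; the variable names are illustrative only and are not toolbox properties.

```matlab
% Recording length follows from SamplesPerTrigger / SampleRate.
sampleRate        = 8000;   % Hz, value used in the tutorial code
samplesPerTrigger = 3000;   % samples, value used in the tutorial code
durationTutorial  = samplesPerTrigger / sampleRate;   % seconds (3/8 s)

% Samples needed for a three-second recording at the same sampling rate.
wantedDuration = 3;                             % seconds
samplesNeeded  = wantedDuration * sampleRate;   % samples
```

In other words, to record for a given time one sets Samples Per Trigger to the wanted duration multiplied by the sampling frequency.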


ai = analoginput('winsound');

addchannel(ai, [1 2]);

set(ai, 'SampleRate', 8000);

set(ai, 'SamplesPerTrigger', 3000);

set(ai, 'TriggerType', 'immediate');

start(ai);

[data,time] = getdata(ai);

Figure 3.2: Data Acquisition Toolbox tutorial code.



3.4 Data Manipulation

3.4.1 Normalization

Normalization is the process by which the signal is brought into a range consistent with expected

values. Original research into the creation of a normalization algorithm led to the thought that it would

require analysis of the signal in order to find the peak value, followed by a reduction of the signal's

amplitude. That is, one would have to run through the entire signal, record the maximum value, and then

go through the entire signal once again and scale every point based on this maximum.

This posed the problem of being quite computationally wasteful, and as such initial thoughts were put

into a means by which this could be done at the same time as the program was checking the signal

values for the necessary windowing (see 3.4.2).

An alternative was created when looking for a way to maximize the potential provided by using the

MatLab program instead of another programming environment. MatLab has the advantage of being

built around the quick manipulation of whole matrices, and as such it was realised that one would be

able to normalize a signal by merely dividing it by its peak value, found with the built-in max function.
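As a minimal sketch of this whole-vector approach, the lines below scale a made-up signal so that its largest magnitude becomes 1. The sample values are invented for illustration, and taking abs before max is an assumption made here so that a large negative peak is handled as well.

```matlab
% Peak normalization of a whole vector in one step, with no explicit loop.
x  = [0.1; -0.4; 0.25; 0.05];   % made-up signal samples
xn = x / max(abs(x));           % largest magnitude becomes 1
```

Multiplying the result by a constant (for example 0.5) would bring the peak to any other wanted level.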

3.4.2 Windowing

Windowing is the process of dividing the signal into small sections to be looked at independently of

one another, and is simple to achieve (multiply the signal by zero except at point of interest). For this

program, it is assumed that the only region of interest is the letter being spoken. As such, it is not

necessary to window the signal multiple times - one need only determine where the useful portion of

the signal is and cut away everything else.

A function will be created to handle the extraction of the useful signal from the total three seconds of

input data. A rough form of zero crossing will be used to determine when a useful signal has begun and

ended. This involves the checking for a certain level - the "zero" - to be crossed.

Once the useful signal has been extracted, the harsh cutoff at the edges poses a problem. These abrupt

edges introduce very high frequency components into the signal. As the later pattern recognition stages depend on

creating spectrograms, this could be a problem (see 2.3 for description of spectrograms) [4]. As such,




one is required to use a window which is capable of removing the high frequencies at the edges while

not wrecking the frequency information present in the useful signal. Techniques for this include use of

a Hamming window. A Hamming Window can be described by the equation:

$$w[n] = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$

MatLab has a Hamming Window function built in, and so this will be used for ease (rather than using

the above formula). Figure 3.3 is a visualization in MatLab of a Hamming Window both in the time

domain and the frequency domain.
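For reference, the formula above can also be evaluated directly in a few lines. This is only a sketch with an illustrative window length and a placeholder signal; it is not the built-in function used in the project.

```matlab
% Hamming window evaluated from the formula, then applied to a signal.
N = 11;                                   % window length (illustrative)
n = (0:N-1)';                             % sample index as a column vector
w = 0.54 - 0.46*cos(2*pi*n/(N-1));        % window coefficients

x  = ones(N,1);                           % placeholder signal of the same length
xw = x .* w;                              % edges are tapered towards 0.08
```

The window is symmetric, equals 0.08 at both ends, and reaches 1 at the centre when N is odd.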

3.4.3 Cepstral Filtering

Cepstral Analysis involves the use of the Inverse DFT to separate the person’s characteristic vocal tract

sounds from the actual speech. The process for Cepstral Analysis has been well detailed via

information from [4], and should not be difficult to do from basic knowledge in MatLab coding

techniques.


Figure 3.3: Hamming Window in time and frequency domain as in MatLab



Cepstral filtering is very useful in the creation of a speech recognition system as one is required to

match speech, as opposed to voices. As such, the removal of sound distinctive to the user's vocal tract

will improve on the abilities of the pattern recognition.

Cepstral filtering works as follows:

The audio signal of one's voice has two components - the vocal excitation source (s) and the

vocal tract source (v). These two sources form the signal via a convolution such that:

$f(t) = v(t) * s(t)$

In order to remove the unwanted v(t), we take the Fourier Transform to get the frequency

domain representation, where a convolution in time becomes a multiplication.

$|F(f)| = |V(f)| \cdot |S(f)|$

We can then make the two distinct by using the properties of the logarithm.

$\ln|F(f)| = \ln|V(f)| + \ln|S(f)|$

If one now takes the inverse Fourier Transform, one ends up with a representation of the

original signal in what is termed the "quefrency" where the two signals have been separated (an

addition in frequency is an addition in time). The quefrency is in units of time, but it is no

longer an accurate representation of time, hence the new name. The movement into the

quefrency is visualized in Figure 3.4.

The wanted s components of human speech are known to reside in the lower quefrencies. In Figure 3.4,

the spike at a quefrency of ~8.5 ms is the v component, and can be filtered out.

Native to the MatLab coding environment are the functions cceps and icceps. These perform the

forward and inverse cepstral transformations of a signal into and out of the quefrency domain.
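The steps above (Fourier transform, logarithm of the magnitude, inverse transform) can be sketched with base functions only. Note that this computes the real cepstrum rather than the complex cepstrum returned by cceps, and that the test signal is a toy one (an impulse plus a delayed, scaled copy), not speech; N, D and the one-sixth cutoff are illustrative assumptions.

```matlab
% Real-cepstrum sketch on a toy signal: an impulse plus a delayed copy.
N = 256;                 % signal length (illustrative)
D = 60;                  % delay of the second component, in samples
x = zeros(N,1);
x(1)   = 1;              % direct component
x(1+D) = 0.5;            % delayed, scaled component

% Forward path: DFT -> log magnitude -> inverse DFT.
c = real(ifft(log(abs(fft(x)))));

% The delayed component appears as a peak at quefrency D (index D+1).
[~, k]    = max(c(2:N/2));
peakIndex = k + 1;

% Zero the middle quefrencies and keep both ends of the cepstrum.
pass = round(N/6);
mask = ones(N,1);
mask(pass:N-pass) = 0;
cFiltered = c .* mask;
```

Here the two components, which were convolved in time, show up at separate quefrencies, so the masked cepstrum no longer contains the peak at D.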


Figure 3.4: Visualization of the steps involved in Cepstral Filtering.



3.5 Recognition Algorithm: Dynamic Time Warping

The methodology of creating the recognition algorithm will be as such:

1. Audio signal has been input and manipulated into a form which is better for pattern

recognition. Now, its spectrogram will be created.

2. The creation of a match matrix using the spectrograms of the input signal and of the various

reference signals stored in the library.

3. The DTW process on the match matrix to get a value of relationship between the two signals.

3.5.1 Spectrogram

The concept of the spectrogram was detailed in section 2.3. MatLab allows one to easily create a

spectrogram of a signal with the built in specgram function.

3.5.2 Match Matrix

The match matrix is the overlay of two signals' spectrograms. This is done by finding the cosine

distance of the angle between two vectors for each point in the matrix [3], and is an example of a form

of Euclidean distance [8]. In Figure 3.5, for example, pixel (1,1) was found using the vector A(:,1) and

B(:,1) where A and B are the matrices of the two spectrograms being compared. Pixel (1,2) was created

from A(:,1) and B(:,2), and so on. It is also important to normalize the value in this matrix back down

to a reasonable level [3].
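A small worked example may help to show what each pixel of the match matrix holds. The two matrices below are made-up stand-ins for spectrogram magnitudes (rows are frequency bins, columns are time frames); the dimension argument to sum is added here as a precaution so that single-row inputs are still summed column by column.

```matlab
% Cosine similarity between every column of A and every column of B.
A = [1 0 1; 0 1 1];          % 2 frequency bins x 3 frames (made-up values)
B = [1 0; 0 2];              % 2 frequency bins x 2 frames (made-up values)

sA = sqrt(sum(A.^2, 1));     % length of each column vector of A
sB = sqrt(sum(B.^2, 1));     % length of each column vector of B
M  = (A' * B) ./ (sA' * sB); % M(i,j) compares A(:,i) with B(:,j)
```

Columns pointing in the same direction give 1 regardless of their magnitude, and orthogonal columns give 0, which is the normalization referred to above.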

Figure 3.5: Visualization of Match Matrix. [Plot title: Match Matrix of two audio signals; "a" vs "garbage".]



3.5.3 DTW

Data recognition utilizing DTW has been detailed through various sources, mainly [3],[6], & [9]. The

method in [3] involves the modification of the input and reference signals into their respective

spectrograms before DTW. The formula as given by [6] to solve the cumulative distortion measure:

$$D(i,j) = d(i,j) + \min_{p(i,j)} \left\{ D[p(i,j)] + T[(i,j), p(i,j)] \right\}$$

where d is a local measure between frame i of the input and j of the reference, p is the coordinates of

possible predecessors, and T is the associated cost of the transition. This matches well to the formula

given by [3]:

$$D(i+1,j+1) = M(i+1,j+1) + \min\{D(i,j),\ D(i+1,j),\ D(i,j+1)\}$$

The basics of what this formula means is the creation of the D (distortion) matrix from the M (match)

matrix. When one is creating the distortion matrix D(i,j), one begins at position (1,1) and sets this to a

null value. The D(i+1,j+1) value is then created using the match matrix M(i+1,j+1) as the basis, but

adding the value of the lowest "jump" from a preceding pixel.

Simplified, if one is creating the distortion point D(4,2) one begins with the match point M(4,2). One

then looks at the accumulated values of D(3,1), D(4,1) and D(3,2) and adds the lowest - the quickest way to get

there. This can be seen in figure 3.6. One can then trace through the distortion matrix to find the

quickest path, and use this value as a means of comparison.
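The construction of the distortion matrix can be sketched on a tiny example. The sketch assumes a local cost matrix C in which lower values mean a better match (for instance one minus a similarity value), accumulates over the distortion values of the three predecessors as the code in Figure 4.9 does, and pads the first row and column with Inf so that paths must start at the top left; C and the padding scheme are illustrative assumptions.

```matlab
% Cumulative distortion matrix for a 3x3 local cost matrix.
C = [0 1 1; 1 0 1; 1 1 0];       % made-up costs: cheapest along the diagonal
[m, n] = size(C);

D = inf(m+1, n+1);               % padded so that every cell has three predecessors
D(1,1) = 0;                      % starting point
D(2:end, 2:end) = C;

for i = 1:m
    for j = 1:n
        % add the cheapest of the diagonal, upper and left predecessors
        D(i+1,j+1) = D(i+1,j+1) + min([D(i,j), D(i,j+1), D(i+1,j)]);
    end
end

total = D(m+1, n+1);             % accumulated cost of the cheapest path
```

For this cost matrix the cheapest path runs straight down the diagonal, so the accumulated cost in the bottom right corner is zero.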

There was also thought put in to a way to create a faster way to trace through the distortion matrix, via

breaking away once outside certain bounds. Its effectiveness would need to be tested to see if the time

saved by breaking early would be more than the time incurred by the added code.


Figure 3.6: Visualization of Distortion Matrix Creation



3.5.4 Library

The library will be stored in a folder kept with the m-files, so that the speech recognition program can easily

access it. There will need to be a simple bit of code created capable of entering new files into the

library.

There were two main choices for the way in which to store the data:

1. Save the signals as Microsoft wav files using the MatLab function wavwrite, and access them

using the MatLab function wavread. Convert each accessed vector into its spectrogram every

time it is accessed.

2. Save the spectrograms of the signals as delimited text files using the MatLab function

dlmwrite, and access them using dlmread. Convert each vector into its spectrogram matrix

only once, before it's saved.

The thought process is that saving as a delimited text file should logically take the program less time to

access the spectrogram - as it won't have to convert it every time - compared to saving as a wav. This

will come at the expense of the library being a larger size, as the spectrogram matrix is much larger in

size than the signal's wave vector. Testing will be needed to determine the better method.

3.6 Returning Results

Upon program activation, the teaching program will pass the value of the entry attempting to be

recognized. Data Output will involve returning a signal of (correct) “character recognized”, “failure to

recognize”, or the (incorrect) recognized character (1-26=a-z; 27-30=Enter, Yes, No, Back) to the

teaching program. Outputs returned were to be sent in the form of:

• 1-30 - incorrect character, outputs the results of pattern recognition (1-30).

• 50 - no satisfactory match.

• 100 - correct character.

Original thought into the interaction of the MatLab speech recognition and the C# teaching program

was to use the MatLab Builder for .Net. This would have created a wrapper to allow the MatLab code

to be run in C#. Mr. Hernandez found another way to access this, using a C# program which only

required the m-files to be in the same directory. This was used instead.




4 Design Procedures

4.1 Speech Recognition Program

Figure 4.1 is the program speechRec.m. In this section, the design of its components will be outlined

by taking selected code from the relevant m-files. For the full code, see Appendix C.


function out = speechRec(in)

[audioIn fs]= recorder(); %get audio signal

audioIn = normalizer(audioIn);

audioIn = usefullSig(audioIn);

audioIn = cepAnal(audioIn);

audioIn = hamWindow(audioIn);

audioIn=specCreate(audioIn,fs);

%Comparison loop.

numLibEnt=30; %number of library entries, 1-30

%1-26 being alphabet, 27-30 being enter, yes, no, back.

numLibSam=2; %highest library sample index (ie. samples 0 to 2 of each entry).

cmin=500; %minimum comparison accepted.

c=cmin;

ctemp=0; %#ok<NASGU>

cp=0.7071; %From experimental data, if the DTW block produces a value of

%0.7071 then this is a perfect match. This value is normalised

%for any size difference, et cetera.

r=0; %r is the variable for which is the current lowest match c.

%if r stays as 0, we therefore never achieved a c lower than

%min and don't have a match.

for m=1:numLibEnt

for n=0:numLibSam

[x fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);

Y=specCreate(x,fs);

M=matchMat(audioIn,Y);

ctemp=abs(DTW(M)-cp);

if (ctemp<c)

c=ctemp;

r=m;

end

end

end

%returning block.

if r==0

out=50;

elseif r==in;

out=100;

else

out=r;

end

return

Figure 4.1: m-file code used for Speech Recognition Program



4.2 Data Collection

Data collection was designed through the function recorder.m. This function took in no arguments, and

output the recorded signal (as a 1xn vector) as well as the sampling frequency used to record the audio.

In Figure 4.2 one can see some of the code created to do this. The sampling frequency was set

permanently to 8.192 kHz - this is a standard value for inputting sound using winsound devices. The

time to record was also permanently set to 3 seconds. If one wished to change this, they'd have to go in

to the code to modify it. Allowing these to be changed via an input was considered, but ultimately

discounted as pointless.

The analogue input was set to be of the type winsound, and only recorded on one channel. Sample Rate

was set to fs and Samples Per Trigger set to t*fs (to record for the wanted time). Trigger Type was set to

manual - meaning that it would begin when told, as opposed to other options such as triggering on a

rising edge. Trigger Repeat was set to 0, so that there was no repeat.

Out was the data recorded from the analog input.


fs = 8192; %in Hz, default sampling frequency for sound(), etc.

t=3; %in s, number of seconds to record for.

ai_length = t*fs;

% Set up MatLab Oscilloscope / Winsound Analoginput

ai = analoginput('winsound');

addchannel(ai, 1);

set(ai, 'SampleRate', fs);

set(ai, 'TriggerType', 'manual');

set(ai, 'TriggerRepeat', 0);

set(ai, 'SamplesPerTrigger', ai_length);

% Get data from the microphone

beep on;

beep;

start(ai);

trigger(ai);

data = getdata(ai);

beep;

delete(ai);

out = data; %return the audio input.

Figure 4.2: m-file code used to input audio signals.




4.3 Data Manipulation

4.3.1 Normalization

Normalization was done via the function normalizer.m. This function took in an assumed (1xn) vector

and normalized it to a maximum value of 0.5 via the code seen in Figure 4.3. It then returns the vector.

4.3.2 Windowing

Windowing was done via two functions: hamWindow.m and usefullSig.m. In Figure 4.4 one can see the

key aspects of both. The Hamming Window was created using the window function native to MatLab.

The useful signal extractor was done by running through the signal from both ends, as can be seen in

the sampled code. When the magnitude of the signal is above a threshold, this value is recorded as the

value at which to clip, minus an offset. These end values are stored in a and b, and the ends past

these values are chopped off. Note that as and bs will be set to 0 once a value is found, ensuring that no

second value will be recorded (as the if statement will always be false).
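The same idea can be written without an explicit loop. The following is only a sketch on a made-up signal: the threshold and offset values are illustrative, the clamping to the ends of the vector is an addition made here, and it assumes that at least one sample exceeds the threshold.

```matlab
% Threshold-based extraction of the useful part of a signal.
x      = [0; 0.01; 0; 0.3; -0.4; 0.2; 0; 0.01; 0; 0];   % made-up signal
thresh = 0.1;    % magnitude threshold (illustrative)
os     = 1;      % offset in samples kept on either side (illustrative)

above = find(abs(x) > thresh);          % indices of samples over the threshold
a = max(above(1)   - os, 1);            % start index, clamped to the first sample
b = min(above(end) + os, numel(x));     % end index, clamped to the last sample
useful = x(a:b);                        % everything outside a..b is cut away
```

The first and last samples above the threshold play the role of the values found by the two if statements in Figure 4.4.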

4.3.3 Cepstral Filtering

Cepstral filtering was achieved via the creation of the function cepAnal.m. Figure 4.5 shows the

important code: the pushing of the audio signal into the quefrency, creation of a mask to remove

unwanted quefrencies, then the return to the time domain.


x = 0.5*x/max(abs(x));

Figure 4.3: m-file code used to normalize.

w=window(@hamming,length(x)); x=x.*w;

____________________________________

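% Scan from both ends at once; as and bs are flags that stay 1 until the

% first sample above thresh has been found from that end.

for i=1:l

if (as && abs(x(i,1))>thresh)

a=i-os; %start index, moved back by the offset os.

as=0;

end

if (bs && abs(x(l-i+1,1))>thresh)

b=l-i+1+os; %end index, moved forward by the offset os.

bs=0;

end

end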

Figure 4.4: m-file code used to window.

c=cceps(x);

pass=int16(length(c)/6); mask=ones(length(c),1);

mask(pass:length(c)-pass,1)=mask(pass:length(c)-pass,1)-1; c=c.*mask;

x=icceps(c);

Figure 4.5: m-file code used to perform Cepstral Filtering.




Figure 4.6 is the main comparison work of the program speechRec.m. The first step in this code is to

define the size of the library. There are two main components to this: the number of library entries, and

the number of samples. For the purpose of testing this program, a library with 30 entries had been

created, with entries 1-26 corresponding to a-z, as well as the four commands "Enter", "Yes", "No", and

"Back" (27-30, respectively). There were three samples of each entry (0-2).

It was decided that the variable to store the best match would be called "c". A cmin was then established,

being the minimum acceptable value which would be recorded. If no c being returned in later stages

could beat the cmin, then it means that there was no recognizable input. A ctemp was also made to

hold returned c values temporarily. Finally, cp (p for perfect) and r were initialized. The cp value was

found through testing to be 0.7071 - that is to say that if the DTW finds a perfect match, it will return a

value of 0.7071 (see 4.4.3). The r is a variable which will hold the entry number which the current best

c falls under, and will be used in the return phase.

The algorithm itself is very simple. There are two nested for loops which will run through every sample


%Comparison loop.


%1-26 being alphabet, 27-30 being enter, yes, no, back.

numLibSam=2; %highest library sample index (ie. samples 0 to 2 of each entry).

cmin=0.05; %minimum comparison accepted.

c=cmin;








for m=1:numLibEnt

for n=0:numLibSam


Y=specCreate(x,fs);



if (ctemp<c)

c=ctemp;

r=m;

end

end

end

Figure 4.6 m-file code for comparison loops.





4.4.2 Match Matrix

The match matrix is constructed using matchMat.m as seen in Figure 4.8. The two input spectrograms

A and B are manipulated in order to gain a normalized match matrix via the method described in 3.5.2.

4.4.3 DTW

DTW.m accepts the match matrix as an argument and returns a value of match goodness. It begins by

creating the distortion matrix as described in 3.5.3, as can be seen in Figure 4.9. As discussed in 2.2, the

value in the bottom right corner of the distortion matrix can be used for classification purposes.

Unfortunately, as seen in section 5.3 there were difficulties with this method relating to the size of the

library entries and an inability to effectively normalize them. As such, the second method for

classifying match goodness discussed in 2.2 had to be used. This involved tracing through the

distortion matrix from its top left corner using the phi values (which stored whether the path of least

resistance was going right, down, or right and down in one step).

This value can then be easily normalized for varying sizes of match matrices by dividing the trace

length by the diagonal (as a perfect trace should be a line straight down the diagonal). The method of

tracing code used (see Appendix C) resulted in a perfect match returning a value, after normalization,

of 0.7071.
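One way to see where this figure may come from is the arithmetic below. It is a sketch under the assumption that the trace length of a perfect match on a square matrix equals the number of frames (one diagonal step per frame); the actual tracing code is in Appendix C and is not reproduced here.

```matlab
% Normalized trace length of a perfect (purely diagonal) trace.
sizes      = [5 14 30];                            % side lengths (illustrative)
traceLen   = sizes;                                % assumed: one step per frame
normalised = traceLen ./ sqrt(sizes.^2 + sizes.^2);
```

Under that assumption the result is 1/sqrt(2), roughly 0.7071, whatever the size of the matrix, which agrees with the value reported from testing.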


sA= sqrt(sum(A.^2));

sB = sqrt(sum(B.^2));

M = (A'*B)./(sA'*sB);

Figure 4.8: m-file code used to create match matrix.

%create matrix and variables.

for i=1:m

for j=1:n

[dmax,tb]=min([D(i,j),D(i,j+1),D(i+1,j)]); %cheapest predecessor and which one it was.

D(i+1,j+1)=D(i+1,j+1)+dmax;

phi(i,j)=tb; %remember the step taken, for the later trace.

end

end

% Tracing Code: see Appendix C.

out=out/sqrt(m^2+n^2); %divide by diagonal so that all answers are equally weighted.

Figure 4.9: m-file code used to do DTW.



Also created was an attempt to speed up the DTW.m by stopping the trace if one went "out of bounds".

As one knows that a good match will go roughly right down the centre diagonal, one can then postulate

that if the trace is running outside a certain area, it can immediately be discounted as a poor match.

Figure 4.10 visualizes this. On the left is an ideal good region in the middle [6], while on the right is a

somewhat simpler means of implementing the concept.

In the breaking code used in the function DTWTHREE.m, seen in Figure 4.11, the p value is the

vertical position of the trace, and the q is the horizontal. To implement what's seen in Figure 4.10, one

takes the vertical size of the distortion matrix (ie 14). One then assumes that a sixth of this value (ie ~2)

is how close one wants the vertical trace to remain to the centre. If the vertical value (p) is more than

this distance above or below the ideal line (the diagonal), then it is out of bounds and the trace is ended

early. The ideal vertical point p for horizontal point q is found by finding the angle of the ideal diagonal

(tan^-1(opposite/adjacent)), then finding opposite=adjacent*tan(angle) where the adjacent is q.
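The bounds check described above reduces to a comparison against a band around the diagonal. The sketch below uses the 14 x 14 example size from the text; the trace positions pInside and pOutside are hypothetical values chosen only to show one point inside and one outside the band.

```matlab
% Is a trace position (p, q) within a sixth of the height of the diagonal?
m = 14;  n = 14;                 % size of the distortion matrix (example from text)
band  = m/6;                     % allowed vertical distance from the diagonal
q     = 7;                       % horizontal position of the trace
ideal = q * m / n;               % same value as q*tan(atan(m/n))

pInside  = 8;                    % hypothetical vertical position near the diagonal
pOutside = 12;                   % hypothetical vertical position far from it
insideOK  = abs(pInside  - ideal) <= band;
outsideOK = abs(pOutside - ideal) <= band;
```

Since tan(atan(m/n)) is simply m/n, the ideal point can be computed without any trigonometric calls.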


ideal=q(1,1)*tan(atan(m/n));

ideal1=ideal+m/6;

ideal2=ideal-m/6;

if (p(1,1)>ideal1)

i=-1; %easy way to stop the while loop.

p=m*n; %Some high value.

end

if (p(1,1)<ideal2)

i=-1; %easy way to stop the while loop.

p=m*n; %Some high value.

else

Figure 4.11: m-file code used to do modified trace.

Figure 4.10: Visualizing the idea behind a faster approach.



4.4.4 Library

The library was created manually with the function libCreate.m as seen in Figure 4.12.


Results were returned at the end of speechRec via the code seen in Figure 4.13. The r value (as noted in

section 4.4) is the library entry which the input audio best matches. If r remained at 0 through

the program, this means that no library entry was close enough to warrant a match and so the output

will be 50 (the designated code number for "no recognition" as understood by Mr. Hernandez's teaching

software). If the r value matches what the teaching software told it upon starting is the correct answer,

it will output 100 (the designated code for "recognition of correct value"). If neither of these is the

case, it will return the number of the library entry which it thought it recognised.


fs=8192;

audioIn = recorder(); %get audio signal





libNum=input('Please input number to be associated with file (ie 10-999).','s'); %String.

wavwrite(audioIn,fs,['Library/lib' libNum '.wav']);

Figure 4.12: m-file code used to create library entries.

if r==0

out=50;

elseif r==in;

out=100;

else

out=r;

end

Figure 4.13: m-file code used to return results.



5 Results and Discussion

5.1 Data Collection

As seen in Figure 5.2, the recorder.m function had a response time of a little over three seconds. This is

expected, as the input was set to record for three seconds. It functioned as designed (section 4).


Figure 5.1 allows one to view the appearance of the signal as it runs through the stages of data manipulation. It

functioned as designed (section 4).

Figure 5.1: The phases of data manipulation. [Panels, each plotting Signal Magnitude against Time Axis: Normalized sound; Useful sound; Post Cepstral Filtering; Window'd (hamming) sound.]



Cutting away everything but the useful signal creates variable response times. In Figure 5.2, one can

see the times taken by the various functions (y-axis) based on the size of the input wav in kB (x-axis).

5.2.1 Normalization

normalizer.m times increased linearly with increased wav size. It functioned as designed (section 4).

5.2.2 Windowing

usefullSig.m times appeared strange due to the tested files already having been run through

usefullSig.m at their point of creation. The time of usefullSig.m in reality is always constant, and is

based on the length of the audio signal (ie 3 seconds * 8192 samples per second). It functioned as

designed (section 4).

Figure 5.2: Timing measurements for input files of differing sizes. [Panels: Times of normalizer.m, usefullSig.m, cepAnal.m, hamWindow.m and specCreate.m vs Size (x-axis: Size in kB; y-axis: t in s); Times of recorder.m for Four trials (x-axis: Trial; y-axis: t in s).]

hamWindow.m times increased linearly with increased wav size. It functioned as designed (section 4).

5.2.3 Cepstral Filtering

cepAnal.m times increased linearly with increased wav size. It functioned as designed (section 4).


5.3.1 Spectrogram

specCreate.m times increased linearly with increased wav size. It functioned as designed (section 4).

Figure 5.3 shows the spectrograms of three sounds: those of 'c', 'b', and 'w'. Note how 'c' and 'b' are

rather similar, while 'w' appears quite different. This becomes a minor problem, as when one is

testing for recognition of 'b', one will often see close matches with 'c', 'd', 'e' and other letters which

share similar sounds.

Figure 5.3: Examples of Spectrograms created by specCreate.m for letters 'c', 'b', and 'w'. [Panels, each plotting Frequency against Time: Spectrogram of Input Sound 'c'; Spectrogram of Close Library Sound 'b'; Spectrogram of Far Library Sound 'w'.]



5.3.2 Match Matrix

matchMat.m times increased linearly with increased wav size as seen in Figure 5.4. It functioned as

designed (section 4).

5.3.3 DTW

DTW.m times increased linearly with increased wav size as seen in Figure 5.4.

There were numerous challenges in the creation of DTW. Following is a truncated discussion of the

results of the early versions of DTW, as well as the changes which were made as a result. For the full

thought process (in rough formed notes) refer to Appendix B.

Figure 5.4: Timing measurements for input files of differing sizes. [Panels: Times of matchMat.m vs Size; Times of DTW.m vs Size (x-axis: Size in kB; y-axis: t in s).]



Originally, as documented in 4.4.3 and 2.2, DTW.m used the bottom right value of the distortion

matrix as the value of match goodness. Initially, an attempt was made to use the diagonal length of the match

matrix as a normalizing factor. When testing began to determine the efficiency of the program, results

were extremely poor. A constant feature noticed was the tendency for the speech recognition program

to recognize the input speech as 'w' 90% of the time [14]. It was theorized that the normalization via

division by the diagonal was not working as wanted.

Figure 5.5 from Test File 5 (see Appendix C) involved the running of all library entries through the

pattern recognition code. This guaranteed a perfect match, and if the code was working the way it was

designed to, each perfect match would return a consistent c value.

As can be seen, the c values vary wildly. The same experiment was attempted repeatedly, with different

values in place of the diagonal as an attempt to find a means of normalizing the c values. These

included:

• No normalization.

• Multiplication by diagonal.

• Division/Multiplication by area.

Figure 5.5: Results of testing audio files against themselves to see results, as well as c values returned.

x-axis is the corresponding library entry (1-3=a, 4-6=b, etc.) [Panels: Results (100=match); Value of C variable for match (y-axis scale x 10^-16).]



The results of these further tests were much like Figure 5.5.

Figure 5.6 from testFileSeven (see Appendix C) involved taking a known letter (w=23, a=1) and

running it through the pattern recognition code. This test recorded the c values returned by every

library entry sample.

One can see that for testing of w (the left of Figure 5.6) the c values vary wildly, though there is some

pattern due to the three samples for each entry having more in common with other entries than samples

(ie. entry 1 and 2 sample 1 are more similar than entry 1 sample 1 and 2) due to different people

creating them.

The minimum value is the perfect match - zero, as expected by the theory from the literature (section

2.2). The x-axis of Figure 5.6 represents the library entry samples, and 67-69 refers to w's.

In the right portion of Figure 5.6, the results for the testing of a are seen. While the perfect match is

found at a (as expected), the next best match is at w with a c value of 0.0027, despite a and w sounding

nothing alike. It should be noted that w @ 69 is also the largest file in the library.

While a perfect match will return the correct recognized character, even the slightest variation will be

beaten by w. It can therefore be surmised that the failure to find an effective way to normalize the

bottom right value of the distortion matrix is responsible for the extremely poor efficiency and

extremely high number of false positives (almost all of which are w).

Figure 5.6: Testing for known w (23) and a (1) to see what sort of c values are returned for all

library entries. x-axis is the corresponding library entry (1-3=a, 4-6=b, etc.)

Figure 5.7 is from testFileEight (see Appendix C). It allows the visualization of the c values given

when 'a' is run through the pattern recognition, plotted alongside the relative size of the library

entries.

C values are in red, sizes in blue. One can see that - other than for a perfect match - the best c values all

correspond to the largest data files. [Note that Figure 5.7 was created using area division in the DTW.]

This allows one to conclude that the reason for the "w problem" is that the larger file size needs

to be normalized in some way. A method to do this was not found, resulting in the necessity of

changing to the second method of determining match goodness as described in 2.2 and 4.4.3: the trace

back through the distortion matrix as seen in Figure 5.8.

Figure 5.7: Value of c returned for entry of a (1) in red plotted against the proportional size of the input

file. x-axis is the corresponding library entry (1-3=a, 4-6=b, etc.) [Plot title: Value of C (red) v Size(b) for 1.]

Figure 5.8: Demonstrating the workings of matchMat.m and DTW.m. Left column is the local

match matrix for perfectly, somewhat, and poorly matched input spectrograms ('b'&'b', 'b'&'c',

and 'b'&'w' respectively). To the right of these are representations of the path of least resistance

taken by the DTW block in order to find a best match value. [Panel rows: Perfectly Matching Input Specs; Somewhat Matching Input Specs; Poorly Matching Input Specs. Columns: Match Matrix (left), Quickest Path (right).]



Figure 5.9 below allows for the visualization of the returned c values using the trace method. It was

created using a (1) as the input signal to match against, and was run through the pattern recognition of

only one sample range (hence the 1-30 x-axis, as opposed to the 1-90 x-axis seen in previous figures).

One can see that with this setup, a perfect match is the value of 0.7071. One can see in Figure 5.9 that

library entries that are like 'a' are also found around this value, while library entries which are far off

(like 'w') are quite distant to this value. This greatly reduced the number of false positives being found

and cured the "w problem", coinciding with a noticed improvement in efficiency.

This method did however have some unfortunate problems, as can be seen in Figure 5.10. The problem

is simple and very hard to correct: in the portion of the pattern recognition as detailed in section 4.4

there is included a piece of code ctemp=abs(DTW(M)-cp). This is necessary to find a value close to 0

to be the best match goodness measure, so that the value can be compared with the previous best match

goodness.

Figure 5.9: the returned c values for the traceback method using a=1 as the input. [x-axis: library entry; y-axis: c value; plot title: c values for library entries.]

As can be seen in Figure 5.10, where a test similar to that of Figure 5.5 is run using the new method, passing a file that is already in the library through the pattern recognition does not always return a match! The reason for this is that DTW(M)-cp has a minimum value of 6.781186547510920e-06; it won't calculate 0 even when given a perfect match. This results in situations where more than one library entry gives this value, meaning that the recognized character is the last entry for which this happened instead of the exact match.
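One possible reading of this floor, offered as an assumption rather than something the report states, is that 0.7071 is 1/sqrt(2) rounded to four decimal places, so a perfect match can never come closer to the rounded target than the rounding error itself. The short Python sketch below (not the project's MATLAB code) computes that difference and shows how a closest-to-target search behaves when several entries tie. The names PERFECT_MATCH, CP and closest_entry, and the example c values, are hypothetical.

```python
import math

# Assumption: a perfect match normalises to 1/sqrt(2), which is quoted
# to four decimal places as 0.7071.
PERFECT_MATCH = 1 / math.sqrt(2)
CP = 0.7071  # rounded target, as used in ctemp=abs(DTW(M)-cp)

# Smallest difference a perfect match can give against the rounded target.
residual = abs(PERFECT_MATCH - CP)  # about 6.78e-06, never exactly 0


def closest_entry(values, target):
    """Return (index, difference) of the value nearest to target.

    The strict '<' keeps the first of several tied entries; using '<='
    would keep the last one instead.
    """
    best_index, best_diff = None, float("inf")
    for index, value in enumerate(values):
        diff = abs(value - target)
        if diff < best_diff:
            best_index, best_diff = index, diff
    return best_index, best_diff


# Hypothetical c values: entries 0 and 2 both look like perfect matches.
index, diff = closest_entry([PERFECT_MATCH, 0.31, PERFECT_MATCH], CP)
```

Under this reading, every entry that reaches the perfect-match value shares the same non-zero difference, so which of them is reported depends only on the loop order and the comparison operator.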

While it would have been nice to fully solve this problem, the solution to this point was deemed good

enough for the purposes of this project.

Figure 5.10: Results of testing audio files against themselves, as well as the c values returned. The x-axis is the corresponding library entry (1-3=a, 4-6=b, etc.).
[Two plots against library entry (0-100): the returned results and the returned c values.]


In 4.4.3 a method for possibly speeding up the tracing was outlined. It involved breaking out of the trace while loop if the trace ran outside certain bounds. In testFileTwenty (see Appendix C) a large number of DTW and DTWTHREE (DTW plus breaking code) runs are timed. The results can be seen in Table 5.1 for three runs of testFileTwenty.
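The bounds check can be pictured with the following Python sketch (not the project's MATLAB code). It assumes the ideal trace follows the straight line p = q*m/n through an m-by-n match matrix and that the trace is abandoned once it strays more than a fixed number of rows from that line. The function names and the tolerance of m/3 are hypothetical choices for illustration.

```python
def outside_band(p, q, m, n, tolerance):
    """True when row p lies more than `tolerance` rows from the ideal diagonal."""
    ideal = q * m / n  # equivalent to q*tan(atan(m/n))
    return abs(p - ideal) > tolerance


def trace_with_early_exit(path, m, n, tolerance):
    """Walk the (p, q) points of a traceback; stop at the first one outside the band.

    Returns (completed, points_visited).
    """
    visited = 0
    for p, q in path:
        visited += 1
        if outside_band(p, q, m, n, tolerance):
            return False, visited
    return True, visited


m = n = 14
tolerance = m / 3  # hypothetical band half-width

diagonal = [(k, k) for k in range(14, 0, -1)]   # trace of a perfect match
straying = [(p, 14) for p in range(14, 0, -1)]  # trace running straight up one column

good = trace_with_early_exit(diagonal, m, n, tolerance)
poor = trace_with_early_exit(straying, m, n, tolerance)
```

The check costs one comparison per traceback step, and a traceback through a matrix of this size has only a few dozen steps in total, so the time saved by leaving early is necessarily small.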

The addition of code to catch out of bounds appears, over an extreme sample, less efficient (or

negligibly different) than not having the code.

Figure 5.11 was created from testFileTwentyOne to help visualize this. It tests the time to run DTW and DTWTHREE when first doing a known perfect match (x=1), then a known poor match which would trigger the breaking code (x=2).

Conclusion: the increased chance of having done something wrong is not worth the negligible benefit.

Table 5.1: Results of testing for time difference between DTW (t0) and DTWTHREE (t1, with breaking code) over a large number of averaged runs.

Figure 5.11: Visualisations of the time it takes for DTW (blue) and DTWTHREE (red). Done for a known perfect match (x=1), then a known poor match which would trigger code (x=2).
[Plot "DTW in blue, DTWTHREE in red": x-axis 1 to 2, y-axis 0 to 2 x 10^-3.]



5.3.4 Library

testFileNine was a simple test of the time it takes to run the comparison algorithms vs how many samples are in the library. Figure 5.12 shows this to be a linear relationship.

Testing was also done on the two options discussed in 3.5.4, repeated here:

There were two main choices for the way in which to store the data:

1. Save the signals as Microsoft wav files using the MatLab function wavwrite, and access them using the MatLab function wavread. Convert each accessed vector into its spectrogram every time it is accessed.

2. Save the spectrograms of the signals as delimited text files using the MatLab function dlmwrite, and access them using dlmread. Convert each vector into its spectrogram matrix only once, before it is saved.

Testing produced the results seen in Table 5.2 and Figure 5.13.

Figure 5.12: Times to run pattern recognition vs size of library (defined as number of samples).
[Plot "Time to run through comparison vs number of samples in library": x-axis number of samples (0-20), y-axis t in s (0-0.12).]

Table 5.2: Average Time (T) of opening and Size (S) of wav and txt files.



Unexpectedly, it was found that option 1 not only saved on file size (as expected) but also operated much faster. This means that the function dlmread is actually slower than wavread and specCreate combined. An interesting result.


Due to constraints in finding an efficient way to do speech recognition results were rather poor, with

efficiency around 25% as seen in Table 5.3. Interestingly, once the tester was able to get into a groove

of saying the letter in a certain way (obviously matching a library entry), efficiency could spike to

100% as seen in the second w testing. Efficiency would therefore be improved with a larger library.

Figure 5.13: Results of testing comparing size (bottom) and speed (top) of stored .wav (red) vs. .txt (blue) files producing spectrograms.
[Top plot ".wav in red, .txt in blue": x-axis trial number (1-4), y-axis t in s (0-0.025). Bottom plot: x-axis trial number (1-4), y-axis size in kB (0-120).]

Table 5.3: Results of Speech Recognition. Top row is trial number, values are those returned.



6 Conclusions and Recommendations

6.1 Conclusions on Project Objectives

The efficiency of ~25% was far below the wanted efficiency of 60%. However, it was determined that

this was due to the small library size (only three samples per entry). With a larger number of samples in

the library, it can be stated with confidence that efficiency would improve.

The use of Dynamic Time Warping for speech recognition was proven to be a viable method. That said, its returns are poorer than one would hope, due in large part to the fact that it is very difficult to normalize the match goodness values.

Two interesting and unexpected discoveries were made during the course of the project. The first pertains to the creation of the library for use in DTW pattern recognition. While it was known that saving files in the Microsoft wav format would save physical space compared to saving as a delimited text file, it was assumed that not having to convert into a spectrogram after reading would give the delimited text file an edge in computational speed. However, testing showed that the function dlmread (when reading in the spectrogram) was in fact slower than the combination of wavread (reading in the recorded audio) plus specCreate.m (which converts the audio into its spectrogram).

The other discovery came from the attempt to increase the speed of the DTW by putting a check in the trace code. It was thought that by finishing once an out-of-bounds situation was reached, it might be possible to reduce the computation time. However, it was found that any speed gained from breaking early was negligible. Whether this was due to the added time of checking the if statements, or whether the process was already fast enough that any difference was statistical noise, was not explored.

6.2 Recommendations

While the Speech Recognition presented is viable, its effectiveness is below that which can be found on the market for a comparable price.




Appendix A: Computer Software Design Tools

C#

C# is a multi-paradigm programming language encompassing object-oriented (class-based)

programming disciplines. It is a Microsoft product within the .NET initiative.

Microsoft Visual C# 2008 Express Edition was used during the creation of this project (mainly the

component created by Mr. Hernandez). It was a free program with registration.

MathWorks MATLAB

MATLAB stands for "Matrix Laboratory" and is a numerical computing environment developed by

MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation of

algorithms, and interfacing with programs written in other languages. It excels in the manipulation of matrices.

A 2007 Student Edition was used during the creation of this project. It is available from Mathworks at a

cost of $99USD. [I already had a copy.]

Data Acquisition Toolbox

From Mathworks: "Data Acquisition Toolbox™ software provides a complete set of tools for analog

input, analog output, and digital I/O from a variety of PC-compatible data acquisition hardware. The

toolbox lets you configure your external hardware devices, read data into MATLAB® and Simulink®

environments for immediate analysis, and send out data."

It is available from Mathworks for $29USD.




Appendix B: Additional Testing Notes

Note: In online copy only. For the physical copy, these have been removed. If one wishes to view this code, they may contact the writer for information into such.

TESTING A
From Test File 5

If the normalization was working as I wanted it to, this should be a consistent value.

Try removing the divided by in DTW. Results:

The division is not what is causing the problem.

[Two pairs of plots against library entry (0-100): "Results (100=match)" and "Value of C variable for match".]

Want to check consistency, will return division and see if it is as before, then work on creating better system.

So, yes. Good, at least there's no funny bug. Consistent results.

Second replacement, using area instead of diagonal value.

Area and diagonal produce basically same results.


TESTING B
From testFileSeven
Testing to see what kind of c values a single letter gets.

Okay, we see 69 matching (ie good match for w).

[Plot "Value of C variable for 23": x-axis 0-90, y-axis 0-0.018.]

Look at a:

3 matches at 4.9343e-019, which is expected. But next closest is 69 at 0.0027. So here's the w problem.

[Plot "Value of C variable for 1": x-axis 0-90, y-axis 0-0.03.]

TESTING C
From testFileEight.
Want to look at correspondence between c value and file size.

Can actually see an anti-correlation. Note that this is using an area division in the DTW.

[Plot "Value of C (red) v Size(b) for 1": x-axis 0-90, y-axis 0-50.]

Going to remove this and try again.

Again, anti-correlation between file size and the c value being produced.

That is to say, bigger files are producing smaller c values. Hmmmm.

Also note that 'w's are some of the biggest files, perhaps accounting for the tendency to run to 0.


What if I multiplied instead? Here are the results if DTW multiplied by area:

We see that now we have correlation, which we don't want either.


Testing D
From testFileNine.
Simple test on time it takes to do comparison algorithms vs how many samples are in the library.

[Plot "Time to run through comparison vs number of samples in library": x-axis number of samples (0-20), y-axis t in s (0-0.12).]

Testing E
From testFileTwo
Here is the comparison between opening up the files as .wavs and converting to spectrograms and saving as .txt files.

Can see that the .wav's are, strangely, both processed faster and stored in smaller files.

Twav = 0.0046 s, Ttxt = 0.0155 s
Swav = 8.0350 kB, Stxt = 67.1000 kB

Interesting to note that the txt is proportional, but wav is not.
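As a quick check on these averages, the Python sketch below recomputes the mean file sizes from the per-trial sizes listed in Test File 2 and forms the .txt-to-.wav ratios for size and time. The variable names are hypothetical; the numbers are the ones quoted in these notes and in the test file.

```python
# Per-trial file sizes in kB, as listed in Test File 2.
txt_sizes_kb = [51.3, 111, 89.8, 16.3]
wav_sizes_kb = [6.15, 13.4, 10.2, 2.39]

# Average opening times in seconds, as quoted above.
t_wav, t_txt = 0.0046, 0.0155


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


mean_txt_kb = mean(txt_sizes_kb)  # matches Stxt = 67.1 kB
mean_wav_kb = mean(wav_sizes_kb)  # matches Swav = 8.035 kB

size_ratio = mean_txt_kb / mean_wav_kb  # .txt files are roughly 8 times larger
time_ratio = t_txt / t_wav              # .txt files take roughly 3 times longer to open
```

The size penalty of the text format is larger than its time penalty, which is consistent with the read time being driven by the amount of text that has to be parsed.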

[Top plot ".wav in red, .txt in blue": x-axis trial number (1-4), y-axis t in s (0-0.025). Bottom plot: x-axis trial number (1-4), y-axis size in kB (0-120).]

Testing F
From testFileTen.
Very similar to test file two. Going to get timings for various bits in relation to the size of the wav.

[Six plots of t (in s): "Times of nomalizer.m vs Size", "Times of usefullSig.m vs Size", "Times of capAnal.m vs Size", "Times of hamWindow.m vs Size" and "Times of specCreat.m vs Size", each against size in kB (2-14), and "Times of recorder.m for Four trials" against trial (1-4).]

Testing G
From testFileEleven

Looking at the matchMat and DTW timing for different sized wavs.

[Two plots of t (in s) against size in kB (2-14): "Times of matchMat.m vs Size" and "Times of DTW.m vs Size".]

Testing H
From testFileTwelve

Timing speechRec.m for various sizes of arrays.

Note: sample 6 of rSpec was removed as it was an extremely large size and made other results hard to read.

[Three plots against trial (0-20): "Ratio between time and size (in s/ArraySize)", "Time of speechRec.m in blue for trials (in s)" and "Size of audio in files in red (in Array Size)".]

tSpec
4.72751166698561 5.50082608264458 5.53807886197827 5.02077966613075
5.79221702758311 3.87255475207044 5.82172762180668 5.92860489252126
4.43773356669633 5.57504354603728 5.82292959021328 4.65210285106068
5.06504150667194 4.07166958370407 5.85625526428638 6.10156730813553
5.79194227199267 5.74865583474995 5.91190165230497 3.85032357464426

sSpec
10454 17240 18265 12890 20603
320 20223 20626 8183 18018
20620 10064 13402 5017 20596
20603 19634 19305 18471 3005

rSpec
0.000452220362252306 0.000319073438668479 0.000303207164630620
0.000389509671538460 0.000281134641925114 0.0121017336002201
0.000287876557474494 0.000287433573767151 0.000542311324293820
0.000309415226220295 0.000282392317663108 0.000462251873118112
0.000377931764413665 0.000811574563225847 0.000284339447673645
0.000296149459211548 0.000294995531832162 0.000297780670020717
0.000320063973380162 0.00128130568207796

Testing I
From testFileThirteen

This is comparing the c values of the three a's in the library currently.

This is for DTW multiplying by area. Ideally, when we have a perfect match, the results should be the same no matter the size of the array.

[Plot 'c value returned for letter "1"': x-axis size of array (1500-4500), y-axis c value returned for perfect match (0 to 5 x 10^-13).]

Here is the result for dividing by diagonal, and dividing by area respectively.

[Two plots against size of array (1500-4500), with y-axis scales of 10^-16 and 10^-17 respectively.]

And here is with no factoring due to size:

[Plot against size of array (1500-4500), with a y-axis scale of 10^-15.]

Testing J
From testFileFourteen

Getting c's across a certain numLibSam.

This is with DTW having no factoring:

MAJOR PROBLEM: A PERFECT MATCH SHOULD BE A PERFECT MATCH SHOULD BE A PERFECT MATCH.

[Plot 'c value returned for sample "2"': x-axis size of array (1000-7000), y-axis c value returned for perfect match (0 to 8 x 10^-15).]

Here's divided by area, multiplied by area, divided by diagonal:

[Three plots of c value returned for perfect match against size of array (0-7000), with y-axis scales of 10^-17, 10^-11 and 10^-16 respectively.]

Testing K
From testFileFifteen

Want to see if the size of the spectrograms is affecting the c. Not sure if there's a difference between array size and spectrogram size.

This is for no factoring.

[Two plots of c value returned for perfect match (y-axis scale 10^-15): against area of spec (0-14000) and against area of match matrix (0-2500).]

WONDERING IF SIZE OF SPECTROGRAM HAS TO DO WITH IT.

IF I CAN STANDARDIZE THESE, WILL IT IMPROVE?

Changed specCreate.m to have X=specgram(x);
Result:

Note: This will cause an error in matchMat.m as dimensions will no longer agree.

Here, all the match matrices are either 7*7 or 8*8.

[Two plots of c value returned for perfect match (y-axis scale 10^-14): against area of spec (0-9000) and against area of match matrix (0-70).]

Testing L
From testFileSixteen

With division:

No division:

[Two plots of c value against library entry (0-30), with y-axis scales of 10^-17 and 10^-15 respectively.]

In respect to size, no division:

Putting division back in:

[Two plots of c value, against library entry size (5-50, y-axis scale 10^-14) and against library entry (5-50, y-axis scale 10^-17).]

Crazy thought!
Do p q trace through. Take these and divide by diagonal.

With these changes, results for c by size are:

[Plot of c value (0-0.8) against library entry (5-50).]

Here I'm going to go through the tests of the previous testing data with the new DTW.

Testing A

Whoops. Not a success after all. Why does it seem to be working in testFileSixteen but not other test files?

Creating testFileSeventeen to mimic testFileSixteen except that will use DTW.m.

Here's with all perfect matches (x axis is library entry size):

[Plot of c value (0-0.8) against library entry (5-35).]

Now to test all against 'a':

Perfect match is about 0.7071.

Going to modify speechRec so that it is closest to 0.7071 rather than lowest which is match.
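The value 0.7071 is what one would expect if the measure is the number of cells on the traced path divided by the diagonal of the match matrix: for a square m-by-m matrix a perfect match runs straight down the diagonal, giving m / sqrt(m^2 + m^2) = 1/sqrt(2). The Python sketch below illustrates this under that assumption; the function name and the example sizes are hypothetical.

```python
import math


def normalised_path_length(m, n, steps):
    """Cells on the warping path divided by the diagonal of an m-by-n match matrix."""
    return steps / math.hypot(m, n)


# Square matrices of different sizes, each with a purely diagonal path.
square = [normalised_path_length(k, k, k) for k in (7, 8, 14)]

# A rectangular matrix needs at least max(m, n) cells on any complete path,
# so its value is not tied to 1/sqrt(2).
rectangular = normalised_path_length(14, 25, 25)
```

Every square case gives the same value regardless of size, which is why a self-match always lands on 0.7071, while paths through non-square matrices, or paths that wander off the diagonal, produce other values.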

[Plot of c value (0-1) against library entry (0-30).]

Results of testFileFive post modification.

Doesn't make sense. A constant c value? And all of them are 6.781186547621942e-06?

Something's not working the way I think it's working.

Why do I have outs=0? That means r is never changing, ie we never get abs(DTW(M)-cp)<cmin.

But that should be impossible as I've determined that a perfect match produces a value of 0.7071, and I'm guaranteed to get at least one perfect match, producing a c~0.

Code must not work like I'm thinking it does.

Mistake in the code caused some problems: had it as 1:num instead of 0:num!

Running through code, problem becomes apparent. 6.781186547510920e-06 is the smallest value this comparison can produce! As a result, multiple returns are all giving this value back to me.


Okay, getting shit results. Not much to do about that now.

New day, new idea.

By taking length of q or p, am I not getting the number of steps? I believe so. This makes results closer than they actually are.

Nope, this is good.

Testing B
From testFileTwenty.

Testing if the code I've written to pop out of the back trace early actually has an effect on speed.

With ideal1=ideal+m/6;

Results:
t0 = 37.1087 t1 = 37.2153
t0 = 37.0801 t1 = 38.6007
t0 = 37.1170 t1 = 37.4828

Result: As suspected, the addition of code to catch out of bounds appears, over an extreme sample, less efficient than not having the code.

Testing C
From testFileTwentyOne

Testing the results when first doing a perfect match (x=1), then a known broken one (x=2). Graphical results of five trials below. Appears to be statistical noise.


Conclusion: the increased chance of having done something wrong is not worth the negligible benefit.

Testing D

From testFileTwelve

Showing c values of 'b' matches.

[Plot "DTW in blue, DTWTHREE in red": x-axis 1 to 2, y-axis 0 to 2 x 10^-3.]

Testing Successful Recognition

Letter 1 2 3 4 5 6 7 8 9 10 %

[Three plots: c value against library entry (0-30); "size of match matrix for library entries", size of match matrix against library entry (0-30); and "c values for Size of Match Matrix", c value against size of match matrix (0-35).]

Appendix C: Code of Software Elements

Note: In online copy only. For the physical copy, these have been removed. If one wishes to view this code, they may contact the writer for information into such.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% To be used in the

% Electrical and Computer Engineering Project

% Submitted in partial fulfillment of the requirements for the degree
% of Bachelor of Engineering at McMaster University

%

% To be used in conjunction with projects of

% Chris Agam & Jon Hernandez

%

% File: cepAnal.m

% Author: Brett A. Lindsay 0648981

% Required Files:

%

% Function: cepAnal.m will perform cepstral analysis on the input

% signal in order to remove the effects of the speakers

% vocal tract.

%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = cepAnal(x)

c=cceps(x);

pass=int16(length(c)/6);

mask=ones(length(c),1);

mask(pass:length(c)-pass,1)=mask(pass:length(c)-pass,1)-1;

c=c.*mask;

x=icceps(c);

out = x;

return




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% To be used in the


% Submitted in partial fulfillment of the requirements for the degree

% of Bachelor of Engineering at McMaster University

%


% Chris Agam & Jon Hernandez
%

% File: DTW.m


% Required Files:

%

% Function: DTW.m returns the value of the quickest path through the

% local match matrix of two audio signals (M), normalised.

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = DTW(M)

M=1-M; %need to find lowest path.

[m,n]=size(M);

D=zeros(m+1,n+1); %create matrix to trace through.

D(1,:) = NaN;

D(:,1) = NaN;

D(1,1)=0;

D(2:m+1,2:n+1)=M;

phi=zeros(m,n);

for i=1:m

for j=1:n

[dmax,tb]=min([D(i,j),D(i,j+1),D(i+1,j)]);

D(i+1,j+1)=D(i+1,j+1)+dmax;

phi(i,j)=tb; %store the step taken into cell (i,j) for the traceback.
end

end

% figure,imagesc(D),colormap(gray);

i=m;j=n;p=m;q=n;

% out=0;

while i>1 && j>1

tb=phi(i,j);

if tb==1

i=i-1;

j=j-1;

elseif tb==2
i=i-1;

elseif tb==3

j=j-1;

else

break;

end

p=[i,p];

q=[j,q];




% out=out+1;

end

%portion for returning trace value.

out=0;

if (p(1,1)>1)

out=p(1,1);

else
out=q(1,1);

end

out=(out+length(p)-1)*10000;

% D=D(2:m+1,2:n+1);

% out=D(size(D,1),size(D,2));

out=out/sqrt(m^2+n^2); %divide by diagonal so that all answers are equally weighted.

end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% To be used in the




%



%

% File: DTW.m


% Required Files:

%

% Function: .
%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = DTWORIGINAL(M)


[m,n]=size(M);






%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% Attempting to create a faster working DTW by capping the trace through.

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = DTWTHREE(M)


[m,n]=size(M);

D=zeros(m+1,n+1); %create matrix to trace through.

D(1,:) = NaN;

D(:,1) = NaN;

D(1,1)=0;

D(2:m+1,2:n+1)=M;

phi=zeros(m,n);

for i=1:m

for j=1:n
[dmax,tb]=min([D(i,j),D(i,j+1),D(i+1,j)]);

D(i+1,j+1)=D(i+1,j+1)+dmax;

phi(i,j)=tb;

end

end

% figure,imagesc(D),colormap(gray);

i=m;j=n;p=m;q=n;

while i>1 && j>1

tb=phi(i,j);

if tb==1

i=i-1;

j=j-1;

elseif tb==2

i=i-1;

elseif tb==3

j=j-1;

else

break;

end

p=[i,p];

q=[j,q];

%Breaking code.

%p is vertical, q is horizontal of trace.
%Idea:
%Take the vertical size of the matrix ie 14
%third it=~4
%if the vertical value is greater or less than this distance from the
% ideal p, then it's too far out.
% ideal p=q*tan(atan(m/n));
%ie if at q=10, if p<3 or p>17, it's a poor match.

ideal=q(1,1)*tan(atan(m/n));







% Required Files: speechRec.m

%

% Function: hamWindow.m will apply a hamming window to the signal,

% before returning the data.

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = hamWindow(x)

w=window(@hamming,length(x));

x=x.*w;

out = x;

return

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% To be used in the




%


% Chris Agam & Jon Hernandez
%

% File: libCreate.m


% Required Files: recorder.m

% preEmphasis.m

% normalizer.m

% hamWindow.m

% usefullSig.m

% specCreate.m

% capAnal.m

%

% Function: libCreate.m will be used to input audio signals into a

% reference sound library for use with the speech
% recognition block of the project.

% Assumes fs=8192 Hz.

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function libCreate()

fs=8192;




audioIn = recorder(); %get audio signal

%!!!!!!

%Need to look into effectiveness of preEmphasis network.

% Seems pointless for discrete SR.

%audioIn = preEmphasis(audioIn); %pass through pre-emphasis network



%sound(audioIn,fs);


%This may need more work.


%Testing showed it was better to save these as .wav's and convert them to

% spectrograms when needed.

%audioIn=specCreate(audioIn);

%dlmwrite(['Library/test' libNum '.txt'], audioIn, 'delimiter',

% ...'\t','precision', 4);

libNum=input('Please input number to be associated with file (ie 10-999).','s');

%String.

%wavwrite(audioIn,fs,['Library/test' libNum '.wav']);

wavwrite(audioIn,fs,['Library/lib' libNum '.wav']);

%wavwrite(audioIn,fs,['Library/setUp' libNum '.wav'])

end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% To be used in the




%



%

% File: matchMat.m


% Required Files:

%

% Function: matchMat.m creates the "local match matrix".

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%




function out = matchMat(A,B)

A=abs(A); B=abs(B); %Need absolute values

%Calculates the cos of the angle between two vectors of each point in the

% matrix

%Find the average (RMS) value of the matrix, so that later when the A and

% B matrices are multiplied, we can somewhat normalise them back to
% reasonable levels.

sA= sqrt(sum(A.^2));

sB = sqrt(sum(B.^2));

Mat = (A'*B)./(sA'*sB);

out = Mat;

end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% To be used in the




%

% To be used in conjunction with projects of
% Chris Agam & Jon Hernandez

%

% File: normalizer.m



%

% Function: normalizer.m will normalize the data to 0.5, then pass it

% back.

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = normalizer(x)

x = 0.5*x/max(abs(x));

out = x;

return




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% To be used in the




%



%

% File: recorder.m



%

%
% Function: recorder.m will return a 3 second audio signal.

%

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function [out fs]= recorder()

fs = 8192; %in Hz, default sampling frequency for sound(), etc.

t=3; %in s, number of seconds to record for.

ai_length = t*fs;

% Set up MatLab Oscilloscope / Winsound Analoginput

ai = analoginput('winsound');

addchannel(ai, 1);

set(ai, 'SampleRate', fs);

set(ai, 'TriggerType', 'manual');

set(ai, 'TriggerRepeat', 0);

set(ai, 'SamplesPerTrigger', ai_length);

%Look into changing this from a manual trigger to a rising edge:

%set(ai, 'TriggerType', 'software');

%set(ai, 'TriggerCondition', 'Rising');

%set(ai, 'TriggerConditionValue', 0.01);

%set(ai, 'TriggerChannel', ai.Channel(1));

%set(ai, 'TriggerDelay', -0.1);

%set(ai, 'TriggerDelayUnits', 'seconds');
%set(ai, 'TimeOut', 10);

% Get data from the microphone

beep on;

beep;

start(ai);

trigger(ai);

data = getdata(ai);




beep;

delete(ai);

out = data; %return the audio input.

return

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%
% To be used in the




%



%

% File: specCreate.m



%

% Function: specCreate.m will create a spectrogram out of the input
% audio signal x.
% Assumes x is a (length,1) vector and fs=8192 Hz.

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = specCreate(x,fs)

%var=min(256,length(x));

%S = specgram(a,nfft,fs,window,numoverlap)

%x is the signal;

%window is the window WIDTH. ->use 512.

%noverlap = length of the window/2

%nfft=min(256,length(a)) is the default, seems good.
%fs is assumed to be 8192 Hz.

%X = specgram(x,var,fs,var,var/2);

%Simpler form has max of 8 time windowing periods, not very accurate for

%DTW (?):

% X=spectrogram(x);




X = specgram(x,512,fs,512,384);

out = X;

return

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% To be used in the
% Electrical and Computer Engineering Project



%



%

% File: speechRec.m


% Required Files: library

% recorder.m

% preEmphasis.m

% normalizer.m
% hamWindow.m

% usefullSig.m

% specCreate.m

% capAnal.m

% matchMat.m

% DTW.m

%

% Function: speechRec.m will be called by Jon Hernandez's c# program,

% which will pass in a character to be checked.

% speechRec.m will signal the user for audio input (ie.

% their answer/command), record this, process this, and test

% against a library using the method of Dynamic Time Warping.

% speechRec.m will then return information about the

% character being tested or if a command was entered.
%

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = speechRec(in)





%!!!!!!






audioIn = cepAnal(audioIn);
audioIn = hamWindow(audioIn);

%[audioIn,fs] = wavread(['Library/lib' int2str(in) int2str(1) '.wav']);


%Comparison loop.





c=cmin;


cp=0.7071*10000; %From experimental data, if the DTW block produces a value of






for m=1:numLibEnt

for n=0:numLibSam


Y=specCreate(x,fs);

M=matchMat(audioIn,Y);
ctemp=abs(DTW(M)-cp);

if (ctemp<c)

c=ctemp;

r=m;

end

end

end

%returning block.

% 1-30 - incorrect character (1-30).

% 50 - no satisfactory match.

% 100 - correct character.

if r==0
out=50;

elseif r==in;

out=100;

else

out=r;

end

return




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%

% To be used in the




%
% To be used in conjunction with projects of


%

% File: usefullSig.m



%

% Function: usefullSig.m will clip out the parts of the signal which

% are considered useful (ie. it will remove the beginning

% and end, before and after user has spoken).

% Assumes x is a (length,1) vector.

% Sensitivity (z) should be - if soft spoken, + if loud.

%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function out = usefullSig(x)

%sensitivity.

z=-1;

if z<0

s=0.4;

elseif z>0

s=0.2;

else

s=0.3;

end

%Note: Changing Threshold will dramatically change matching ability

% Would like to have more adaptive thresh.

thresh=s*max(x); %s% of maximum.

l=length(x);

%f's for os of 10,20,50,100 respectively, with length of ~24k

%f=0.0004069;

%f=0.0008138;

%f=0.002;

f=0.0041;

os=floor((1+s)*f*l); %offset.

a=0; %The lower bound.

as=1;b=0;

bs=1;

for i=1:l

if (as && abs(x(i,1))>thresh)

a=i-os;

as=0;

end




if (bs && abs(x(l-i,1))>thresh)

b=l-i+os;

bs=0;

end

end

%Without these, there is the potential to go outside the vector bounds.

if (a-os)<1
a=1;
end

if (b+os)>l

b=l;

end

% Trying to solve w problem:

% out(length(x),1)=0;

% out(a:b,1) = x(a:b,1);

out = x(a:b,1);

return




TEST FILE 1

clc;clear;close all;

fs=8192;

t=0:1/fs:3-1/fs;

audioIn=recorder();

figure,plot(t,audioIn);

sound(audioIn,fs)

%audioIn = preEmphasis(audioIn);

%figure,plot(audioIn);


figure,plot(t,audioIn);

sound(audioIn,fs)


figure,plot(audioIn);

sound(audioIn,fs)

audioIn=cepAnal(audioIn);

figure,plot(audioIn)

sound(audioIn,fs);


figure,plot(audioIn);

sound(audioIn,fs)

pause(1);

var=min(256,length(audioIn));

figure, specgram(audioIn,512,fs,512,384);

%figure, specgram(B,var,fs,var,var/2);

%figure, specgram(C,512,fs,512,384);

pause(1);close all;

TEST FILE 2

% This code is used to test the time it takes to open and convert a .wav file
% vs storing the data as spectrograms in a .txt file and reading them directly.





t=1:4;

t1=zeros(1,4);

t2=zeros(1,4);

T1=zeros(1,4);

T2=zeros(1,4);

stxt=[51.3 111 89.8 16.3];

swav=[6.15 13.4 10.2 2.39];

Stxt=zeros(1,4);Swav=zeros(1,4);

for i=1:4

tic;

C=dlmread(['Library/test00' int2str(i) '.txt']);

t2(1,i)=toc;

end

for i=1:4

tic;

[audioIn fs]=wavread(['Library/test00' int2str(i) '.wav']);

C=specCreate(audioIn,fs);

t1(1,i)=toc;

end

T1(1,:)=sum(t1)/length(t1) %#ok<NOPTS>

T2(1,:)=sum(t2)/length(t2) %#ok<NOPTS>

Swav(1,:)=sum(swav)/length(swav) %#ok<NOPTS>

Stxt(1,:)=sum(stxt)/length(stxt) %#ok<NOPTS>

figure(1), subplot(2,1,1),plot(t,t1,'r',t,t2,'b',t,T1,'--r',t,T2,'--b'),ylabel('t (in s)'),xlabel('Trial number'),title('.wav in red, .txt in blue');
subplot(2,1,2),plot(t,stxt,'b',t,swav,'r',t,Stxt,'--b',t,Swav,'--r'),ylabel('size (in kB)'),xlabel('Trial number');

TEST FILE 3


fs=8192;

A=wavread('Library/lib271.wav');

var=min(256,length(A));

A=specgram(A,var,fs,var,var/2);

B=wavread('Library/lib272.wav');

var=min(256,length(B));

B=specgram(B,var,fs,var,var/2);




M=matchMat(A,B);

min=DTW(M);

TEST FILE 4


a=1;tic;

out=speechRec(a);

toc;

TEST FILE 5

% The purpose of this test is to see if the algorithm can recognise files

% already in the system.

%

%If it can't, then the algorithm is fundamentally broken.

%

%The results will be stored in an array.

%

%Hopefully, it is all 100s.

%NOTE: Update, successful test.




out=0;in=0; %predefining.




outArray=zeros(1,numLibEnt*(1+numLibSam));

outArrayCount=1;

cArray=zeros(1,numLibEnt*(1+numLibSam));

cArrayCount=1;

for a=1:numLibEnt

for b=0:numLibSam

%Read in each file.
[x fs]=wavread(['Library/lib' int2str(a) int2str(b) '.wav']);

X=specCreate(x,fs);

%comparison loops.


c=cmin;








for m=1:numLibEnt

for n=0:numLibSam


Y=specCreate(x,fs);

M=matchMat(X,Y);


if (ctemp<c)

c=ctemp;

r=m;

end

end

end

cArray(1,cArrayCount)=c;cArrayCount=cArrayCount+1;

%returning block.




in=a;%want to see if the found r is a.

if r==0

out=50;

elseif r==in;

out=100;

else

out=r; end

outArray(1,outArrayCount)=out;

outArrayCount=outArrayCount+1;

end

end




figure(1),

subplot(1,2,1),plot(outArray),title('Results (100=match)');

subplot(1,2,2),plot(cArray),title('Value of C variable for match');

TEST FILE 6

% This file is created to run through all the files in the library and listen

% to them, for personal understanding.

numLibEnt=30; %number of library entries, 1-30 %1-26 being alphabet, 27-30 being enter, yes, no, back.


out=0;in=0; %predefining.

outArray=zeros(1,numLibEnt*(1+numLibSam));

outArrayCount=1;

for a=1:numLibEnt

for b=0:numLibSam

%Read in each file.

[x fs]=wavread(['Library/lib' int2str(a) int2str(b) '.wav']);

sound(x,fs);

end

end

TEST FILE 7

% This test is to take a look at what c values a letter will generate over

% the whole testing algorithm.








cArrayCount=1;

%File to test (w=23)

in=1;

[x fs]=wavread(['Library/lib' int2str(in) '2.wav']);

X=specCreate(x,fs);

%comparison loops.


c=cmin;





for m=1:numLibEnt

for n=0:numLibSam

[y fs]=wavread(['Library/lib' int2str(m) int2str(n) '.wav']);

Y=specCreate(y,fs);

M=matchMat(X,Y);

ctemp=DTW(M);

%Added code here:

cArray(1,cArrayCount)=ctemp;

cArrayCount=cArrayCount+1;

if ctemp<c

c=ctemp;

r=m;

end

end

end

%returning block.




if r==0

out=50;

elseif r==in;

out=100;

else

out=r;

end

figure(1),plot(cArray),title(['Value of C variable for ' int2str(in)]);




TEST FILE 8

% Here I want to see the correspondence between the size of the audio file

% and the c values returned.





cArrayCount=1;

sArray=zeros(1,numLibEnt*(1+numLibSam));

sArrayCount=1;

%File to test

in=1;


X=specCreate(x,fs);

%comparison loops.


c=cmin;


r=0; %r is the variable for which is the current lowest match c. %if r stays as 0, we therefore never achieved a c lower than


for m=1:numLibEnt

for n=0:numLibSam


Y=specCreate(y,fs);

M=matchMat(X,Y);

ctemp=DTW(M);

%Added code here:

cArray(1,cArrayCount)=ctemp;

cArrayCount=cArrayCount+1;

sArray(1,sArrayCount)=size(y,1);

sArrayCount=sArrayCount+1;

if ctemp<c

c=ctemp;

r=m;

end

end

end




%returning block.




if r==0

out=50;

elseif r==in;

out=100;

else
out=r;

end

% figure(1),subplot(1,2,1),plot(cArray,'r'),title(['Value of C (red) v Size(b) for ' int2str(in)]);

% subplot(1,2,2),plot(sArray,'b');

figure(1),plot(cArray,'r'),title(['Value of C (red) v Size(b) for ' int2str(in)]);

hold on;plot(sArray,'b');hold off;

TEST FILE 9

% Here, this is going to be a test for how much time a certain number of

% iterations of the library will take (ie if the library has x samples)

% in regards to the comparison block.




mod=20; %number of times to modify the run through.

timeArray=zeros(1,mod);

%File to test

in=1;


X=specCreate(x,fs);


c=cmin;








for m=1:numLibEnt

for l=1:mod

tic;

for n=0:numLibSam


Y=specCreate(y,fs);

M=matchMat(X,Y);ctemp=DTW(M);

if ctemp<c

c=ctemp;

r=m;

end

end

t=toc;

t=t/3; %going to take an average for better results.

if l==1

timeArray(1,l)=t;

else

timeArray(1,l)=t+timeArray(1,l-1);

end

end

end

figure(1),plot(timeArray),title('Time to run through comparison vs number of samples in library'),xlabel('Number of samples'),ylabel('t (in s)');

TEST FILE 10

% Doing timings for varying sizes of wav files.


swav=[6.15 13.4 10.2 2.39]; %sizes of test wav files.

numSam=4;

numRun=20;

tNorm(1,numSam)=0;

tUsef(1,numSam)=0;

tCepA(1,numSam)=0;

tHamW(1,numSam)=0;

tSpec(1,numSam)=0;

tReco(1,numSam)=0;




for i=1:numSam

[x fs]=wavread(['Library/test00' int2str(i) '.wav']);

t=0;

for m=1:numRun

tic;

xp=normalizer(x);

t=t+toc; end

t=t/numRun;

tNorm(1,i)=t;

figure(1),subplot(3,2,2),stem(swav,tNorm),title('Times of normalizer.m vs Size'),xlabel('Size in kB'),ylabel('t (in s)');

t=0;

for m=1:numRun

tic;

xp=usefullSig(x);

t=t+toc;

end

t=t/numRun;

tUsef(1,i)=t;

figure(1),subplot(3,2,3),stem(swav,tUsef),title('Times of usefullSig.m vs


t=0;

for m=1:numRun

tic;

xp=cepAnal(x);

t=t+toc;

end

t=t/numRun;

tCepA(1,i)=t;

figure(1),subplot(3,2,4),stem(swav,tCepA),title('Times of cepAnal.m vs


t=0;

for m=1:numRun

tic;

xp=hamWindow(x);

t=t+toc;

end

t=t/numRun;

tHamW(1,i)=t;

figure(1),subplot(3,2,5),stem(swav,tHamW),title('Times of hamWindow.m vs


t=0;
for m=1:numRun

tic;

xp=specCreate(x,fs);

t=t+toc;

end

t=t/numRun;

tSpec(1,i)=t;

figure(1),subplot(3,2,6),stem(swav,tSpec),title('Times of specCreate.m vs





t=0;

for m=1:numRun

tic;

xp=recorder();

t=t+toc;

end

t=t/numRun;
tReco(1,i)=t;
figure(1),subplot(3,2,1),stem([1 2 3 4],tReco),title('Times of recorder.m for Four trials'),xlabel('Trial'),ylabel('t (in s)');

end
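The tic/toc averaging block is repeated for every function timed in this test. As a sketch only, the pattern could be factored into one helper that takes a function handle. The helper name avgRunTime is hypothetical and is not part of the project code.

```matlab
function tAvg = avgRunTime(fn, numRun)
%avgRunTime  Hypothetical helper: mean wall-clock time of fn() over numRun calls.
%   fn     : function handle taking no arguments, e.g. @() normalizer(x)
%   numRun : number of repetitions to average over
t = 0;
for m = 1:numRun
    tic;
    fn();            % run the code under test; any result is discarded
    t = t + toc;     % accumulate the elapsed time of this run
end
tAvg = t / numRun;   % average time per call, in seconds
end
```

With such a helper, each block in the loop would reduce to a single line, for example tNorm(1,i)=avgRunTime(@() normalizer(x), numRun);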

TEST FILE 11

% Doing timing for matchMat and DTW for various sizes of wavs.


swav=[6.15 13.4 10.2 2.39]; %sizes of test wav files.

numSam=4;

numRun=20;

tMatc(1,numSam)=0;tDTW(1,numSam)=0;

for i=1:numSam

[x fs]=wavread(['Library/test00' int2str(i) '.wav']);

X=specCreate(x,fs);

t=0;

for m=1:numRun

tic;

XP=matchMat(X,X);

t=t+toc;

end

t=t/numRun;tMatc(1,i)=t;

figure(1),subplot(2,1,1),stem(swav,tMatc),title('Times of matchMat.m vs


t=0;

for m=1:numRun

tic;

a=DTW(XP);




t=t+toc;

end

t=t/numRun;

tDTW(1,i)=t;

figure(1),subplot(2,1,2),stem(swav,tDTW),title('Times of DTW.m vs


end

TEST FILE 12

% Doing timing for speechRec.m
function testFileTwelve()

numTri=20; %number of trials.

t=0;

tSpec(1,numTri)=0;

sSpec(1,numTri)=0;

for m=1:numTri

t=0;

tic;

[out fileSize]=speechRecTest(5);

t=toc;

tSpec(1,m)=t;

sSpec(1,m)=fileSize;
end
rSpec=tSpec./sSpec;
figure(1);
subplot(3,1,1),stem(tSpec,'r'),title('Time of speechRec.m in red for trials (in s)');
subplot(3,1,2),stem(sSpec,'b'),title('Size of audio in files in blue (in Array Size)');
subplot(3,1,3),stem(rSpec,'g'),title('Ratio between time and size (in s/ArraySize)');

t=0;%So I can stop debugger.

end

function [out fileSize]=speechRecTest(in)





%!!!!!!






%ADDED CODE HERE TO GET SIZE

fileSize=size(audioIn,1);




%Comparison loop.





c=cmin;





for m=1:numLibEnt

for n=0:numLibSam


Y=specCreate(x,fs);


ctemp=DTW(M);

if ctemp<c

c=ctemp;

r=m; end

end

end

%returning block.




if r==0

out=50;

elseif r==in;

out=100;

else
out=r;

end

end




TEST FILE 13

%This is comparing the c values of the three a's in the library currently.

letter=1; %which letter to compare.


s(1,(1+numLibSam))=0;

c(1,(1+numLibSam))=0;

for m=0:numLibSam

[x fs]=wavread(['Library/lib' int2str(letter) int2str(m) '.wav']);

X=specCreate(x,fs);

s(1,m+1)=size(x,1);

M=matchMat(X,X);c(1,m+1)=DTW(M);

end

figure(1), stem(s,c),ylabel('c value returned for perfect match'),xlabel('size of array'),
title(['c value returned for letter "' int2str(letter) '"']);

TEST FILE 14

% Testing values of c's across libSamples.




s(1,(numLibEnt))=0;

c(1,(numLibEnt))=0;

for m=1:numLibEnt

[x fs]=wavread(['Library/lib' int2str(m) int2str(numLibSam) '.wav']);

X=specCreate(x,fs);

s(1,m)=size(x,1);

M=matchMat(X,X);




c(1,m)=DTW(M);

end

figure(1), stem(s,c),ylabel('c value returned for perfect match'),xlabel('size of array'),
title(['c value returned for sample "' int2str(numLibSam) '"']);

TEST FILE 15

% Testing values of c's across libSamples, now comparing with size of

% spectrograms. C's will be for perfect match.




aX(1,(numLibEnt))=0;

aM(1,(numLibEnt))=0;

c(1,(numLibEnt))=0;

for m=1:numLibEnt

[x fs]=wavread(['Library/lib' int2str(m) int2str(numLibSam) '.wav']);

X=specCreate(x,fs);

[mX,nX]=size(X);

aX(1,m)=mX*nX;

M=matchMat(X,X);

[mM,nM]=size(M);

aM(1,m)=mM*nM;

c(1,m)=DTW(M);

end

figure(1), subplot(2,1,1),stem(aX,c),ylabel('c value returned for perfect match'),xlabel('area of spec'),
title(['c value returned for sample "' int2str(numLibSam) '"']);
subplot(2,1,2),stem(aM,c),ylabel('c value returned for perfect match'),xlabel('area of match matrix');




TEST FILE 16

%Testing bits of Dr. Ellis' code.





s(1,(numLibEnt))=0;

c(1,(numLibEnt))=0;

for m=1:numLibEnt

[d1,sr] = wavread(['Library/lib' int2str(1) int2str(numLibSam) '.wav']);

[d2,sr] = wavread(['Library/lib' int2str(m) int2str(numLibSam) '.wav']);

% Listen to them together:

ml = min(length(d1),length(d2));
soundsc(d1(1:ml)+d2(1:ml),sr)

% or, in stereo

soundsc([d1(1:ml),d2(1:ml)],sr);

D1 = specgram(d1,512,sr,512,384);

D2 = specgram(d2,512,sr,512,384);

SM = matchMat(D1,D2);

figure(1)

subplot(121)

imagesc(SM)

colormap(1-gray)

[p q C cp]=DTWTWO(1-SM);

hold on; plot(q,p,'r'); hold off

subplot(122)

imagesc(C)

hold on; plot(q,p,'r'); hold off

% c(1,m)=C(size(C,1),size(C,2))/(size(C,1)*size(C,2));

c(1,m)=cp;

s(1,m)=size(C,1);

end

figure(2),stem(s,c),xlabel('library entry'),ylabel('c value'),title('c values for library entries');




TEST FILE 17

%Testing my code in style of testFileSixteen





s(1,(numLibEnt))=0;

c(1,(numLibEnt))=0;

for m=1:numLibEnt

[d1,sr] = wavread(['Library/lib' int2str(2) int2str(numLibSam) '.wav']);

[d2,sr] = wavread(['Library/lib' int2str(m) int2str(numLibSam) '.wav']);

% % Listen to them together:

% ml = min(length(d1),length(d2));

% soundsc(d1,sr)

% % or, in stereo

% soundsc(d2,sr);

D1 = specgram(d1,512,sr,512,384);

D2 = specgram(d2,512,sr,512,384);

M = matchMat(D1,D2);

cp=DTW(M);

c(1,m)=cp;

s(1,m)=size(M,2);

end

e(1,30)=0;
for m=1:30

e(1,m)=7071;

end

figure(1),subplot(3,1,1),stem(e,'r'),
hold on, stem(c,'b'),xlabel('library entry'),ylabel('c value'),title('c values for library entries'); hold off;
subplot(3,1,2),stem(s,'b'),xlabel('library entry'),ylabel('Size of Match Matrix'),title('size of match matrix for library entries');
subplot(3,1,3), stem(e,'r'),
hold on,stem(s,c,'b'),xlabel('Size of Match Matrix'),ylabel('c value'),title('c values for Size of Match Matrix'); hold off;




TEST FILE 18

%For getting pictures.


[audioIn fs]= recorder();% sound(audioIn,fs)

figure(1),plot(audioIn),xlabel('Time Axis'),ylabel('Signal Magnitude'),title('Input sound');


figure(2),subplot(2,2,1),plot(audioIn),xlabel('Time Axis'),ylabel('Signal Magnitude'),title('Normalized sound');



Magnitude'),title('Useful sound');



Magnitude'),title('Post Cepstral Filtering');

audioIn = hamWindow(audioIn);
figure(2),subplot(2,2,4),plot(audioIn),xlabel('Time Axis'),ylabel('Signal Magnitude'),title('Window`d (hamming) sound');

sound(audioIn,fs)

close all;

[y fs]=wavread(['Library/lib31.wav']);

figure(3),subplot(3,1,1),specgram(y,512,fs,512,384),title('Spectrogram of Input Sound `c`');
Y=specCreate(y,fs);
[x fs]=wavread(['Library/lib21.wav']);
subplot(3,1,2),specgram(x,512,fs,512,384),title('Spectrogram of Close Library Sound `b`');
X=specCreate(x,fs);
[z fs]=wavread(['Library/lib231.wav']);
subplot(3,1,3),specgram(z,512,fs,512,384),title('Spectrogram of Far Library Sound `w`');
Z=specCreate(z,fs);
MP=matchMat(X,X);
figure(4),subplot(3,2,1),imagesc(MP),colormap(1-gray),title('Perfectly Matching Input Specs');

[p q CP cp]=DTWTWO(1-MP);

subplot(3,2,2),imagesc(CP);hold on; plot(q,p,'r'); hold off;title('Match Matrix(left), Quickest Path (right)');

M=matchMat(X,Y);

subplot(3,2,3),imagesc(M),colormap(1-gray),title('Somewhat Matching Input Specs');

[p q C c]=DTWTWO(1-M);

subplot(3,2,4),imagesc(C);hold on; plot(q,p,'r'); hold off




MO=matchMat(X,Z);

subplot(3,2,5),imagesc(MO),colormap(1-gray),title('Poorly Matching Input Specs');

[p q CO co]=DTWTWO(1-MO);

subplot(3,2,6),imagesc(CO);hold on; plot(q,p,'r'); hold off

TEST FILE 19

%To test the newer versions of DTW.
clc;clear;

[y fs]=wavread(['Library/lib31.wav']);

Y=specCreate(y,fs);

[x fs]=wavread(['Library/lib21.wav']);

X=specCreate(x,fs);


Z=specCreate(z,fs);

M=matchMat(X,Z);

imagesc(M),colormap(1-gray),title('Poorly Matching Input Specs');

c=DTWTHREE(M);

TEST FILE 20

%Will test if there's a time difference between DTW.m and DTWTHREE.m

%DTWTHREE has the breaking code.

clc;clear;








c=cmin;


cp=0.7071*10000; %From experimental data, if the DTW block produces a value of



r=0;

tic

for a=1:numLibEnt

for b=1:numLibSam

[audioIn,fs] = wavread(['Library/lib' int2str(a) int2str(b) '.wav']);


for m=1:numLibEnt

for n=0:numLibSam


Y=specCreate(x,fs);



if (ctemp<c)

c=ctemp;

r=m;

end

end

end

end

end

t0=toc %#ok<NOPTS>

tic

for a=1:numLibEnt

for b=1:numLibSam
[audioIn,fs] = wavread(['Library/lib' int2str(a) int2str(b) '.wav']);


for m=1:numLibEnt

for n=0:numLibSam


Y=specCreate(x,fs);


ctemp=abs(DTWTHREE(M)-cp);

if (ctemp<c)

c=ctemp;

r=m;

end
end

end

end

end

t1=toc %#ok<NOPTS>




TEST FILE 21

%This tests more specific differences between DTW.m and DTWTHREE.m

clc;clear;

a=rand %#ok<NOPTS>

[x fs]=wavread(['Library/lib21.wav']);
X=specCreate(x,fs);


Z=specCreate(z,fs);

L=30000000;

t0(1,2)=0;

t1(1,2)=0;

for m=L;

M=matchMat(X,X);tic

c=DTW(M);

t0(1,1)=toc+t0(1,1);

end

for m=L;

M=matchMat(X,X);

tic

c=DTWTHREE(M);

t1(1,1)=toc+t1(1,1);

end

for m=L;

M=matchMat(X,Z);

tic

c=DTW(M);

t0(1,2)=toc+t0(1,2);

end

for m=L;

M=matchMat(X,Z);

tic

c=DTWTHREE(M);

t1(1,2)=toc+t1(1,2);

end

t0%#ok<NOPTS>

t1%#ok<NOPTS>

stem([1,2],t0,'b'),hold on, stem([1,2],t1,'r'),title('DTW in blue, DTWTHREE in red')




References

[1] Chiba, S., and Sakoe, H., “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol 26, pp. 43-49.

[2] Dumitru, C.O. and Gavat, I., “Vowel, digit and continuous speech recognition based on statistical, neural and hybrid modelling by using ASRS_RL,” EUROCON 2007 - The International Conference on Computer as a Tool, pp. 856-863, September 2007.

[3] Ellis, Dan. "Dynamic Time Warp (DTW) in Matlab." Dan Ellis's Home Page (Columbia University Electrical Engineering). Web. http://www.ee.columbia.edu/~dpwe/resources/matlab/dtw/

[4] Flanagan, JL. Speech Analysis: Synthesis & Perception. New York: Academic In., 1965. Print.

[5] Fry, DB. The Physics of Speech. Cambridge: Cambridge UP, 1979. Print.

[6] Gold, B., and N. Morgan. Speech and Audio Signal Processing. John Wiley & Sons Inc., 2000. Print.

[7] Hart, P. "Voice recognition: what all the talk is about," Telecommunication, vol 29, no 7, July 1995.

[8] Jawed, F., Muzaffar, F. et al. “DSP implementation of voice recognition using dynamic time warping algorithm,” 2005 Student Conference on Engineering Sciences and Technology, SCONEST. Karachi, Pakistan, 2005.

[9] Kale, Kaustubh R. "Dynamic Time Warping." Computational NeuroEngineering Lab at the University of Florida. Web. http://www.cnel.ufl.edu/~kkale/dtw.html

[10] Mrvaljevic, N. and Ying, S. “Comparison between speaker dependent mode and speaker independent mode for voice recognition,” Bioengineering, Proceedings of the Northeast Conference, Boston, United States of America, April 2009.

[11] Nelson, B. and Runger, G., “Predicting processes when embedded events occur: Dynamic time warping,” Journal of Quality Technology, vol 35, no 2, pp. 213-226, April 2003.

[12] National Federation of the Blind, “Braille readers are leaders,” [Online] 2009. Available: http://www.nfb.org/nfb/Braille_coin.asp [Accessed: Oct. 7 2009]

[13] The MathWorks Store, [Online] 2009. Available: http://www.mathworks.com/store/ [Accessed: Oct. 4 2009]

[14] Lindsay, B. "4BI6 Group 13 Logbook," 2009-2010.




VITA

NAME: Brett Lindsay
PLACE OF BIRTH: Burlington, Ontario, Canada
YEAR OF BIRTH: 1988
SECONDARY EDUCATION: Lord Elgin High School (2002-2004)
Robert Bateman High School (2004-2006)
UNDERGRAD EDUCATION: McMaster University (2006-2010)
HONOURS and AWARDS: Queen Elizabeth II Aiming for the Top Scholarship 2006
McMaster Entrance Scholarship
Smurfit-Stone Scholarship 2006, 2007, 2008, 2009
Dean’s Honour List 2007, 2009
