  • MASARYK UNIVERSITY
    FACULTY OF INFORMATICS

    Octave System Sound Processing Library

    Bachelor's Thesis

    Lóránt Oroszlány

    Brno, spring 2012

  • Declaration

    Hereby I declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

    Lóránt Oroszlány

    Advisor: Mgr. Luděk Bártek, Ph.D.


  • Acknowledgement

    I would like to thank my supervisor Mgr. Luděk Bártek, Ph.D. for his support and guidance, which helped me write this thesis.


  • Abstract

    The aim of this bachelor's thesis is to introduce the reader to the basics of digital sound processing and to describe a few selected methods used in the field of speech processing. Furthermore, it evaluates the usability of four popular mathematical software packages in this area and compares them. As the elaborated speech processing methods are not directly implemented in these programs, the practical part of this thesis consists of their implementation as a library of functions for the open-source Matlab alternative, GNU Octave.


  • Keywords

    DSP, digital signal processing, audio processing, speech processing, numerical computation, computer algebra system, CAS, octave, matlab, maple.


  • Contents

    1 Introduction
      1.1 Organization of this thesis
      1.2 Digital audio processing
        1.2.1 Digital representation of sound
        1.2.2 Working with audio signals
        1.2.3 Practical application
    2 Analysis of speech
      2.1 The source/system speech production model
      2.2 Perception of sound
        2.2.1 Pitch perception
        2.2.2 Loudness
      2.3 Short-time analysis
        2.3.1 Window functions
      2.4 Time-domain analysis
      2.5 Frequency-domain analysis
        2.5.1 Fourier transform
        2.5.2 Discrete Fourier transform
        2.5.3 Short-time Fourier transform
      2.6 Linear predictive analysis
        2.6.1 Perceptual linear prediction
      2.7 Cepstral analysis
        2.7.1 Mel-frequency cepstral coefficients
    3 Sound analysis with mathematical software
      3.1 Matlab
      3.2 GNU Octave
      3.3 Maple
      3.4 Wolfram Mathematica
      3.5 Comparison
        3.5.1 Plotting
    4 The speech package for GNU Octave
      4.1 Installation
      4.2 Input handling
      4.3 Function reference
        4.3.1 The sti(), ste() and zcr() functions
        4.3.2 The stf() function
        4.3.3 The stacf() function
        4.3.4 The lpc_cov() function
        4.3.5 The strceps() and stcceps() functions
        4.3.6 The plp() and mfcc() functions
        4.3.7 Complementary functions
      4.4 Known issues
    5 Final thoughts

  • Chapter 1

    Introduction

    1.1 Organization of this thesis

    Chapter 1 introduces the concepts and notions of the subject matter of this thesis

    Chapter 2 covers the theory of the methods utilized in the later parts of the thesis

    Chapter 3 explores the existing possibilities of analyzing audio signals using mathematical software

    Chapter 4 introduces the speech package for GNU Octave created as the practical part of the thesis and provides a discussion of issues experienced during its development

    Chapter 5 presents the conclusions reached and final thoughts on the topic

    1.2 Digital audio processing

    The word audio comes from Latin ("I hear") and refers to the perception of sound. Sound is an acoustic wave (an oscillation of pressure propagating through a medium) which is composed of frequencies within the hearing range (about 20 Hz to 20 kHz for an average human)[1]. It can be represented in the form of signals (a signal is a function of independent variables that carries some information[2]). Signal processing is a very wide field of study devoted to the analysis of and operations on signals, be they continuous, discrete, periodic, aperiodic, acoustic or electromagnetic. This thesis focuses on the analysis of discrete-time audio signals.

    Figure 1.1: Waveform of the spoken sequence "I am prepared", encoded with LPCM coding, visualized using Octave

    1.2.1 Digital representation of sound

    Natural sound perceived by human ears is analog: continuous in time and defined with infinite precision[3]. Signals with these properties, however, cannot be represented exactly in a digital, implicitly discrete environment. To be able to work with them in such an environment, they need to be converted to a suitable format. Analog-to-digital (A/D) conversion is done using the methods of sampling and quantization. There are different ways of storing the resulting data; the most fitting for our purpose is linear pulse-code modulation (LPCM). A digital audio signal encoded with LPCM is an array of values which correspond to the magnitude of the original signal sampled at equal intervals of time and quantized to the nearest value within a digital step. The two main characteristics of a sound wave encoded in LPCM are its sample rate and its bits-per-sample ratio (not to be confused with bit rate, which means the number of bits per second), which determine the quality of the representation. According to the Nyquist-Shannon sampling theorem, the sampling rate (the number of samples per second) must be at least twice the frequency of the highest frequency component in the sampled signal; otherwise aliasing occurs and faithful recreation of the signal is no longer possible[4]. A higher bits-per-sample value allows more precision for storing the magnitude of the signal sampled at a specific time, making the recreated signal less distorted and more similar to the sampled signal. Exact recreation of the sampled analog signal from discrete samples is not possible. Figure 1.1 shows the waveform of a sound sequence encoded with LPCM with a sample rate of 44100 Hz and a 16-bit sample size. LPCM is used by most of the popular non-compressed audio formats (CD audio, WAV, AIFF), and is used exclusively throughout this thesis.
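    As a small illustration of the points above, the following Octave sketch loads an LPCM-coded WAV file and prints its basic parameters. The file name speech.wav is a placeholder; wavread() was the reading function available in Octave releases of that period, later superseded by audioread().

        % Load an LPCM-coded WAV file and inspect its parameters.
        [s, fs, bits] = wavread ("speech.wav");   % samples, sample rate, bits per sample
        printf ("sample rate: %d Hz, bit depth: %d bits\n", fs, bits);
        printf ("highest representable (Nyquist) frequency: %d Hz\n", fs / 2);
        printf ("duration: %.2f s, channels: %d\n", rows (s) / fs, columns (s));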

    1.2.2 Working with audio signals

    Due to the properties of LPCM encoding, discrete-time signal processing methods are easily applicable to data encoded in this manner. Most of these methods consist of the application of various mathematical formulas over the array of values represented by the LPCM-coded data. While the majority of programming languages are fully capable of realizing these types of calculations, there is mathematical software developed specifically for the purpose of solving technical computational problems. Two basic types of such software differ in their approach to solving these problems: computer algebra systems manipulate mathematical expressions in symbolic form, while numerical computing systems focus on performing numerical algorithms. Chapter 3 is devoted to elaborating the usefulness and comparing the power of these products in terms of signal processing, more specifically sound analysis.

    1.2.3 Practical application

    Practical applications of sound processing vary on a wide scale, including storage, level compression, data compression, transmission, enhancement and speech processing[5]. This thesis focuses on the methods used for speech processing, specifically those for the short-time analysis of speech signals in the time and frequency domains. The extraction of information from the results of these analyses is the subject of speech recognition, speaker recognition and voice analysis and is not included in the scope of this thesis. Speech processing itself is a very wide topic, comprising among others speech coding, synthesis and enhancement, and is covered in depth in [6], [7].


  • Chapter 2

    Analysis of speech

    This chapter is a short overview of the theoretical background of the sound analysis methods utilized or referenced in later parts of this thesis. All the information in this chapter is drawn from [4], [6] and [7], except where noted otherwise.

    2.1 The source/system speech production model

    Many of the algorithms elaborated later in this chapter assume the source/system speech production model, which consists of an excitation generator and a time-varying linear system (see Figure 2.1). The excitation generator simulates the modes of sound generation in the vocal tract. Voiced sounds are excited by periodic pulses of air pressure, the frequency of which determines the perceived pitch of the sound. Unvoiced sounds are excited by random white noise. The linear system simulates the frequency shaping of the vocal tract tube at a specific time. The parameters of the linear system change in time at a much slower rate than the time variations of the speech waveform, allowing us to find them by analyzing short segments of the signal. Since the vocal tract changes shape slowly, it can be viewed as time-invariant over intervals on the order of 10 ms, depending on the speaker. Although there are more sophisticated models of speech production, this model is sufficient for most applications in speech processing[7].

    Figure 2.1: The source/system model for a speech signal [7]

    2.2 Perception of sound

    This section briefly introduces a few psychoacoustic phenomena utilized in later parts of this thesis, without going into the physiological details of the human auditory system. A detailed description of the matter can be found in [6], [7].

    2.2.1 Pitch perception

    Sounds that have a periodic structure on short time intervals are perceived as having a subjective quality called pitch. The relation of pitch to the fundamental frequency of the sound waveform was empirically determined and can be approximated by the following equation:

        m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) ,   (2.1)

    where m is the perceived pitch in [mel] and f is the frequency in [Hz]. Mel is the unit of subjective pitch. By definition, a signal with a frequency of 1000 Hz and a loudness level of 40 phon has a pitch of 1000 mel. A sound perceived as having twice the subjective pitch of a reference sound also has twice its pitch value in mels. Figure 2.2 shows the non-linear relation of the pitch in [mel] to the logarithm of the fundamental frequency of the sound in [Hz].

    Figure 2.2: Relation of subjective pitch of sound to its frequency

    Another psychoacoustic phenomenon utilized in speech analysis is the concept of critical bands. Due to the structure of the basilar membrane in the inner ear, when listening to a pure tone with a certain frequency f, noise outside of a frequency band around the central frequency f does not have an impact on the sensation of the tone. The width of the critical band depends on the central frequency f. This effect can be represented as the application of a set of band-pass filters to the sound signal. The frequency unit bark was introduced to capture the idea of this phenomenon; the width of a critical band is roughly 1 bark at any frequency. The relation of the frequency in [Hz] to the frequency in [bark] is described by the following equation:

        \Omega = 6 \ln\left( \frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1} \right) .   (2.2)
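    A direct transcription of equations (2.1) and (2.2) into Octave might look as follows; these two helpers are illustrative re-implementations, not the hz2mel() and hz2bark() functions shipped with the speech package described in chapter 4:

        % Frequency in Hz to subjective pitch in mel, equation (2.1).
        function m = hz_to_mel (f)
          m = 2595 * log10 (1 + f / 700);
        endfunction

        % Frequency in Hz to critical-band rate in bark, equation (2.2).
        function b = hz_to_bark (f)
          b = 6 * log (f / 600 + sqrt ((f / 600) .^ 2 + 1));
        endfunction

    For example, hz_to_mel (1000) returns approximately 1000, in agreement with the definition of the mel scale given above.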

    2.2.2 Loudness

    Loudness is a subjective measure, not to be confused with objective measures like sound pressure or sound intensity. The perceived loudness is related to sound pressure, duration and frequency. Equal loudness curves describe the relation between the perceived loudness and the sound pressure of a pure tone over the audible frequency spectrum. These curves are defined by the international standard ISO 226:2003, based on several modern experimental determinations[8]. The unit of loudness level for pure tones is the phon; a pure tone has a loudness level of 1 phon if it is perceived as loud as a 1 kHz tone with a sound pressure level of 1 dB. Another unit, the sone, is also used for capturing the loudness phenomenon, but it is not used in this thesis.


    Figure 2.3: Equal loudness curves defined by the revised standard ISO 226:2003 (the blue line shows the original ISO standard for 40 phons)[9]

    2.3 Short-time analysis

    2.3.1 Window functions

    For reasons discussed in section 2.1, the analysis of speech mostly relies on the analysis of short segments of the speech waveform, called microsegments. Most of the short-time analysis methods can be described by the relation

        Q_n = \sum_{k=-\infty}^{\infty} T[s(k)]\, w(n-k) ,   (2.3)

    where s(k) is the value of the analyzed LPCM-coded signal at time k, T(\cdot) is the transformation used and w(n) is a weighting function or window function. A window function is a mathematical function that is zero-valued outside of some chosen interval. Its purpose is to select a segment in the neighborhood of a central sample and optionally to suppress the effect of the marginal samples of the analyzed segment. Many types of differently shaped windows exist. In speech processing the two most commonly used are the rectangular and the Hamming window. The rectangular window does not do any weighting, it only selects the samples included in the microsegment. Its value is 1 inside the chosen interval and 0 outside of it. The Hamming window is defined by the following relation:

        w(n) = \begin{cases} 0.54 - 0.46 \cos\left(\frac{2\pi n}{L-1}\right) & \text{for } 0 \le n \le L-1 \\ 0 & \text{otherwise,} \end{cases}   (2.4)

    where L is the length of the window in samples. The window length in seconds is L divided by the sampling frequency of the analyzed signal. Figure 2.4 illustrates the selection of microsegments using the Hamming window.

    Figure 2.4: Example of the application of the Hamming window to a speech waveform
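    A minimal Octave sketch of equation (2.4), applied to one microsegment of a signal vector s; the frame length L and start index n0 are arbitrary illustrative values, and the signal package also provides a ready-made hamming() function:

        % Hamming window of length L, a direct transcription of equation (2.4).
        L = 400;                                  % e.g. 25 ms at a 16 kHz sample rate
        n = (0:L-1)';
        w = 0.54 - 0.46 * cos (2 * pi * n / (L - 1));

        % Select and weight one microsegment of the (column) signal vector s.
        n0 = 1;                                   % first sample of the frame
        frame = s(n0:n0+L-1) .* w;                % windowed microsegment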

    2.4 Time-domain analysis

    Short-time energy (STE), short-time intensity (STI) and short-time zero-crossing rate (STZCR) are three basic functions used in speech processing to estimate the parameters of the excitation signal in the source/system production model. The short-time energy function is defined as

        E_n = \sum_{k=-\infty}^{\infty} [s(k)\, w(n-k)]^2 ,   (2.5)

    and gives information about the average energy of the signal in the microsegment. The short-time intensity, defined as

        M_n = \sum_{k=-\infty}^{\infty} |s(k)|\, w(n-k) ,   (2.6)

    represents the same quality of the signal, but it is less sensitive to abrupt changes in the amplitude of the analyzed signal because it uses the absolute value instead of raising to the second power. The short-time zero-crossing rate is a simple measure of how many times the amplitude of the signal changes its sign. It is defined as

        Z_n = \sum_{k=-\infty}^{\infty} \left|\mathrm{sgn}[s(k)] - \mathrm{sgn}[s(k-1)]\right| w(n-k) ,   (2.7)

    where

        \mathrm{sgn}[s(k)] = \begin{cases} 1 & \text{for } s(k) \ge 0 \\ -1 & \text{for } s(k) < 0 \end{cases}   (2.8)

    and w(n) is a rectangular window. The most common use of the above mentioned functions is to detect the beginning and end of speech in a signal and to distinguish voiced and unvoiced sounds. The suggested window size is in the interval of 10-25 ms.

    Another time-domain function worth mentioning is the short-time autocorrelation function (STACF), defined as

        R_n(m) = \sum_{k=-\infty}^{\infty} s(k)\, w(n-k)\, s(k+m)\, w(n-k-m) ,   (2.9)

    where w(n) is a rectangular or Hamming window. Notice that R_n(m) is a two-dimensional function, where each value for time index n is an array of m values. It is used to detect periodicity in signals and is the basis of many spectral analysis methods. To be able to detect the pitch period, the windowed segment must contain at least two fundamental periods of the analyzed signal, so it is recommended to use windows with lengths of 20-40 ms, depending on the speaker.

    2.5 Frequency-domain analysis

    2.5.1 Fourier transform

    The Fourier transform is a mathematical operation which expresses a function of time as a function of frequency, called the frequency spectrum. Under suitable conditions, the time-domain signal can be reconstructed from the frequency-domain representation (using the inverse Fourier transform). There is a strong connection between the two representations (e.g. the convolution of two signals in the time domain corresponds to their multiplication in the frequency domain), which makes the Fourier transform the most important method in signal processing and the basis for numerous other methods.

    For application to discrete signals, the discrete-time Fourier transform (DTFT) is defined as

        X(\omega) = \sum_{k=-\infty}^{\infty} x_k e^{-i\omega k} ,   (2.10)

    where \omega denotes the angular frequency and x_k denotes the k-th sample of the signal. It can be seen from the definition that the values of the DTFT can be complex even when working with real signals. The DTFT of an aperiodic signal (speech segments are periodic only over short time intervals, so they are treated as aperiodic in this formula) is a continuous periodic complex function with period 2\pi. The magnitude of the DTFT, |X(\omega)|, is an even function, while the phase \angle X(\omega) is odd.


    2.5.2 Discrete Fourier transform

    To be able to process the spectra of discrete signals using computers, we also need them to be discrete. To achieve this, the discrete Fourier transform (DFT) is used. The DFT of a discrete signal of length N (x_0, ..., x_{N-1}) is also of length N, and can be computed using the following formula:

        X_n = \sum_{k=0}^{N-1} x_k e^{-i 2\pi \frac{n}{N} k} .   (2.11)

    The frequency represented by the value X_n is \frac{n}{N} \cdot s_f for values n \le \lceil N/2 \rceil, where s_f is the sampling frequency. Values with n > \lceil N/2 \rceil contain redundant information about the signal.

    When computing the DFT of a signal directly from the definition, it takes O(N^2) operations to get the result. There is a group of less time-consuming algorithms called fast Fourier transforms (FFT), which can compute the DFT and its inverse in only O(N \log N) operations. The most well-known of these is the Cooley-Tukey algorithm [4], which can compute the DFT of a signal whose length is a power of two in only (N/2) \log_2 N complex multiplications and N \log_2 N complex additions [10]. The description of these algorithms is not in the scope of this thesis.
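    In Octave the DFT is computed by the built-in fft() function, which uses an FFT algorithm internally. A short sketch of the bin-to-frequency relation described above, assuming a signal vector s sampled at frequency fs:

        % DFT of a signal and the frequencies represented by its values.
        N = length (s);
        X = fft (s);                         % X(n+1) corresponds to X_n in (2.11)
        f = (0:N-1)' * fs / N;               % frequency of each bin, (n/N)*fs
        half = 1:(floor (N / 2) + 1);        % the non-redundant half of the spectrum
        magnitude = abs (X(half));           % magnitude spectrum up to fs/2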

    2.5.3 Short-time Fourier transform

    As with other short-time analysis methods, we can get the formula for the short-time Fourier transform by combining (2.3) with (2.11):

        X(n, m) = \sum_{k=-\infty}^{\infty} s(k)\, w(n-k)\, e^{-i 2\pi \frac{m}{N} k} .   (2.12)

    The short-time Fourier transform of a signal is often representedusing spectrograms, see figure 2.5.


    Figure 2.5: Spectrogram of speech sequence shown in figure 1.1
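    A spectrogram such as the one in figure 2.5 can be produced with the specgram() function of the Octave-forge signal package. A minimal sketch, assuming a mono signal vector s sampled at fs; the window length and overlap are arbitrary illustrative choices:

        pkg load signal
        win = 256;                                                   % window length in samples
        specgram (s, win, fs, hanning (win), round (0.75 * win));    % 75 % overlap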

    2.6 Linear predictive analysis

    Linear predictive analysis is one of the most powerful methods of speech analysis. It is used to estimate the parameters of the speech production model described in section 2.1. The linear system in this model can be described by an all-pole model system function:

        H(z) = \frac{S(z)}{E(z)} = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}} ,   (2.13)

    where p is the order of the model and G is the gain parameter.

    The goal of linear prediction is to estimate the linear prediction coefficients (LPC) a_i with minimized average squared prediction error, based on the assumption that the k-th sample of the signal s(k) can be described as a linear combination of Q previous samples of the signal and the excitation u(k):

        s(k) = \sum_{i=1}^{Q} a_i s(k-i) + G u(k) .   (2.14)

    There are several methods to estimate these parameters; the one implemented in the practical part of this thesis is the covariance method. The method is elaborated in [7], and for our purpose it is sufficient to know how to compute the coefficients. The following formula describes a symmetric positive-semidefinite matrix which represents a system of linear equations:

        \varphi_n[i, k] = \sum_{m=M_1-k}^{M_2-k} s_n[m]\, s_n[m+k-i] .   (2.15)

    The solution of this system are the linear prediction coefficients estimated with the covariance method.

    2.6.1 Perceptual linear prediction

    Perceptual linear prediction (PLP) is a revision of the linear prediction method, taking into account the ideas of sound perception described in section 2.2. A few transformations are applied to the analyzed signal before describing it by an all-pole model:

    - calculation of the power spectrum of the speech signal,
    - non-linear transformation of the frequency spectrum to the bark scale described in 2.2.1 and application of a set of band-pass filters representing the critical bands of hearing,
    - weighting of the samples with respect to an equal loudness curve,
    - application of the relation between the intensity of the sound and the perceived loudness.

    This method was introduced by Hynek Hermansky [11] and is nowadays one of the most utilized methods in speech recognition[6].

    2.7 Cepstral analysis

    The cepstrum of a signal is defined as the inverse Fourier transform of the logarithm of the spectrum of the signal. The idea behind the cepstrum is that the log spectrum of a sound can itself be treated as a waveform and subjected to further Fourier analysis. The independent variable of the cepstrum is time, but in order to differentiate it from time-domain analysis it is called quefrency (the names cepstrum and quefrency were created by changing the order of letters in the words spectrum and frequency, respectively). Cepstral analysis has uses in pitch detection and pattern recognition, and also in techniques for estimation of the vocal tract system (cepstral LPC coefficients).

    2.7.1 Mel-frequency cepstral coefficients

    Mel-frequency cepstral coefficients (MFCC) are parameters of a signal used for capturing its short-time spectral characteristics in a compact form. The method consists of computing the power spectrum of a windowed segment of the analyzed signal and then applying a bank of triangular band-pass filters. The central frequencies and bandwidths of the filters are determined so that they cover the entire analyzed frequency band, are uniformly spaced on the mel scale, and the central frequency of a given filter is also the first frequency value of the next filter. The number of filters is usually based on the bandwidth of the original signal with respect to the critical band theory of auditory perception, e.g. for an 8 kHz bandwidth 20 filters are used. The filters are applied to the signal segment separately and the sums of the output values of each filter are calculated. The MFCCs are the values of the discrete cosine transform applied to the list of logarithms of the filter outputs. Various sources differ in some details of the exact method of computation of the MFCCs; the practical part of this thesis implements the one described in [6].
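    The final steps of the method can be sketched compactly in Octave for a single windowed frame. Here H is assumed to be a precomputed triangular mel-spaced filterbank (one row per filter, one column per spectrum bin), x the frame and w the window; these names, the FFT length and the number of retained coefficients are illustrative assumptions, not the interface of the mfcc() function described in chapter 4, and dct() comes from the signal package.

        % MFCCs of one windowed frame, given a precomputed mel filterbank H.
        nfft = 512;
        X = fft (x .* w, nfft);                  % spectrum of the windowed frame
        P = abs (X(1:nfft/2 + 1)) .^ 2;          % one-sided power spectrum
        E = H * P(:);                            % output (sum) of each mel-spaced filter
        c = dct (log (E));                       % DCT of the log filter outputs
        mfcc = c(1:13);                          % keep the first few coefficients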

    MFCCs are commonly used in the fields of speaker recognition and speech recognition, and are increasingly finding uses in music information retrieval [12].


  • Chapter 3

    Sound analysis with mathematical software

    As can be seen from the previous chapter, sound analysis (and signal processing in general) is a very math-heavy task, so it is reasonable to look into software developed specifically for performing mathematical computations when aiming to get engaged in this field. This chapter is devoted to a review of some of the more widely used mathematical software on the market and to a comparison of their sound processing capabilities.

    3.1 Matlab

    Matlab is a programming environment for algorithm development, data analysis, visualization, and numerical computation, developed by MathWorks[13]. It is primarily intended for numerical computations, although an optional toolbox grants the ability to perform symbolic computations. Its user base varies on a wide scale across industry and academia, counting roughly one million users (2004)[14]. Along with Simulink, an additional package also developed by MathWorks for multidomain simulation and Model-Based Design, it is popular among engineers and economists. Matlab is distributed as a stripped-down core program along with optional toolboxes providing additional functionality grouped by field of use. While most of the functionality is accessible from the command prompt, Matlab comes with a GUI for added convenience. MathWorks provides separate licensing options for educational and commercial uses.


    Matlab's Signal Processing Toolbox implements numerous industry-standard digital and analog signal processing methods. It provides tools for visualizing signals in the time and frequency domains, computing FFTs, designing FIR and IIR filters and other signal processing techniques. As Matlab is also an interpreted procedural programming language, the tools in the Signal Processing Toolbox can be used to develop custom algorithms[15]. The Matlab Coder and Simulink Coder tools can generate processor-specific C/C++ code from algorithms prototyped in Matlab, which makes it a powerful tool in embedded system design (where signal processing is used to an utmost degree), thus making it the software of choice for a large fraction of professionals working in the field.

    3.2 GNU Octave

    GNU Octave (hereafter only Octave) is a high-level interpreted language, primarily intended for numerical computations[16]. It can be considered a Matlab clone, as it is very similar to Matlab, making programs easily portable between the two. Unlike Matlab, Octave does not have a GUI (one is planned for the next major release); it is used through its interactive command-line interface. For visualization, the external gnuplot program is used, which is included in the Octave installation package. Octave is open source and is released under the GNU General Public License. As part of the GNU project, its development is entirely community based; people who are interested are encouraged to take part. Octave's package system makes the creation and installation of additional packages expanding the core program's functionality effortless. A collection of collaboratively developed packages for Octave can be found on the Octave-forge website.

    While Octave has limited signal processing capabilities on its own, Octave-forge maintains a signal package developed to broaden the spectrum of available methods. Although the number of functions in this package is only a fraction of what is available in Matlab, the ones actually implemented are syntactically almost identical to their counterparts in the Signal Processing Toolbox, which makes Octave a reasonable choice if Matlab's functionality is desired but out of budget.

    3.3 Maple

    Maple is a commercial computer algebra system (CAS). Its development began in 1980 at the University of Waterloo in Ontario, Canada; it has been on the commercial market since 1988, distributed and developed by Maplesoft. Maple supports numeric computation as well as symbolic computation and visualization[17]. The power of CAS systems lies in their capability to perform symbolic mathematics. Computations with exact quantities such as fractions, radicals and symbols eliminate the propagation of rounding errors, thus achieving more precise results than purely numerical systems. The approximation of the results can be computed at arbitrary precision, not limited by the underlying hardware. One of Maple's main advantages over other mathematical software is its above-par GUI in terms of user-friendliness and intuitiveness. Its highlights include the ability to process input in standard mathematical notation, the typing of which is assisted by well-written automatic formatting; the context-sensitive menus appearing when right-clicking an expression; and its overall natural feel. As for Matlab, separate student and commercial licenses also exist for Maple.

    Maple is software which can be used in signal processing to a high extent, given its strong mathematical and visualization abilities. It has built-in functions for the essential methods used in the field; however, it is not that well supplied with implementations of additional methods. The Maplesoft website and other community websites provide free expansions and tutorials for applications in signal processing. For uses specifically in sound processing, the low number of supported audio formats can be a major drawback.

    3.4 Wolfram Mathematica

    Wolfram Mathematica, developed by Wolfram Research, is an all-around commercial mathematical software package and computer algebra system that can be deployed in any area that requires technical computation. It consists of a kernel, a command-line interface which interprets inputs as expressions in Mathematica code and returns outputs in the same format, and a front-end providing sophisticated input and output handling, including typeset mathematics, graphics, GUI components, tables and sound[18]. The number of features included in Mathematica is second to no other mathematical software on the market, ranging from high-performance symbolic and arbitrary-precision numerical computations to support for multi-threading and user-level parallel programming. Mathematica's interface is less user friendly and has the steepest learning curve of the software reviewed in this chapter. Student and commercial licenses are available for Mathematica.

    Wolfram Mathematica is a very powerful mathematical software package, which makes it suitable for signal processing purposes. The range of supported file formats (audio and otherwise) is superior to its competitors, including compressed audio formats like FLAC.

    3.5 Comparison

    This section aims to compare the software described above in terms of their use in signal processing (and more specifically audio processing), using tables and textual description to present the results of the study.

                               Matlab¹      Maple     Mathematica   Octave
    Student                    $99          $124      $126.62       Free
    Commercial (single user)   $4471.22     $2,845    $3,219.93     Free

    Table 3.1: Comparison of prices
    The prices are stated for purchase from the Czech Republic, converted to USD with exchange rates valid on 5/18/2012.
    ¹ including the Signal Processing Toolbox


                             Matlab     Maple   Mathematica   Octave
    Window functions         SPT        No      No            Yes
    Convolution              Yes        Yes     Yes           Yes
    Cross-correlation        SPT        No      No            signal
    Discrete FT              Yes        Yes     Yes           Yes
    Symbolic FT              Symbolic   Yes     Yes           No
    Short-time FT            No         No      No            Yes
    Real/complex cepstrum    SPT        No      No            signal
    LPC                      SPT        No      No            signal

    Table 3.2: Comparison of built-in signal processing functionality
    SPT = Signal Processing Toolbox, Symbolic = Symbolic Toolbox, signal = signal package

                     Matlab   Maple        Mathematica*   Octave
    .WAV             LPCM     LPCM/ADPCM   Yes            LPCM
    .AU (.SND)       μ-law    No           Yes            μ-law
    .RAW             No       No           No             LPCM
    .AIFF            No       No           Yes            No
    .w64 (Wave64)    No       No           Yes            No
    .FLAC            No       No           Yes            No

    Table 3.3: Comparison of supported audio formats and encodings
    * According to its documentation, Mathematica supports all standard raster formats and codecs [19]

    3.5.1 Plotting

    Each of the mentioned software packages has the ability to visualize data to some extent. This is an area where Octave is significantly weaker than its commercial competitors, since it is limited by the functionality of gnuplot, and the interfacing of the two programs does not even allow every gnuplot function to be used from Octave. Plots can be manipulated from the command-line interface of Octave. It allows the creation of static 2D and 3D plots, with multiple functions on the same figure. For use in sound processing its capabilities are sufficient; however, the lack of a GUI makes working with plots tiresome.


    The specgram() function of the signal package, which plots spectrograms directly, proved a valuable feature, seeing that only Matlab has an equivalent.

    The other programs have far greater visualization power in general, allowing among other things the creation of interactive plots and animations. A noteworthy feature of Mathematica is the ability to create .CDF files, Wolfram's own format designed to allow easy authoring of dynamically generated interactive content[18], which can be viewed with the CDF Player downloadable from the Wolfram website free of charge. The CDF Player integrates with the most popular browsers, which allows .CDF documents to be embedded in HTML pages. Demonstrations of its capabilities in the field of audio processing can be found in the Wolfram Demonstrations Project[20].


  • Chapter 4

    The speech package for GNU Octave

    The speech package is a library of functions for GNU Octave implementing the methods of speech analysis described in chapter 2. This chapter describes the package in detail and discusses the issues faced during its development.

    4.1 Installation

    The speech package takes advantage of the Octave package interface, making it effortless to install; you can do so by opening Octave, changing into the directory where the package archive is located and typing the following command at the Octave command prompt:

    pkg install speech.tar.gz

    The package depends on the signal package, which is itself dependent on a few more packages, all of which can be downloaded for free from the Octave-forge website.

    4.2 Input handling

    The input signal can be passed to the functions of the package in two different ways: as an array containing the values of the sampled data, or as a string containing the path to a .wav file. If passed as an array, the window size is expected to be given as a number of samples; if passed as a .wav file, the window size is expected in milliseconds. While the inclusion of this feature may seem an unnecessary complication, it can considerably speed up the workflow when working with a larger number of sound samples in separate files.
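    The two conventions are tied together by the sampling frequency of the file; a one-line sketch of the conversion that has to happen internally (the values of fs and win_ms are illustrative, not parameters of the package interface):

        fs = 16000;                                % sampling frequency of the .wav file
        win_ms = 25;                               % window size given in milliseconds
        win_samples = round (win_ms / 1000 * fs);  % the same window in samples (400)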


    The functions can handle a maximum of two channels of audio data at once (stereo signals). If a matrix is passed as an input signal, its smaller dimension is treated as representing separate channels, and the function is applied recursively to the first two channels. Multi-channel audio files are treated as matrices whose columns represent separate channels. To get the results for both channels, an appropriate number of output parameters must be specified. When calling a function, the arguments (with the exception of the input signal) can be omitted; if so, default values are used and no visualization of the result is done. The functions lpc_cov(), plp() and mfcc() handle only vectors as input signals.

    4.3 Function reference

    Function references are included in the source code of the functions in a way that Octave recognizes, so they are accessible from Octave by typing help function_name at the command prompt. This section provides an overview of each of the functions included in the package rather than describing their syntax.

    4.3.1 The sti(), ste() and zcr() functions

    The sti() and ste() functions implement the short-time intensity and short-time energy functions respectively, which were described in section 2.4. They take as arguments the input signal, window size, window type and an integer as an option for visualization.

    To compute the results, the functions do not implement the definitions directly; the summation of sampled values inside each windowed segment would be excessively time-consuming. The transformations are applied to each sample of the signal in advance, and the result is then convolved with a window of the given length. The result of the function call is a vector of the same length as the analyzed signal, each value of which corresponds to the windowed signal segment centered at the same index. As a side effect of using convolution, windows near the ends of the input signal handle it as if it were padded with zeros on both sides. The built-in conv() function computes the convolution by calling the pre-compiled filter() function, which uses FFT algorithms to get the result. This optimization results in a drastic improvement in run time.

    Figure 4.1: Sample output of the zcr() function, called with 150 points for interpolation of the plot

    The idea behind the computation of the result of the zcr() (zero-crossing rate) function is similar: it takes the convolution of the window with the array obtained by subtracting the signum of the original signal from the signum of the signal shifted in time by one sample. The zcr() function does not take the window type as an argument; for computing the zero-crossing rate the rectangular window is used.
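    The convolution idea can be sketched in a few lines of plain Octave; this is an illustration of the approach, not the speech package's actual code, and it assumes a column signal vector s and a window length L in samples:

        % Short-time energy and zero-crossing rate over the whole signal,
        % computed by convolving per-sample transforms with a window.
        w   = ones (L, 1);                        % rectangular window of length L
        ste = conv (s .^ 2, w, "same");           % short-time energy at every sample
        sg  = 2 * (s >= 0) - 1;                   % signum of the signal, +1 or -1
        d   = [0; abs(sg(2:end) - sg(1:end-1))];  % equals 2 wherever the sign changes
        zcr = conv (d, w, "same") / 2;            % zero crossings inside each window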

    For plotting the results, the complementary function plot_stf() is used. Before plotting, it interpolates the result according to the optional last argument of the mentioned functions, making the plot more readable and more pleasant to the eye. If the value passed is 0, no interpolation is done. The results are plotted over the waveform of the analyzed signal to make their meaning perfectly clear.


    4.3.2 The stf() function

    This is a complementary function which applies a function, passed to it as a function handle, to windowed segments of the input signal. It is a very useful tool for performing short-time analysis of a signal. For methods which combine the samples inside a window in a non-trivial way, the convolution of the window with the signal cannot be used. In most cases it is superfluous to compute the results of a function for as many windowed segments as there are samples in the signal; the gap between the centers of adjacent windows can be specified explicitly and passed to the function as an argument.

    4.3.3 The stacf() function

    This function implements the short-time autocorrelation function using the xcorr() function from the signal package and the supporting stf() function from the speech package. The xcorr() function performs cross-correlation using FFT algorithms. Called with one argument, it returns the autocorrelation of the input signal (autocorrelation is the cross-correlation of a signal with itself). In this case, using the xcorr() function makes a noticeable difference in run time compared to a direct implementation of the definition.
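    The underlying step can be sketched for a single Hamming-windowed frame using xcorr() from the signal package (frame is assumed to be a column vector holding one microsegment; maxlag is an arbitrary illustrative limit on the lag m):

        pkg load signal
        L = length (frame);
        wframe = frame .* hamming (L);            % apply the Hamming window
        maxlag = 200;                             % largest lag m of interest
        [R, lags] = xcorr (wframe, maxlag);       % autocorrelation for m = -maxlag..maxlag
        R = R(lags >= 0);                         % keep the non-negative lags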

    Because the result of applying the STACF to a signal is a two-dimensional set of data, a 3D plot was chosen for its visualization. Calling stacf() with an additional argument greater than zero plots the result as a 3D mesh plot. The default viewpoint of the plot is normal to the x and y axes, and the values on the z axis are represented by the colors of the graph (similarly to a spectrogram), making the peaks of the graph prominent. The plot can be rotated arbitrarily; to return to the original viewpoint, type the command view(2) at the Octave prompt.


    4.3.4 The lpc_cov() function

    The signal package has a function for estimating linear prediction coefficients using the autocorrelation method¹. The lpc_cov() function of the speech package implements the estimation with the covariance method described in section 2.6, which does not have an implementation in Octave.

    Finding a suitable optimization for the algorithm computing the coefficients with this method was crucial, since computing the matrix of the linear equation system which describes the coefficients by implementing formula 2.15 directly would take p(p+1)(N-p) iterations of one multiplication and one addition, where p is the desired model order and N is the length of the analyzed signal segment. From the formula it can be seen that each row of the matrix can be interpreted as the cross-correlation of the whole signal segment with a moving windowed subsegment of length N-p, thus allowing the matrix to be computed by p+1 calls to the xcorr() function. The speech package implements this optimization.
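    For reference, the "trivial" baseline that this optimization improves on can be written as a direct transcription of the covariance normal equations. The sketch below is an illustrative naive implementation, not the package's lpc_cov(); it minimizes the prediction error over the last N-p samples of the frame:

        % Naive covariance-method LPC of a signal frame s (column vector), order p.
        function a = lpc_cov_naive (s, p)
          s = s(:);
          N = length (s);
          phi = zeros (p + 1, p + 1);
          m = (p + 1):N;                          % samples with p predecessors available
          for i = 0:p
            for k = 0:p
              phi(i + 1, k + 1) = sum (s(m - i) .* s(m - k));   % phi[i,k]
            endfor
          endfor
          % Solve sum_k a_k * phi[i,k] = phi[i,0] for i = 1..p.
          a = phi(2:end, 2:end) \ phi(2:end, 1);
        endfunction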

    In Matlab, the arcov() function implements a different algorithm. It computes a different matrix, of size (N-p) x (p+1), whose solution is the same set of values. The matrix is computed simply by cutting the signal segment up into p+1 overlapping segments of length N-p, which serve as the columns of a temporary matrix. The matrix is then flipped vertically and every value is divided by the expression \sqrt{N-p}. Table 4.1 compares the run times of the Octave implementations of the three algorithms described above (the algorithm of Matlab's arcov() function was ported to Octave to provide a reasonable comparison).

    1. The documentation of the function says that it uses the Burg method for estimation; however, the function produces the same results as the aryule() function, which uses the Yule-Walker method (also called the autocorrelation method). Matlab's lpc() function produces the same results as well and is documented as using the autocorrelation method. Matlab's and Octave's arburg() function, which uses the Burg method, produces different output, leading to the belief that there is an error in the documentation of Octave's lpc() function.


                    N = 600, p = 20       N = 1000, p = 40      N = 1000, p = 200
    Trivial         8.3799 ± 0.0619 s     54.1800 ± 0.1390 s    > 60 s
    Speech pkg      0.0555 ± 0.0015 s     0.1143 ± 0.0154 s     0.8040 ± 0.0300 s
    Matlab-style    0.0628 ± 0.0022 s     0.1325 ± 0.0025 s     0.2853 ± 0.0221 s

    Table 4.1: Comparison of the run times of three different implementations of the covariance LPC method

    The lpc_cov() function calculates the linear prediction coefficients for the whole input signal. To calculate them for windowed segments of the input signal, use it in combination with the stf() function. Since linear prediction coefficients are mainly used for pattern matching purposes in speech recognition or speaker recognition, which require comparing the coefficients of different segments and matching them against a library of model coefficients, this function does not provide any kind of visualization.

    4.3.5 The strceps() and stcceps() functions

    Functions for the computation of the real and complex cepstrum are included in the signal package (rceps(), cceps()). The strceps() and stcceps() functions of the speech package apply the rceps() and cceps() functions respectively to windowed segments of the analyzed signal using the stf() function.

    4.3.6 The plp() and mfcc() functions

    The plp() function implements the perceptual linear prediction method as described in [6]. The output of this function is a vector of length Q (the model order, passed to the function as an argument) containing the PLP coefficients of the input signal. To get representative results, it is recommended to use this function in combination with the stf() function.

    The mfcc() function works similarly to the plp() function, except that it computes the mel-frequency cepstral coefficients of the input signal, as described in [6]. Again, combined use with the stf() function is highly recommended.

    4.3.7 Complementary functions

    sgn() A modified version of the built-in sign() (signum) function, used to determine the zero-crossing rate of a signal.

    dct_n() A modified version of the dct() (discrete cosine transform) function included in the signal package, used in the mfcc() function.

    elc() This function creates an equal loudness curve for loudness level 40 phon of arbitrary length, using an approximation equation described in [6]. The values are normalized to the interval [0,1]. Used by the plp() function.

    cbandfilt() This function creates an array of arbitrary length which corresponds to the impulse response of the critical band filters used in the PLP method.

    mel2hz(), hz2mel(), bark2hz(), hz2bark() These functions implement the conversion of frequency values between the units [mel] and [Hz], and [bark] and [Hz], respectively.

    4.4 Known issues

    When analyzing short signal segments using methods which compute the parameters of a model estimating the vocal tract system (i.e. the lpc_cov(), plp() and mfcc() functions), the warning "matrix singular to machine precision" is sometimes issued. This happens when the samples in the segment carry less information than is needed to describe a model of the given order, usually at the ends of the signal where it contains silence, or when the chosen model order is unreasonably high. The results obtained when the function has issued this warning are not valid.


  • Chapter 5

    Final thoughts

    This thesis focused on the analysis of sound signals with methods utilized primarily in the ever-expanding field of speech processing. While computers communicating with people in natural spoken language are still found only in science fiction, the accomplishments of research in the field have already started to make an impact on our everyday lives. GPS devices featuring synthesized voices giving directions have become common, and voice control of various electronic gadgets is no news either. Automatic captioning of a few selected live broadcasts is available on Czech National Television and the expansion of this feature is in the works. Despite the remarkable progress achieved in the recent past, the field of speech processing is yet to reach its zenith and there is a bright prospect for further research.

    While cutting-edge research requires cutting-edge software, usually developed with robust financial backing, in some cases open-source software offers a reasonable alternative for low-budget enterprises. Thanks to the enthusiasm of the development community, GNU Octave is becoming increasingly usable for practical applications.

    The speech package created as the practical part of this thesis is a user-contributed expansion of GNU Octave which enhances its speech processing capabilities. The methods implemented are used primarily in speech recognition and speaker recognition, fields that have been gaining ground rapidly in the recent past (especially the former; automatic captioning is a hot topic nowadays). Mastering these fields is instrumental to achieving the goal of fluent spoken communication with computers, but it is not sufficient. The techniques for understanding the linguistically imperfect output of speech recognition methods also have to take their great leap forward.


  • Bibliography

    [1] Wikipedia. Sound. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].

    [2] Wikibooks. Signals and Systems/Definition of Signals and Systems. http://en.wikibooks.org/w/index.php?title=Signals_and_Systems/Definition_of_Signals_and_Systems&oldid=2254812. [Online; accessed 20-May-2012].

    [3] Bores Signal Processing. Introduction to DSP. http://www.bores.com/courses/intro/index.htm. [Online; accessed 20-May-2012].

    [4] Steven W. Smith. The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Publishing, San Diego, CA, USA, 1997.

    [5] Wikipedia. Audio signal processing. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].

    [6] J. Psutka, L. Müller, J. Matoušek, V. Radová. Mluvíme s počítačem česky. Academia, Praha, 2006.

    [7] L. R. Rabiner and R. W. Schafer. Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing. Now Publishers, 2007.

    [8] ISO 226:2003. Acoustics: Normal equal-loudness-level contours. ISO, Geneva, Switzerland.

    [9] Wikipedia. File:Lindos1.svg, 2012. [Online; accessed 20-May-2012].

    [10] Wikipedia. Fast Fourier transform. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].

    [11] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738-1752, 1990.


    [12] Wikipedia. Mel-frequency cepstrum. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].

    [13] MathWorks. MATLAB - The Language of Technical Computing, 2012. [Online; accessed 20-May-2012].

    [14] Wikipedia. MATLAB. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].

    [15] MathWorks. MATLAB - Documentation, 2012. [Online; accessed 20-May-2012].

    [16] GNU Octave. http://www.gnu.org/software/octave/, 2012. [Online; accessed 20-May-2012].

    [17] Wikipedia. Maple (software). Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].

    [18] Wikipedia. Mathematica. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].

    [19] Wolfram Research. Mathematica 8 Documentation Center, 2012. [Online; accessed 20-May-2012].

    [20] Wolfram Research. Wolfram Demonstrations Project. http://demonstrations.wolfram.com/, 2012. [Online; accessed 20-May-2012].

