SPEECH ANALYSIS AND SYNTHESIS
By
Faiga Mohammed Mohammed Ahmed Alawd
INDEX NO. 084057
Supervisor
Ostaz Mohammed Jaafr Elnourani
A thesis submitted in partial fulfillment for the degree of
B.Sc. (HON)
To the Department of Electrical and Electronic Engineering
(COMMUNICATION ENGINEERING)
Faculty of Engineering
University of Khartoum
July 2013
DECLARATION OF ORIGINALITY
I declare that this report entitled “SPEECH ANALYSIS AND SYNTHESIS” is my
own work except as cited in the references. The report has not been accepted for any degree and
is not being submitted concurrently in candidature for any degree or other award.
Signature : _________________________
Name : _________________________
Date : _________________________
ABSTRACT
Speech signals are important signals in communication systems. They must be analyzed in order to obtain their important parameters, and compressed to make the maximum use of the available bandwidth.
Speech analysis and synthesis comprises many techniques, each with its own advantages and disadvantages. This project investigates some speech analysis and synthesis techniques and compares them.
In this project, spectral analysis and synthesis of two types, the FFT technique and the LPC technique, were implemented using Matlab code, and a Matlab GUI was built to display the results of the analysis and synthesis. The GUI was named SAS, an abbreviation of Speech Analysis and Synthesis.
المستخلص
Speech signals are among the most important signals used in communication systems; they must therefore be analyzed in order to obtain their important parameters, and compressed to make optimal use of the available bandwidth.
Speech analysis and synthesis has many techniques, each with its own merits and drawbacks. This project seeks to explore some types of speech analysis and synthesis and to compare these different types.
In this project, spectral speech analysis and synthesis of both kinds, the Fourier transform and linear prediction coefficients, were studied using Matlab, and the results were then displayed using the graphical user interface available in Matlab. This interface was named after the initials of the project's overall topic, Speech Analysis and Synthesis.
ACKNOWLEDGEMENT
Unlimited praise to Allah, as numerous as His creatures, as pleases Him, as weighty as His Throne, and as extensive as His words.
I would like to thank the people who have contributed to the completion of this thesis. Firstly, I would like to thank my advisor, Ostaz Mohammed Jaafer Alnourani, for his constant guidance and motivation. He was always available to discuss my thesis, edit my manuscript, and offer encouragement. This thesis would not have been possible without him. I would also like to thank my parents and my extended family for their support and motivation while I was writing this thesis.
I also owe special thanks to my friend and partner Randa Abd Almonem Sabir; we were destined to meet and become friends. And to my sister Wefag Mohammed, who is always there for me when I need her.
It was a great pleasure to have the opportunity to work with and learn from such high-quality teachers in the University of Khartoum Faculty of Engineering. They taught me that in order to be a good engineer you have to be a good person in the first place.
DEDICATION
To the person who taught me the true meaning of love, the meaning of giving endlessly,
without limits,
To my mother
Suad Al-Shekh Taha
TABLE OF CONTENTS
TITLE
DECLARATION OF ORIGINALITY
ABSTRACT
المستخلص
ACKNOWLEDGEMENT
DEDICATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND TERMINOLOGIES
CHAPTER 1: INTRODUCTION
1.1. Introduction
1.2. Motivation
1.3. Problem statement
1.4. Objectives
1.5. Methodology
1.6. Thesis layout
CHAPTER 2: LITERATURE REVIEW
2.1. Introduction
2.2. Speech signals
2.2.1. Origin of speech signals
2.2.2. Classification of speech signals
2.3. Speech communication
2.4. Speech analysis
2.4.1. Spectral analysis of speech
2.4.2. Homomorphic analysis
2.4.3. Formant analysis
2.4.4. Analysis of voice pitch
2.5. Speech synthesis
2.5.1. Synthesizer technologies
2.5.1.1. Concatenative synthesis
2.5.1.2. Formant synthesis
2.5.2. Text to speech
2.6. Similar papers and researches
2.6.1. Speech Analysis and Synthesis by Linear Prediction of the Speech Wave
2.6.2. Spectral Envelopes and Inverse FFT Synthesis
CHAPTER 3: METHODOLOGY
3.1. Introduction
3.2. FFT method
3.2.1. Short-time frequency analysis
3.2.2. Choice of the weighting function
3.3. LPC method
3.3.1. The Levinson–Durbin algorithm
3.3.2. Linear predictive coding (LPC) vocoder
3.3.3. Voice-excited linear prediction (VELP)
3.3.4. Residual-excited linear predictive (RELP) vocoder
3.4. Matlab implementation
CHAPTER 4: IMPLEMENTATION AND RESULTS
4.1. Introduction
4.2. Matlab implementation using GUI
4.3. Speech analysis and synthesis techniques
4.3.1. Short-time Fourier transform
4.3.2. Linear predictive coding
4.4. Comparison between the FFT and LPC techniques
CHAPTER 5: CONCLUSION AND FUTURE WORK
5.1. Conclusion
5.2. Future work
REFERENCES
APPENDIX A
APPENDIX B
LIST OF FIGURES
Figure 2.1  Diagram of the human speech production system
Figure 2.2  Source-filter model of speech
Figure 2.3  Homomorphic analyzer
Figure 2.4  Overview of a typical TTS system
Figure 2.5  Block diagram of the speech synthesizer by linear prediction
Figure 3.1  The rectangular window
Figure 3.2  The Hamming window
Figure 3.3  The Hanning window
Figure 3.4  Simple speech production
Figure 3.5  Encoder and decoder for the LPC vocoder
Figure 3.6  LPC analyzer
Figure 3.7  Block diagram of voice-excited LPC vocoders
Figure 3.8  Flow chart of the FFT analysis
Figure 3.9  Flow chart of the FFT synthesis
Figure 3.10  Flow chart of the LPC analysis
Figure 3.11  Flow chart of the LPC synthesis
Figure 4.1  GUI interface in Matlab "SAS"
Figure 4.2  The waveform of the recorded signal
Figure 4.3  The spectrogram of the speech signal under consideration
Figure 4.4  GUI interface in Matlab "SAS" with a popup menu
Figure 4.5  The original speech signal "warn4.wav"
Figure 4.6  3D representation of the short-time Fourier transform for each frame
Figure 4.7  Compressed speech signal using the FFT
Figure 4.8  The residual signal generated from the LPC analyzer
Figure 4.9  The resulting LPC compressed signals along with the original signal
Figure 4.10  The residual signal used in the voice-excited LPC after it had a DCT transform and noise was added to it
Figure 4.11  Random residual signal
Figure A.1  GUIDE layout editor
LIST OF TABLES
Table 4.1  The matrix size of the different LPC parameters in the case of the "Hello.wav" signal
Table A.1  GUI tools summary
Table A.2  GUIDE components used, with a short description of each
LIST OF ABBREVIATIONS
LPC  Linear predictive coding
FFT  Fast Fourier transform
DSP  Digital signal processing
TTS  Text to speech
FFT⁻¹  Inverse fast Fourier transform
IIR  Infinite impulse response
LP  Linear prediction
AR  Autoregressive
VELP  Voice-excited linear prediction
DCT  Discrete cosine transform
RELP  Residual-excited linear prediction
SAS  Speech analysis and synthesis
GUI  Graphical user interface
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
This chapter engages the reader with the ideas behind this thesis. It consists of the project background, problem statement, objectives, methodology, and project scope. In addition, an overview of the report layout is presented.
1.2 MOTIVATION
Accompanying the explosive growth of the Internet is the growing need for audio
compression, or data compression in general. The major goal in audio compression is to
compress the audio signal, either for reducing transmission bandwidth requirements or for
reducing memory storage requirements, without sacrificing quality.
Digital cellular phones, for example, use some type of compression algorithm to compress the voice signal in real time over general switched telephone networks. Audio compression can also be used off-line to reduce storage requirements for mail forwarding of voice messages or for storing voice messages in a digital answering machine.
Speech and other audio signals represented by sampled data are often compressed by being quantized to a low bit rate during data transmission in order to obtain faster data transfer rates. Speech compression allows smaller bandwidth, higher data rates, or a combination of these attributes. It can also be used to store speech-like data in a compact form. To compress the speech signal, speech analysis techniques must be applied first.
1.3 PROBLEM STATEMENT
The core objective of this project is to understand different audio signal processing techniques, mainly the LPC and FFT analysis and synthesis techniques; to gain a profound knowledge of how to implement them in Matlab; to observe their characteristics with different parameters; and to compare the results obtained. In general, the goal is to understand the objectives and basic techniques of audio signal processing.
The main objective of audio coding or compression is to avoid the bandwidth and storage issues associated with audio recording, transmission, and storage. This can be achieved by representing the signal with a minimum number of bits while achieving transparent signal reproduction.
1.4 OBJECTIVES
This project aims to implement Matlab code that analyzes and synthesizes the speech signal using the following techniques:
FFT analysis
FFT synthesis
LPC analysis
LPC synthesis
It also aims to design a Matlab GUI that makes it easy to work with the Matlab code.
Note that the quality of the compressed signal that results from both methods is a bottleneck in this implementation. The compressed speech must be as close as possible to the source speech.
1.5 METHODOLOGY
A Matlab GUI was implemented as a visual tool for speech analysis and synthesis. The speech signal was drawn in the time and frequency domains to emphasize its characteristics and then compressed using the FFT and LPC techniques.
1.6 THESIS LAYOUT
This section presents brief information about the remaining thesis chapters, including the appendices.
In Chapter 2 (Literature Review), the speech signal is introduced in conjunction with its relationship to communication systems, and different analysis and synthesis techniques are introduced. Finally, some of the previous work and research papers related to speech analysis and synthesis are discussed briefly.
In Chapter 3 (Methodology), the FFT and LPC methods are introduced as the base techniques for the Matlab code.
In Chapter 4 (Implementation and Results), the detailed design of the Matlab GUI code is provided, and graphs of the resulting compressed speech signals are shown in order to compare them.
In Chapter 5 (Conclusion and Future Work), the results are discussed in relation to the objectives, and future work and recommendations are presented.
The References section contains the citations used, indexed by number. It includes various books, websites, and papers.
Appendix A contains a brief description of the Matlab GUI. In Appendix B, the Matlab code used in this project is presented.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
This chapter introduces speech signals, from their origins to their uses in communication. Hence it begins with a brief summary of the speech signal, a short explanation of the origins of speech signals, and their classifications.
The sections that follow describe the main types of speech analysis and synthesis. The chapter then ends with previous work related to the analysis and synthesis of speech.
2.2 Speech Signals
2.2.1 Origin of speech signals
The speech waveform is a sound pressure wave originating from controlled movements
of anatomical structures making up the human speech production system. A simplified structural
view is shown in Figure 2.1. Speech is basically generated as an acoustic wave that is radiated
from the nostrils and the mouth when air is expelled from the lungs with the resulting flow of air
perturbed by the constrictions inside the body. It is useful to interpret speech production in terms
of acoustic filtering. The three main cavities of the speech production system are nasal, oral, and
pharyngeal forming the main acoustic filter. The filter is excited by the air from the lungs and is
loaded at its main output by radiation impedance associated with the lips. [1]
Figure 2.1: Diagram of the human speech production system
The vocal tract refers to the pharyngeal and oral cavities grouped together. The form and
shape of the vocal and nasal tracts change continuously with time, creating an acoustic filter with
time-varying frequency response. As air from the lungs travels through the tracts, the frequency
spectrum is shaped by the frequency selectivity of these tracts. The resonance frequencies of the
vocal tract tube are called formant frequencies or simply formants, which depend on the shape
and dimensions of the vocal tract. [1]
Inside the larynx is one of the most important components of the speech production system: the vocal cords. The cords are located at the height of the "Adam's apple". Vocal cords are a pair of elastic bands of muscle and mucous membrane that open and close rapidly during speech production. The speed at which the cords open and close is unique to each individual and defines the features and personality of the particular voice. [1]
The speech signal is created at the vocal cords, travels through the vocal tract, and is produced at the speaker's mouth; it then reaches the listener's ear as a pressure wave. The speech waveform is a representation of the amplitude variations in the signal as a function of time. The speech signal can take the form of a sinusoidal wave or a wave with harmonics.
2.2.2 Classification of Speech Signals
Roughly speaking, a speech signal can be classified as voiced or unvoiced. Voiced
sounds are generated when the vocal cords vibrate in such a way that the flow of air from the
lungs is interrupted periodically, creating a sequence of pulses to excite the vocal tract. With the
vocal cords stationary, the turbulence created by the flow of air passing through a constriction of
the vocal tract generates unvoiced sounds. In time domain, voiced sound is characterized by
strong periodicity present in the signal, with the fundamental frequency referred to as the pitch
frequency, or simply pitch. For men, pitch ranges from 50 to 250 Hz, while for women the range
usually falls somewhere in the interval of 120 to 500 Hz. Unvoiced sounds, on the other hand, do
not display any type of periodicity and are essentially random in nature. [1]
For the voiced frame, there is clear periodicity in time domain, where the signal repeats
itself in a quasiperiodic pattern; and also in frequency domain, where a harmonic structure is
observed. Hence the spectrum indicates dominant low-frequency contents, due mainly to the
relatively low value of the pitch frequency. For the unvoiced frame, however, the signal is
essentially random. In the spectrum there is a significant amount of high-frequency components,
corresponding to rapidly changing signals. It is necessary to indicate that the voiced / unvoiced
classification might not be absolutely clear for all frames, since during transitions (voiced to unvoiced or vice versa) there will be randomness and quasiperiodicity that is difficult to judge as
strictly voiced or strictly unvoiced. [1]
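The voiced/unvoiced distinction described above can be illustrated with a small sketch. The thesis's own implementation is in Matlab; the Python fragment below is an assumed, simplified heuristic (zero-crossing rate with an arbitrary threshold), not the project's method: quasi-periodic voiced frames cross zero far less often than noise-like unvoiced frames.

```python
import numpy as np

def classify_frame(frame, zcr_threshold=0.25):
    """Label a frame 'voiced' or 'unvoiced' by its zero-crossing rate.
    Illustrative heuristic only; the threshold is an assumption."""
    signs = np.sign(frame)
    signs[signs == 0] = 1
    zcr = np.mean(signs[1:] != signs[:-1])  # fraction of sign changes
    return "unvoiced" if zcr > zcr_threshold else "voiced"

fs = 8000
t = np.arange(160) / fs                    # one 20 ms frame at 8 kHz
voiced_like = np.sin(2 * np.pi * 120 * t)  # 120 Hz tone, in the pitch range
rng = np.random.default_rng(0)
unvoiced_like = rng.standard_normal(160)   # white noise, frication-like
```

Real coders combine several such features (energy, periodicity, spectral tilt), since, as noted above, transition frames are hard to judge from any single cue.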
For most speech coders, the signal is processed on a frame-by-frame basis, where a frame
consists of a finite number of samples. The length of the frame is selected in such a way that the
statistics of the signal remain almost constant within the interval. This length is typically
between 20 and 30 ms, or 160 and 240 samples for 8-kHz sampling. [1]
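The frame-by-frame processing just described can be sketched as follows. This is a generic Python illustration (the thesis uses Matlab); the 50% overlap is an assumed, common choice not stated in the text.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames of frame_len samples,
    advancing by hop samples, as in frame-by-frame speech coding."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

fs = 8000                     # 8 kHz sampling, as in the text
frame_len = int(0.020 * fs)   # 20 ms -> 160 samples
hop = frame_len // 2          # 50% overlap (assumed choice)
x = np.arange(8000, dtype=float)  # one second of dummy samples
frames = split_into_frames(x, frame_len, hop)
```

Each row of `frames` is then short enough that the signal statistics are approximately constant within it, which is the assumption the coder relies on.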
The source-filter model of speech production, shown in Figure 2.2, models speech as a combination of a sound source, such as the vocal cords, and a linear acoustic filter, the vocal tract (and radiation characteristic). [4]
While only an approximation, the model is widely used in a number of applications because of its relative simplicity. It is used in both speech synthesis and speech analysis, and is related to linear prediction. [4]
In implementations of the source-filter model of speech production, the sound source, or excitation signal, is often modeled as a periodic impulse train for voiced speech, or white noise for unvoiced speech. The vocal tract filter is, in the simplest case, approximated by an all-pole filter, where the coefficients are obtained by performing linear prediction to minimize the mean-squared error in the speech signal to be reproduced. Convolution of the excitation signal with the filter response then produces the synthesized speech. [4]
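The synthesis step of the source-filter model can be sketched directly from this description: an impulse-train excitation driven through an all-pole filter. The sketch below is in Python for brevity (the thesis uses Matlab), and the 100 Hz pitch and single-pole "vocal tract" are toy assumptions, not estimated values.

```python
import numpy as np

def all_pole_filter(excitation, a):
    """Run the excitation through an all-pole filter
    1 / (1 - sum_k a_k z^-k): the synthesis step of the source-filter
    model (excitation convolved with the vocal-tract response)."""
    p = len(a)
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * y[n - k]
        y[n] = acc
    return y

fs = 8000
pitch_period = fs // 100            # 100 Hz pitch (assumed)
excitation = np.zeros(400)
excitation[::pitch_period] = 1.0    # periodic impulse train (voiced source)
a = np.array([0.5])                 # toy single-pole "vocal tract"
speech = all_pole_filter(excitation, a)
```

For unvoiced sounds the impulse train would simply be replaced by white noise; the filter stays the same.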
Figure 2.2: Source-Filter Model of Speech

2.3 Speech communication
The purpose of speech is communication, i.e., the transmission of messages. There are
several ways of characterizing the communications potential of speech. One highly quantitative
approach is in terms of information theory ideas, as introduced by Shannon.
According to information theory, speech can be represented in terms of its message
content, or information. An alternative way of characterizing speech is in terms of the signal
carrying the message information, i.e., the acoustic waveform. [2]
The message is first represented in some abstract form in the brain of the speaker; the information in the message is then ultimately converted to an acoustic signal.
The information that is communicated through speech is intrinsically of a discrete nature;
i.e., it can be represented by a concatenation of elements from a finite set of symbols. The
symbols from which every sound can be classified are called phonemes. Each language has its
own distinctive set of phonemes, typically numbering between 20 and 50. For example, English
can be represented by a set of around 40 phonemes. [2]
A central concern of information theory is the rate at which information is conveyed. For
speech a crude estimate of the information rate can be obtained by noting that physical
limitations on the rate of motion of the articulators require that humans produce speech at an
average rate of about 10 phonemes per second. [2]
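The crude estimate mentioned above can be worked through numerically. Assuming (simplistically) equiprobable, independent phonemes, each of the roughly 40 English phonemes carries log2(40) bits, so 10 phonemes per second give an information rate on the order of tens of bits per second:

```python
import math

# Crude information-rate estimate from the text: ~10 phonemes/second drawn
# from ~40 English phonemes. Assuming equiprobable, independent phonemes
# (a deliberate simplification), each carries log2(40) bits.
bits_per_phoneme = math.log2(40)            # about 5.3 bits
rate_bits_per_s = 10 * bits_per_phoneme     # about 53 bits/second
```

This tiny figure, compared with the tens of kilobits per second of raw sampled speech, is exactly what motivates the compression discussed in Chapter 1.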
In speech communication systems, the speech signal is transmitted, stored, and processed
in many ways. Technical concerns lead to a wide variety of representations of the speech signal.
In general, there are two major concerns in any system:
1. Preservation of the message content in the speech signal.
2. Representation of the speech signal in a form that is convenient for transmission or
storage, or in a form that is flexible so that modifications can be made to the speech signal (e.g.,
enhancing the sound) without seriously degrading the message content. [2]
The representation of the speech signal must be such that the information content can
easily be extracted by human listeners, or automatically by machine.
2.4 Speech analysis
Speech analysis considers speech sounds in terms of their method of production and the level of processing between the digitized acoustic waveform and the acoustic feature vectors, i.e., the extraction of "interesting" information in the form of an acoustic vector.
2.4.1 Spectral Analysis of Speech
Frequency-domain representation of speech information appears advantageous from two
standpoints. First, the acoustic analysis of the vocal mechanism shows that the normal mode or natural frequency concept permits concise description of speech sounds. Second, clear evidence
exists that the ear makes a crude frequency analysis at an early stage in its processing. Any
spectral measure applicable to the speech signal should therefore reflect temporal features of
perceptual significance as well as spectral features. [3]
The purpose of spectral analysis is to find out how acoustic energy is distributed across
frequency. Typical uses in phonetics are discovering the spectral properties of the vowels and
consonants of a language, comparing the productions of different speakers, or finding
characteristics that point forward to speech perception or back to articulation. [3]
Formerly, calculation was time-consuming so it was more practical to work on the lab bench
using bandpass filters and then measure the filter output at a range of frequencies. From the
1950s onward, this was done by the spectrograph that burnt a spectrogram onto paper as a
permanent record. Nowadays, a suitable computer program will calculate speech spectra in
seconds. [3]
There are two main methods for spectral analysis: the fast Fourier transform (FFT) and linear prediction (LPC). FFT finds the energy distribution in the actual speech sound, whereas LPC estimates the vocal tract filter that shaped that speech. The advantage of FFT is easier setup; the disadvantage is difficulty identifying formants for speakers with higher-pitched voices. LPC has better success with high-pitched voices, but the settings need to be carefully tuned for each speaker. [3]
2.4.2 Homomorphic Analysis
The approach is based on the observation that the mouth output pressure is approximately the
linear convolution of the vocal excitation signal and the impulse response of the vocal tract.
Homomorphic filtering is applied to deconvolve the components and provide for their individual
processing and description. [3]
The analyzer is based on a computation of the cepstrum, considered as the inverse Fourier transform of the log magnitude of the Fourier transform, as shown in Figure 2.3. The transmitted parameters represent pitch and voiced-unvoiced information and the low-time portion of the cepstrum, representing an approximation to the cepstrum of the vocal-tract impulse response. [8]
In the synthesis, the low‐time cepstral information is transformed to an impulse response
function, which is then convolved with a train of impulses during voiced portions or a noise
waveform during unvoiced portions to reconstruct the speech. [8]
Since no phase information is retained in the analysis, phase must be regenerated during
synthesis. Either a zero‐phase or minimum‐phase characteristic can be obtained by simple
weighting of the cepstrum before transformation. The analysis consists of a measurement of the
cepstrum and a characterization of the excitation function by means of a voiced-unvoiced decision
and measurement of the pitch during voicing. The parameters used to characterize the spectral
envelope are samples of the cepstrum. [8]
Since the excitation function introduces into the cepstrum sharp peaks at multiples of a pitch
period, we would generally choose the cutoff time to be less than the smallest expected pitch
period. The cepstrum is obtained by weighting the input speech with a suitable window. [8]
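The cepstrum definition given above, and the property that makes homomorphic deconvolution work, can be sketched in a few lines. This is a generic Python illustration (the thesis itself uses Matlab), with random toy signals standing in for the excitation and vocal-tract response:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum exactly as described in the text: the inverse Fourier
    transform of the log magnitude of the Fourier transform. (In practice
    the frame is first weighted with a suitable window.)"""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)  # floor avoids log(0)
    return np.real(np.fft.ifft(log_mag))

# Homomorphic property: convolution of excitation and vocal-tract response
# becomes ADDITION in the cepstral domain, which is what lets the two
# components be separated by liftering (cutting at a chosen quefrency).
# For circular convolution the additivity is exact:
rng = np.random.default_rng(0)
excitation = rng.standard_normal(64)
tract = rng.standard_normal(64)
speech = np.real(np.fft.ifft(np.fft.fft(excitation) * np.fft.fft(tract)))
```

For a voiced frame, the excitation contributes sharp cepstral peaks at multiples of the pitch period, while the vocal-tract contribution is concentrated at low quefrency; hence the cutoff-time rule stated above.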
A key difference between the Homomorphic filtering of voiced and unvoiced speech is that
in the latter the source and filter components overlap in the low-quefrency region. [7]
Figure 2.3: Homomorphic analyzer
Properties of Homomorphic Filtering [7]
1. It is a non-parametric (transform-based) deconvolution technique.
2. It allows both poles and zeros to be represented.
3. It has wider spurious resonances consistent with the spectral smoothing.
4. It admits a natural interpretation in terms of cepstral liftering.
5. It can provide a minimum-phase or a mixed-phase estimate of the vocal tract impulse response by using the complex cepstrum.
6. Though more "natural" than its counterpart in linear prediction, the resultant sound is sometimes characterized as "muffled".
2.4.3 Formant analysis
Formants are defined by Gunnar Fant as the spectral peaks of the sound spectrum |P (f)|, of
the voice. In speech science and phonetics, formant is also used to mean an acoustic resonance of
the human vocal tract. It is often measured as an amplitude peak in the frequency spectrum of the
sound, using a spectrogram or a spectrum analyzer, though in vowels spoken with a high
fundamental frequency, as in a female or child voice, the frequency of the resonance may lie
between the widely-spread harmonics and hence no peak is visible. In acoustics, it refers to a
peak in the sound envelope and/or to a resonance in sound sources, notably musical instruments,
as well as that of sound chambers. [3]
In the analysis of speech signals, formant parameters are commonly used to characterize the
vocal tract. Due to the movement of articulators during speech production, the formant
parameters vary with time. These variations are usually slow except in the case of certain speech sounds of a highly dynamic nature. The relatively more rapid variations in vocal tract
characteristics due to vocal fold oscillations can be understood by considering the widely used
source filter model for voiced speech. [9]
2.4.4 Analysis of Voice Pitch
Pitch in speech is the relative highness or lowness of a tone as perceived by the ear,
which depends on the number of vibrations per second produced by the vocal cords. Pitch is the
main acoustic correlate of tone and intonation. [12]
Voice pitch analysis is a technique that examines changes in the relative vibration
frequency of the voice to measure emotional response to stimuli. Such analysis can be used to
determine which verbal responses reflect an emotional commitment and which are merely low
involvement responses. Such emotional reactions are measured with audio adapted computer
equipment. [10]
Pitch defines two parameters:
a) Low pitch
b) High Pitch
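A common way to measure pitch, sketched below, is to locate the autocorrelation peak within the plausible range of pitch periods given in Section 2.2.2 (50-500 Hz). The text defines pitch but does not prescribe an estimator, so this Python fragment is an assumed, illustrative method (the thesis's tooling is Matlab):

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate pitch from the autocorrelation peak within the plausible
    lag range (a common estimator, shown for illustration only)."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo = int(fs / fmax)                      # shortest candidate period
    hi = int(fs / fmin)                      # longest candidate period
    lag = lo + int(np.argmax(ac[lo : hi + 1]))
    return fs / lag

fs = 8000
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 200 * t)   # 200 Hz tone: a 40-sample period
```

Tracking this estimate frame by frame yields the pitch contour whose relative rises and falls carry the tone and intonation information discussed above.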
2.5 Speech synthesis
Speech synthesis is the artificial production of human speech. A computer system used for
this purpose is called a speech synthesizer, and can be implemented in software or hardware
products. The quality of a speech synthesizer is judged by its similarity to the human voice and
by its ability to be understood. [5]
Synthesis is the inverse operation of analysis; it is used to obtain the original speech after it has been modified by any of the analysis methods. There are many techniques used for synthesis.
Synthesis can be defined as the process in which a speech decoder generates the speech signal based on the parameters it has received through the transmission line, or it can be a procedure performed by a computer to estimate some kind of representation of the speech signal given a text input. [13]
2.5.1 Synthesizer technologies
The two primary technologies generating synthetic speech waveforms are concatenative
synthesis and formant synthesis. Each technology has strengths and weaknesses, and the
intended uses of a synthesis system will typically determine which approach is used. [5]
2.5.1.1 Concatenative synthesis
Concatenative synthesis is based on the concatenation (or stringing together) of segments
of recorded speech. Generally, concatenative synthesis produces the most natural-sounding
synthesized speech. However, differences between natural variations in speech and the nature of
the automated techniques for segmenting the waveforms sometimes result in audible glitches in
the output. There are three main sub-types of concatenative synthesis. [5]
i. Unit selection synthesis
Unit selection synthesis uses large databases of recorded speech. During database
creation, each recorded utterance is segmented into some or all of the following: individual
phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically,
the division into segments is done using a specially modified speech recognizer set to a "forced
alignment" mode with some manual correction afterward, using visual representations such as
the waveform and spectrogram. An index of the units in the speech database is then created
based on the segmentation and acoustic parameters like the fundamental frequency (pitch),
duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit
selection). This process is typically achieved using a specially weighted decision tree. [5]
Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system, defined in Section 2.5.2, has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g., minor words become unclear) even when a better choice exists in the database. [5]
ii. Diphone synthesis
Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-
sound transitions) occurring in a language. The number of diphones depends on the phonotactics
of the language. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and
the robotic-sounding nature of formant synthesis, and has few of the advantages of either
approach other than small size. As such, its use in commercial applications is declining, although
it continues to be used in research because there are a number of freely available software
implementations. [5]
iii. Domain-specific synthesis
Domain-specific synthesis concatenates prerecorded words and phrases to create
complete utterances. It is used in applications where the variety of texts the system will output is
limited to a particular domain, like transit schedule announcements or weather reports. The
technology is very simple to implement, and has been in commercial use for a long time, in
devices like talking clocks and calculators. The level of naturalness of these systems can be very
high because the variety of sentence types is limited, and they closely match the prosody and
intonation of the original recordings. [5]
2.5.1.2 Formant synthesis
This is the oldest method for speech synthesis, and it dominated the synthesis
implementations for a long time. It is based on the well-known source-filter model which means
to generate periodic and non-periodic source signals and to feed them through a resonator circuit
or a filter that models the vocal tract. The principles are thus very simple, which makes formant synthesis flexible and relatively easy to implement. Also, formant synthesis can be used to
produce any sounds. On the other hand, the simplifications made in the modeling of the source
signal and vocal tract inevitably lead to somewhat unnatural sounding result. [13]
In a crudely simplified implementation, the source signal can be an impulse train or a sawtooth wave, together with a random noise component. To improve the speech quality and to gain better control of the signal, it is naturally advisable to use as accurate a model as possible.
Typically the adjustable parameters include at least the fundamental frequency, the relative
intensities of the voiced and unvoiced source signals, and the degree of voicing. The vocal tract
model usually describes each formant by a pair of filter poles so that both the frequency and the
bandwidth of the formant can be determined. To make intelligible speech, at least three lowest
formants should be taken into account, but including more formants usually improves the speech
quality. The parameters controlling the frequency response of the vocal tract filter and those
controlling the source signal are updated at each phoneme. The vocal tract model can be
implemented by connecting the resonators either in cascade or parallel form. [13]
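The cascade source-filter arrangement just described can be sketched in a few lines. The following is an illustrative Python/NumPy version (the thesis itself works in Matlab); the sampling rate, the 100 Hz fundamental, and the three formant frequency/bandwidth pairs are assumed example values, not taken from the text:

```python
import numpy as np

def resonator_coeffs(freq_hz, bw_hz, fs):
    # One conjugate pole pair: the bandwidth sets the pole radius and
    # the formant frequency sets the pole angle.
    r = np.exp(-np.pi * bw_hz / fs)
    theta = 2.0 * np.pi * freq_hz / fs
    return -2.0 * r * np.cos(theta), r * r   # a1, a2 of 1 + a1*z^-1 + a2*z^-2

def all_pole_filter(x, a1, a2):
    # Direct-form recursion y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

fs = 8000                                      # sampling rate (assumed)
n = np.arange(int(0.05 * fs))                  # 50 ms of signal
source = (n % (fs // 100) == 0).astype(float)  # 100 Hz impulse train (voiced source)

# Cascade three resonators; the formant values are illustrative only.
speech = source
for f, bw in [(730.0, 60.0), (1090.0, 90.0), (2440.0, 120.0)]:
    a1, a2 = resonator_coeffs(f, bw, fs)
    speech = all_pole_filter(speech, a1, a2)
```

Each resonator keeps its pole pair inside the unit circle (radius below one), which is what makes the cascade stable.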
In addition to the resonators that model the formants, the synthesizer can contain filters
that model the shape of the glottal waveform and the lip radiation, and also an anti-resonator to
better model nasalized sounds. [13]
2.5.2 Text to speech
A text-to-speech (TTS) system converts normal language text into speech. Synthesized
speech can be created by concatenating pieces of recorded speech that are stored in a database.
Systems differ in the size of the stored speech units; a system that stores phones or diphones
provides the largest output range, but may lack clarity. For specific usage domains, the storage of
entire words or sentences allows for high-quality output. Alternatively, a synthesizer can
incorporate a model of the vocal tract and other human voice characteristics to create a
completely "synthetic" voice output. [5]
Figure 2.4: Overview of a typical TTS system
2.6 Similar papers and researches
2.6.1 Speech Analysis and Synthesis by Linear Prediction of the Speech Wave [6]
B. S. Atal and Suzanne L. Hanauer presented a method for automatic analysis
and synthesis of speech signals by representing them in terms of time-varying parameters related
to the transfer function of the vocal tract and the characteristics of the excitation.
An important property of the speech wave, namely, its linear predictability, forms the
basis of both the analysis and synthesis procedures. Unlike speech analysis methods based on
Fourier analysis, this method derives the speech parameters from a direct analysis of the speech
wave.
In this paper they describe a parametric model for representing the speech signal in the
time domain, and they discuss methods for analyzing the speech wave to obtain these parameters
and for synthesizing the speech wave from them.
The speech signal is synthesized by a single recursive filter. The synthesizer, thus, does
not require any information about the individual formants and the formants need not be
determined explicitly during analysis.
Moreover, the synthesizer makes use of the formant bandwidths of real speech, in
contrast to formant synthesizers, which use fixed bandwidths for each formant.
Figure 2.5: Block diagram of the speech synthesizer by Linear Prediction
2.6.2 SPECTRAL ENVELOPES AND INVERSE FFT SYNTHESIS [14]
X. Rodet and P. Depalle presented a new additive synthesis method based on spectral
envelopes and the inverse Fast Fourier Transform (FFT⁻¹). User control is facilitated by the use of
spectral envelopes to describe the characteristics of the short term spectrum of the sound in terms
of sinusoidal and noise components. Such characteristics can be given by users or obtained
automatically from natural sounds. Use of the inverse FFT reduces the computation cost by a
factor on the order of 15 compared to oscillators. They propose a low cost real-time synthesizer
design allowing processing of recorded and live sounds, synthesis of instruments and synthesis
of speech and the singing voice.
In usual implementations of additive synthesis, the frequency f_j and the amplitude a_j
of the partials are obtained at each sample by linear interpolation of breakpoint functions of time
which describe the evolution of f_j and a_j versus time. When the number of partials is large,
control by the user of each individual breakpoint function becomes impossible in practice.
Another argument against such breakpoint functions is as follows. In the case of the voice and of
certain instruments, a source filter model is an adequate representation of some of the behavior
of the partials. Then the amplitude of a component is a function of its frequency, i.e. the transfer
function of the filter. That is, the amplitude a_j can be obtained automatically by evaluating some
spectral function.
CHAPTER 3
METHODOLOGY
3.1 INTRODUCTION
In this chapter the spectral analysis with its two types, FFT and LPC, is described
briefly. Some types of voice vocoders are also presented in order to understand their role in
speech quality improvement.
The Matlab implementation is represented as flow chart diagrams in order to
make it easier to capture the idea behind the code; this is illustrated in section 3.5.
3.2 FFT METHOD
3.2.1 Short-Time Frequency Analysis
The conventional mathematical link between an aperiodic time function f(t) and its
complex amplitude density spectrum F(ω) is the Fourier transform pair:
F(ω) = ∫₋∞^∞ f(t) e^(−jωt) dt
f(t) = (1/2π) ∫₋∞^∞ F(ω) e^(jωt) dω
For the transform to exist, ∫₋∞^∞ |f(t)| dt must be finite. Generally, a continuous speech signal
neither satisfies the existence condition nor is known over all time. The signal must consequently
be modified so that its transform exists for integration over known past values. Further, to reflect
significant temporal changes, the integration should extend only over times appropriate to the
quasi-steady elements of the speech signal. Essentially a running spectrum is desired, with real-
time as an independent variable, and in which the spectral computation is made on weighted past
values of the signal. [3]
Such a result can be obtained by analyzing a portion of the signal through a specified
time window, or weighting function. The window is chosen to insure that the product of signal
and window is Fourier transformable. [3]
3.2.2 Choice of the Weighting Function, h(t)
A window function is a mathematical function that has a value only inside a chosen
interval and is zero-valued otherwise. [15]
In speech applications, it usually is desirable for the short-time analysis to discriminate
vocal properties such as voiced and unvoiced excitation, fundamental frequency, and formant
structure. The choice of the analyzing time window h(t) determines the compromise made
between temporal and frequency resolution. [3]
The rectangular window is the simplest window; it is shown in Figure 3.1. It is equivalent
to replacing all but N values of a data sequence by zeros, making it appear as though
the waveform suddenly turns on and off. [15]
It is defined as:
w(n) = 1, 0 ≤ n < N
w(n) = 0, otherwise
Figure 3.1: The rectangular window
Despite its simplicity, the rectangular window has sharp edges, which do not occur in
practice and which lead to discontinuities at the ends.
One way to avoid discontinuities at the ends is to taper the signal to zero or near zero and
hence reduce the mismatch. This can be done using a Hamming window, the most common
window in speech analysis, defined as:
w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n < N
w(n) = 0, otherwise
This is simply a raised cosine and is plotted in Figure 3.2.
Figure 3.2: The hamming window
The Hamming window is an optimized version of the Hanning window, shown in Figure 3.3,
tuned to minimize the maximum (nearest) side lobe, giving that side lobe a height of about
one-fifth that of the Hann window. The Hann raised-cosine window is defined by:
w(n) = 0.5 (1 − cos(2πn/(N − 1))), 0 ≤ n < N
Figure 3.3: The Hanning window
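The three windows compared here are easy to generate and examine numerically. The sketch below (Python/NumPy rather than the thesis' Matlab) builds each window from its defining formula and measures the peak side-lobe level of its spectrum, which is the property the side-lobe discussion is about; the window length and FFT size are assumed values:

```python
import numpy as np

N = 256
n = np.arange(N)
rect = np.ones(N)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
hann = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))

def peak_sidelobe_db(w):
    # Magnitude spectrum, zero-padded, normalized so the main lobe is 0 dB.
    W = np.abs(np.fft.rfft(w, 8192))
    W = 20 * np.log10(W / W.max() + 1e-12)
    # Walk down the main lobe to its first null, then take the largest
    # value among the remaining (side) lobes.
    i = 1
    while i < len(W) and W[i] <= W[i - 1]:
        i += 1
    return W[i:].max()
```

For these definitions the rectangular window's nearest side lobe sits around −13 dB, the Hann window's around −31 dB, and the Hamming window's around −43 dB, consistent with the "about one-fifth" relation quoted above.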
3.3 LPC METHOD
LPC (Linear Predictive Coding) is a method to represent and analyze human speech. The
idea of coding human speech is to change the representation of the speech: with LPC, the speech
is represented by LPC coefficients and an error signal instead of the original speech signal. The
LPC coefficients are found by LPC estimation, which describes the inverse transfer function of
the human vocal tract.
This method is used to successfully estimate basic speech parameters like pitch, formants
and spectra. The principle behind the use of LPC is to minimize the sum of the squared
differences between the original speech signal and the estimated speech signal over a finite
duration. This could be used to give a unique set of predictor coefficients. These predictor
coefficients are normally estimated every frame.
Both speech analysis and synthesis are based on modeling the vocal tract as a linear
all-pole (IIR) filter having the system transfer function:
H(z) = G / (1 − Σ_{k=1}^{p} a_k z^(−k))
where G is the filter gain and the a_k are the parameters that determine the poles. [1]
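Given the coefficients a_k and the gain G, the magnitude response of this all-pole model can be evaluated directly on the unit circle. The following Python/NumPy sketch is illustrative (the thesis works in Matlab); the single pole pair at radius 0.95 and angle π/4 is a hand-picked assumption so the resonance is easy to verify:

```python
import numpy as np

def all_pole_response(a, G, n_points=512):
    # |H(e^{jw})| for H(z) = G / (1 - sum_k a_k z^{-k}), with a = [a_1 ... a_p].
    w = np.linspace(0.0, np.pi, n_points)
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-1j * np.outer(w, k)) @ np.asarray(a, dtype=float)
    return w, G / np.abs(A)

# One resonance: conjugate pole pair at radius 0.95, angle pi/4 (assumed).
r, theta = 0.95, np.pi / 4
a = [2.0 * r * np.cos(theta), -(r * r)]   # denominator 1 - a1*z^-1 - a2*z^-2
w, H = all_pole_response(a, G=1.0)
```

The magnitude peaks near ω = π/4, the pole angle, which is how each formant shows up as a peak in the LPC spectrum.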
The two most commonly used methods to compute the coefficients are the covariance
method and the autocorrelation formulation. The autocorrelation formulation was used in this
project because it is superior to the covariance method in the sense that the roots of the
polynomial in the denominator of the above equation are always guaranteed to be inside the unit
circle, hence guaranteeing the stability of the system H(z). The Levinson-Durbin recursion was
utilized to compute the required parameters for the autocorrelation method. The block diagram
of a simplified model for speech production can be seen in Figure 3.4. [1]
Figure 3.4: Simplified model of speech production
Two mutually exclusive excitation functions are used to model voiced and unvoiced
speech sounds. On a short-time basis, voiced speech is considered periodic with a fundamental
frequency F₀ and a pitch period 1/F₀, which depends on the speaker. Hence, voiced speech is
generated by exciting the all-pole filter model with a periodic impulse train. On the other hand,
unvoiced sounds are generated by exciting the all-pole filter with the output of a random noise
generator. [1]
The parameters of the all-pole filter model are determined from the speech samples by
means of linear prediction. Specifically, the output of the linear prediction filter is:
ŝ(n) = − Σ_{k=1}^{p} a_p(k) s(n − k)
and the corresponding error between the observed sample s(n) and the predicted value ŝ(n) is:
e(n) = s(n) − ŝ(n)
The pole parameters a_p(k) of the model are determined by minimizing the sum of the
squared errors, which leads to the normal equations:
Σ_{k=1}^{p} a_p(k) r_ss(m − k) = −r_ss(m),  m = 1, 2, …, p
where r_ss(m) represents the autocorrelation of the sequence s(n), defined as:
r_ss(m) = Σ_{n=0}^{N} s(n) s(n + m)
The equation above can be expressed in matrix form as:
R_ss a = −r_ss
where R_ss is the autocorrelation matrix, r_ss is the autocorrelation vector, and a is the vector of
model parameters. [1]
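These normal equations can be set up and solved numerically. The Python/NumPy sketch below follows the sign convention used here (prediction ŝ(n) = −Σ a_k s(n−k), so R a = −r) and also evaluates the residual energy as the squared gain; the second-order test signal is an assumed example, not taken from the thesis:

```python
import numpy as np

def autocorr(s, max_lag):
    s = np.asarray(s, dtype=float)
    return np.array([np.dot(s[:len(s) - m], s[m:]) for m in range(max_lag + 1)])

def lpc_normal_equations(s, p):
    # Solve R_ss a = -r_ss (Toeplitz system built from the autocorrelation).
    r = autocorr(s, p)
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:p + 1])
    # Residual energy G^2 = r(0) + sum_k a_k r(k) under this convention.
    g2 = r[0] + float(np.dot(a, r[1:p + 1]))
    return a, g2

# Synthetic AR(2) test signal: s[n] = 0.9 s[n-1] - 0.2 s[n-2] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
s = np.zeros(4000)
for n in range(2, 4000):
    s[n] = 0.9 * s[n - 1] - 0.2 * s[n - 2] + e[n]

a, g2 = lpc_normal_equations(s, p=2)
```

With this convention the recovered coefficients are approximately a ≈ (−0.9, 0.2), the negatives of the AR coefficients, and G² comes out close to the innovation energy, i.e. much smaller than the signal energy r(0).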
These equations can be solved in MATLAB by using the Levinson-Durbin algorithm. The gain
parameter of the filter can be obtained from the input-output relationship:
s(n) = − Σ_{k=1}^{p} a_p(k) s(n − k) + G x(n)
where x(n) represents the input sequence.
Rearranging this equation, the scaled input can be written in terms of the error sequence:
G x(n) = s(n) + Σ_{k=1}^{p} a_p(k) s(n − k) = e(n)
Then,
G² Σ_{n=0}^{N−1} x²(n) = Σ_{n=0}^{N−1} e²(n)
If the input excitation is normalized to unit energy by design, then
G² = Σ_{n=0}^{N−1} e²(n) = Σ_{k=1}^{p} a_p(k) r_ss(k) + r_ss(0)
where G² is set equal to the residual energy resulting from the least-squares optimization.
Once the LPC coefficients are computed, the input speech frame is classified as voiced or
unvoiced, and if it is indeed voiced, the pitch is determined. The pitch is determined by
computing the following sequence in Matlab:
r_e(n) = Σ_{k=1}^{p} r_a(k) r_ss(n − k)
where r_a(k) is defined as follows:
r_a(k) = Σ_{i=1}^{p−k} a_p(i) a_p(i + k)
This is the autocorrelation sequence of the prediction coefficients. The pitch is detected by
finding the peak of the normalized sequence r_e(n)/r_e(0) in the time interval corresponding to
3 to 15 ms in the 20 ms sampling frame. If the value of this peak is at least 0.25, the frame of
speech is considered voiced, with a pitch period equal to the value n = N_p at which
r_e(N_p)/r_e(0) is a maximum. If the peak value is less than 0.25, then the speech frame is
considered unvoiced and the pitch is set equal to zero. [1]
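The voiced/unvoiced decision and the 0.25 threshold can be sketched directly. For simplicity this Python/NumPy version (the thesis uses Matlab) applies the normalized autocorrelation to the frame itself rather than to the r_e(n) sequence built from the prediction coefficients; the sampling rate and test signals are assumptions for illustration:

```python
import numpy as np

def detect_pitch(frame, fs, lo_ms=3.0, hi_ms=15.0, threshold=0.25):
    # Normalized autocorrelation of the frame, searched over the lag
    # range corresponding to 3-15 ms; a peak >= 0.25 means voiced.
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    r = r / r[0]
    lo = int(lo_ms * fs / 1000.0)
    hi = min(int(hi_ms * fs / 1000.0), len(r) - 1)
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return lag if r[lag] >= threshold else 0   # 0 means unvoiced

fs = 8000
t = np.arange(int(0.02 * fs)) / fs               # one 20 ms frame
voiced = np.sign(np.sin(2 * np.pi * 125.0 * t))  # 125 Hz -> 64-sample period
rng = np.random.default_rng(1)
unvoiced = rng.standard_normal(480)              # white-noise frame
```

In a quick check, the square-wave frame is detected as voiced with a 64-sample pitch period, while the white-noise frame falls below the threshold and is reported as unvoiced.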
The value of the LPC coefficients, the pitch period, and the type of excitation are then
transmitted to the receiver. The decoder synthesizes the speech signal by passing the proper
excitation through the all pole filter model of the vocal tract.
Typically the pitch period requires 6 bits, the gain parameter is represented in 5 bits
after its dynamic range is compressed logarithmically, and the prediction coefficients normally
require 8-10 bits each for accuracy reasons. This accuracy is very important in LPC because
small changes in the prediction coefficients result in large changes in the pole positions of the
filter model, which can cause instability in the model. [1]
If the speech frame is decided to be voiced, an impulse train is employed to represent it,
with nonzero taps occurring every pitch period. A pitch-detection algorithm is used in order to
determine the correct pitch period/frequency; the autocorrelation function is used to estimate
the pitch period. However, if the frame is unvoiced, then white noise is used to represent it and a
pitch period of T = 0 is transmitted. Therefore, either white noise or an impulse train becomes
the excitation of the LPC synthesis filter.
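The excitation choice reduces to a tiny helper. This is an illustrative Python/NumPy sketch (the frame length, the unit impulse amplitude, and the fixed noise seed are assumed details):

```python
import numpy as np

def make_excitation(n_samples, pitch_period, rng=None):
    # pitch_period > 0: voiced -> impulse train with a tap every period.
    # pitch_period == 0: unvoiced -> white noise.
    if pitch_period > 0:
        e = np.zeros(n_samples)
        e[::pitch_period] = 1.0
        return e
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.standard_normal(n_samples)

voiced_exc = make_excitation(160, 64)   # 20 ms frame at 8 kHz, 125 Hz pitch
unvoiced_exc = make_excitation(160, 0)
```

Either sequence is then fed through the all-pole synthesis filter to produce the output frame.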
3.3.1 The Levinson–Durbin Algorithm
The Levinson-Durbin recursion is a procedure in linear algebra that recursively calculates
the solution to an equation involving a Toeplitz matrix. It solves Ax = b, in which A is a Toeplitz
matrix, symmetric and positive definite, and b is an arbitrary vector. Durbin published a slightly
more efficient algorithm, known as the Levinson-Durbin recursive algorithm, which requires a
special form of b in which b consists of elements of A. [11]
Let a_k(m) be the kth coefficient for a particular frame in the mth iteration. The Levinson-
Durbin algorithm solves the following set of ordered equations recursively, for m = 1, 2, …, p:
k(m) = [R(m) − Σ_{k=1}^{m−1} a_k(m−1) R(m−k)] / E(m−1)
a_m(m) = k(m)
a_k(m) = a_k(m−1) − k(m) a_{m−k}(m−1),  1 ≤ k < m
E(m) = (1 − k(m)²) E(m−1)
where initially E(0) = R(0) and a(0) = 0. At each iteration, the coefficients a_k(m) for
k = 1, 2, …, m describe the optimal mth-order linear predictor, and the minimum error E(m) is
reduced by the factor (1 − k(m)²). Since E(m), a squared error, is never negative, |k(m)| ≤ 1.
This condition on the reflection coefficient k(m) also guarantees that the roots of A(z) will be
inside the unit circle, so the LP synthesis filter H(z) (where H(z) = 1/A(z)) will be stable.
Therefore, the autocorrelation method guarantees the stability of the filter. [1]
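The recursion can be written down almost verbatim. This Python/NumPy sketch follows the standard Durbin form (predictor convention ŝ(n) = Σ a_k s(n−k), so the system solved is R a = r); the test autocorrelation values come from a simple first-order model and are an assumption for illustration:

```python
import numpy as np

def levinson_durbin(R, p):
    # R = [R(0), R(1), ..., R(p)]. Returns the predictor coefficients
    # a_1..a_p, the reflection coefficients k(m), and the final error E(p).
    a = np.zeros(p + 1)        # a[0] unused so indices match the math
    E = float(R[0])
    ks = []
    for m in range(1, p + 1):
        acc = R[m] - np.dot(a[1:m], R[m - 1:0:-1])  # R(m) - sum a_k R(m-k)
        k = acc / E
        ks.append(k)
        a_new = a.copy()
        a_new[m] = k
        a_new[1:m] = a[1:m] - k * a[m - 1:0:-1]     # order-update step
        a = a_new
        E *= (1.0 - k * k)
    return a[1:], np.array(ks), E

# Autocorrelation of the AR(1) process s[n] = 0.5 s[n-1] + e[n]:
# R(m) = 0.5^m / (1 - 0.25)   (unit-variance innovation, assumed example)
R = np.array([1.0, 0.5, 0.25]) / 0.75
a, ks, E = levinson_durbin(R, p=2)
```

The recursion recovers a ≈ (0.5, 0), every reflection coefficient satisfies |k(m)| ≤ 1, and the final error E equals the unit innovation energy, consistent with the stability argument above.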
3.3.2 Linear Predictive Coding (LPC) Vocoder
The LPC vocoder consists of two parts, an encoder and a decoder, as shown in Figure 3.5.
At the encoder, a speech signal is divided into short-time segments, and each segment is
analyzed. The filter coefficients a_p and the gain G are determined by the LPC analyzer shown
in Figure 3.6. The pitch detector detects whether the sound is voiced or unvoiced. If the
sound is voiced, the pitch period is determined by the pitch detector. Then, all parameters are
encoded into a binary sequence. [1]
At the decoder, the transmitted data is decoded and the signal generators generate
excitation signals, periodic pulses or white noise, depending on the voiced or unvoiced decision.
This excitation signal goes through the autoregressive (AR) model H(z) with a_p and G as the
filter parameters, and then a synthesized speech signal is produced at the output of the filter. [1]
Figure 3.5: Encoder and decoder for the LPC vocoder
Figure 3.6: LPC analyzer
3.3.3 VOICE EXCITED LINEAR PREDICTION (VELP)
The main idea behind the voice-excitation is to avoid the imprecise detection of the pitch
and the use of an impulse train while synthesizing the speech. One should rather try to come up
with a better estimate of the excitation signal. Thus the input speech signal in each frame is
filtered with the estimated transfer function of the LPC analyzer. This filtered signal is called the
residual. If this signal is transmitted to the receiver, one can achieve very good quality. The
tradeoff, however, is a higher bit rate, although there is no longer a need to transmit the
pitch frequency and the voiced/unvoiced information. [1]
Figure 3.7: Block diagram of voice-excited LPC Vocoders
To achieve a high compression rate, the discrete cosine transform (DCT) of the residual
signal can be employed. The DCT concentrates most of the energy of the signal in the first
few coefficients. Thus one way to compress the signal is to transfer only the coefficients that
contain most of the energy. [1]
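The idea of keeping only the first few DCT coefficients of the residual can be sketched as follows. This Python/NumPy version builds an orthonormal DCT-II matrix by hand rather than calling a toolbox routine; the smooth test signal and the number of retained coefficients are assumptions for illustration:

```python
import numpy as np

def dct_matrix(N):
    # Orthonormal DCT-II basis: rows are cosine vectors, so C @ C.T = I.
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    C[0] /= np.sqrt(2.0)
    return C

def compress_residual(residual, keep):
    # Transform, zero all but the first `keep` coefficients, invert.
    C = dct_matrix(len(residual))
    X = C @ residual
    X[keep:] = 0.0
    return C.T @ X          # inverse transform (C is orthogonal)

t = np.linspace(0.0, 1.0, 128)
residual = np.exp(-5.0 * t) * np.cos(2.0 * np.pi * 3.0 * t)  # smooth example
approx = compress_residual(residual, keep=32)
```

Because the DCT packs the energy of a smooth signal into its leading coefficients, the 32-coefficient reconstruction stays close to the original, while keep equal to the full length reproduces it exactly (up to rounding).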
3.3.4 RESIDUAL-EXCITED LINEAR PREDICTIVE (RELP) VOCODER
The RELP vocoder uses LPC analysis for vocal tract modeling. Linear prediction error
(residual) signals are used for the excitation. There is no voiced/unvoiced detection or pitch
detection required. The RELP vocoder, which Un et al. proposed, encodes speech between 6 and
9.6 kbps depending on the quality of the synthesized speech desired. [1]
Using the residual signals as the excitation improves the quality of the synthesized speech
and makes it more natural than the basic LPC vocoders, because there are no miscalculations of
voiced/unvoiced sounds or of pitch. The excitation signals of the RELP vocoder are very close
to the ones the vocal tract produces. In contrast, the excitation signals (periodic pulses) of the
basic LPC vocoder are completely artificial. However, the total encoding rate of the RELP
vocoder is larger than that of most other LPC-based vocoder systems. The RELP vocoder needs
to encode a sequence of residual samples per segment, which is a large volume of data, whereas
only several bits are needed to encode the voiced/unvoiced decision, pitch, and gain in the other
LPC systems.
3.4 MATLAB IMPLEMENTATION
The spectral analysis and synthesis were implemented using Matlab GUI code to perform
both FFT and LPC analysis and synthesis. Figures 3.8, 3.9, 3.10, and 3.11 show flow charts
which give a brief representation of these techniques.
Figure 3.8: Flow chart of the FFT analysis
CHAPTER 4
IMPLEMENTATION AND RESULTS
4.1 INTRODUCTION
This chapter shows how the Matlab code was implemented using the GUI (built with
GUIDE, Matlab's graphical user interface development environment). A brief description of the
GUI is given in Appendix A, and the code is presented in Appendix B.
The results obtained from the Matlab code are introduced and discussed. A brief
comparison between the FFT and LPC techniques is also given in this chapter.
4.2 MATLAB IMPLEMENTATION USING GUI:
The analysis and synthesis were implemented in Matlab GUI code; the GUI was named
SAS (Speech Analysis and Synthesis) and is shown in Figure 4.1.
Figure 4.1: GUI interface in Matlab "SAS"
Once the SAS interface is opened, a speech signal can be recorded by clicking on the
"Record" button; it is then automatically saved as a wave file (.wav) to a folder specified by the
user, who also types a name for the speech signal. SAS allows specifying the time length of the
recorded signal by entering the desired length in the edit text named "Enter your recording time
period _it is by default 5 seconds"; if the user does not specify the time length, it defaults to
5 seconds.
As an example, a signal 2 seconds long was recorded. The speech is the voice of a woman
saying "Hello", saved on the computer as "Hello.wav". The sampling frequency is 44100 Hz.
The signal and its spectrogram, which shows how the spectral density of the signal varies with
time, are shown below in Figure 4.2 and Figure 4.3 respectively.
Figure 4.2: The wave form of the recorded signal
In the spectrogram the areas of high energy concentration are shown in red. The color
intensity shows the magnitude of the short-time Fourier transform.
As can be seen from the signal plot, there is silence and background noise at the start and
end of the speech. The actual speech starts at around 0.09 seconds and ends at around
1.2 seconds. However, every frame was analyzed in order to make the system flexible for any
other speech signal.
The "play" button allows retrieving any saved wave file; once the file is chosen, its
waveform and spectrogram are displayed. The "exit" button is a push button that is pressed in
order to close the SAS window.
Figure 4.3: The spectrogram of speech signal under consideration
4.3 SPEECH ANALYSIS AND SYNTHESIS TECHNIQUES:
The spectral analysis was implemented in the SAS code with its two types, FFT and
LPC. A popup menu was used to list the two types and to allow the user to choose one of them,
as shown in Figure 4.4.
Figure 4.4: GUI interface in Matlab "SAS" with a popup menu
4.3.1 Short-time Fourier transform:
Once the entry "Short time Fourier transforms" is chosen from the popup menu,
another window appears asking the user to select a wave file to be analyzed and synthesized
with the short-time Fourier transform. For example, if the signal "warn4.wav", already saved on
the computer and containing a woman's voice, is selected, then the length of the frames must be
specified; say the number 20 is entered in the edit text field named "Enter the frame size
length", meaning the signal will be divided into equal-size frames, each 20 ms long.
SAS displays the original signal, as shown in Figure 4.5, and then uses the waterfall
function to draw a 3D graph of the short-time Fourier transform of each frame of the signal, as
shown in Figure 4.6.
Figure 4.5: The original speech signal "warn4.wav"
Figure 4.6: 3D representation of the short time Fourier transform for each frame
Then it synthesizes the analyzed signal and displays the signal after it has been compressed,
as shown in Figure 4.7.
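The frame-by-frame FFT analysis and synthesis just described reduces to a few lines. This Python/NumPy sketch (the thesis uses Matlab) uses non-overlapping 20 ms rectangular frames and an assumed sine test signal; when no coefficients are discarded the round trip is exact:

```python
import numpy as np

def analyze_frames(signal, frame_len):
    # Split into non-overlapping rectangular-window frames, FFT each one.
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames, axis=1)

def synthesize_frames(spectra, frame_len):
    # Inverse FFT per frame, then concatenate the frames again.
    return np.fft.irfft(spectra, n=frame_len, axis=1).ravel()

fs = 8000
frame_len = int(0.020 * fs)            # 20 ms -> 160 samples
t = np.arange(fs) / fs                 # one second of signal
signal = np.sin(2.0 * np.pi * 440.0 * t)
spectra = analyze_frames(signal, frame_len)
restored = synthesize_frames(spectra, frame_len)
```

Compression then amounts to keeping only part of each frame's spectrum before the inverse transform, at the cost of some reconstruction error.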
Different types of windows could be used. The rectangular window's side lobe peaks do
not decrease very much from the peak of the main lobe when compared with the other window
types under consideration, and the side lobes show a somewhat constant level. Therefore, it has
more gain, because it does not attenuate the signals other than the ones at the center of the
analysis window; i.e., the signals outside of the main lobe are not attenuated very much.
The Hamming window's side lobe peaks are lower in magnitude relative to its main lobe
peak than those of the Hanning and rectangular windows, but they remain somewhat constant
after that. Therefore, they do not increasingly attenuate the signals outside the main lobe, and
the Hamming window has the second largest gain.
The Hanning window's first side lobes are greater in magnitude than the first side lobes
of the Hamming window, but this holds for the beginning of the side lobes only. After that, the
side lobes tend to decrease in magnitude, attenuating the signals outside and farther from the
main lobe more and more. This characteristic of the Hanning window causes its gain to be lower
than that of the rectangular and Hamming windows. Here, in the SAS code, the rectangular
window was used due to its simplicity and its high gain.
Figure 4.7: Compressed speech signal using the FFT
4.3.2 Linear predictive coding:
Linear predictive coding can be chosen from the popup menu of SAS, and then a wave
file is selected; "warn4.wav" is assumed selected, as before. SAS performs the LPC analysis
according to the following specifications:
Frame size: 30 milliseconds
LPC method: autocorrelation
LPC prediction order: 13
Frame time increment: 20 milliseconds (i.e. the LPC analysis is done starting every
20 ms in time).
The LPC prediction order must be larger than 10 for any voiced signal in order to ensure
a good analysis of the signal and a better synthesis from it.
The input signal is organized into N analysis frames and saved in a matrix having N
columns and window-size rows. Each frame contains window-size samples and overlaps the
next frame by the difference between the window size and the analysis step size.
For each frame, corresponding N-column matrices are derived containing:
The LPC predictor coefficients (using the Matlab function levinson).
The predictor gain G (computed from the LPC coefficients and the frame autocorrelation).
The linear prediction error (residual), shown in Figure 4.8.
Pitch (a frame-by-frame estimate of the pitch of the signal, calculated by finding the peak
in the residual's autocorrelation for each frame).
Parcor (the parcor coefficients give the ratio between adjacent sections in a tubular
model of the speech articulators; there are L parcor coefficients for each frame of speech).
Figure 4.8: The residual signal generated from LPC analyzer
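A stripped-down version of this per-frame pipeline (framing, autocorrelation, predictor coefficients, gain, residual; pitch and parcor omitted) can be sketched in Python/NumPy. The frame sizes follow the specifications above, while the noisy sine test signal and the tiny regularization term are assumptions for illustration:

```python
import numpy as np

def lpc_frame_analysis(x, frame_len, hop, p):
    # Convention here: s_hat[n] = sum_k a_k s[n-k], so the residual is
    # e[n] = s[n] - sum_k a_k s[n-k], i.e. filtering by [1, -a_1, ..., -a_p].
    n_frames = 1 + (len(x) - frame_len) // hop
    coeffs = np.zeros((n_frames, p))
    gains = np.zeros(n_frames)
    residual = np.zeros((n_frames, frame_len))
    for i in range(n_frames):
        f = np.asarray(x[i * hop:i * hop + frame_len], dtype=float)
        r = np.array([np.dot(f[:frame_len - m], f[m:]) for m in range(p + 1)])
        R = np.array([[r[abs(j - k)] for k in range(p)] for j in range(p)])
        a = np.linalg.solve(R + 1e-9 * np.eye(p), r[1:])   # regularized solve
        e = np.convolve(f, np.concatenate(([1.0], -a)))[:frame_len]
        coeffs[i], residual[i] = a, e
        gains[i] = np.sqrt(np.dot(e, e))   # per-frame residual energy
    return coeffs, gains, residual

fs = 8000
rng = np.random.default_rng(2)
t = np.arange(4000) / fs
x = np.sin(2.0 * np.pi * 200.0 * t) + 0.05 * rng.standard_normal(len(t))
coeffs, gains, residual = lpc_frame_analysis(x, int(0.03 * fs), int(0.02 * fs), p=13)
```

Each 30 ms frame yields 13 predictor coefficients, a gain, and a residual frame; for this nearly periodic input the residual energy is a small fraction of the frame energy, which is exactly the redundancy LPC removes.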
Short time signal processing is usually done using windowing. Frames are windowed to
improve the frequency domain representation. The LPC predictor coefficients are found from the
windowed signal using the Matlab function levinson. Then LPC is used for transmitting
information of the spectral envelope. In LPC analysis, sound is assumed to be a result of an all
pole filter applied to a source with flat spectrum.
The predictor gain is found using the LPC coefficients and the autocorrelation, i.e. the
cross-correlation of a signal with itself, which describes the redundancy in the signal. The
autocorrelation method is not very accurate, but it guarantees stability.
The predictor gain G increases as the length of the analysis window increases. The length
of the analysis window specifies how much signal is used for the calculation at each step; it
affects both the gain value and the time needed to obtain that value. Hence, using a large
window size, we can get a large gain and an accurate reproduction of the given input, but it also
introduces delay and computational complexity.
For the "warn4.wav" voice file assumed selected, three types of LPC vocoder were
implemented to compress the speech signal and enhance the quality of the compressed speech;
the output of the three vocoders is shown below in Figure 4.9, along with the waveform of the
original signal.
The "warn4.wav" file is saved as a vector of size 88200 × 1, but after LPC analysis only
its LPC parameters are transmitted, and these have a much smaller size than the original signal,
as illustrated in Table 4.1.
Parameter | Size
The LPC predictor coefficients | 14 × 99
PARCOR | 13 × 99
Pitch | 1 × 99
The residual signal | 1323 × 99
The gain | 1 × 99
Table 4.1: The matrix sizes of the different LPC parameters in the case of the "Hello.wav" signal
In addition to the compression obtained from the reduction in coefficient size, the
voice-excited and residual voice-excited methods reduce the number of bits using a quantization
method applied after the analysis to reduce the number of bits in the residual signal.
Figure 4.9: The resulting LPC compressed signals along with the original signal
SAS allows listening to the resulting voices in order to compare them; it shows the
following messages in the edit text named "the output text":
"Press any key to play the original sound file"
"Press any key to play the LPC compressed file!"
"Press a key to play the voice-excited LPC compressed sound!"
"Press a key to play the randomly-excited LPC compressed sound"
The corresponding sound is then played when a key is pressed. The voice-excited LPC
compressed sound had the best quality of the three, and the plain LPC compressed sound had
better quality than the randomly-excited compressed speech. The residual-excited LPC, on the
other hand, has the lowest bandwidth, because it generates the residual signal internally and
does not need to determine whether the speech signal is voiced or unvoiced.
Nonetheless, the voice-excited LPC used here gives understandable results even though it
is not optimized. The tradeoffs between quality on one side and bandwidth and complexity on
the other clearly appear here, since the voice-excited LPC gives fairly good results within the
limitations of this project.
The difference between the three types lies in the nature of the residual signal used to
synthesize the compressed speech. The plain LPC uses the residual signal generated by the
LPC analysis function, shown in Figure 4.8. The voice-excited LPC takes the discrete cosine
transform (DCT) of the residual signal and then adds some noise to it before using it in the
synthesis part; the new residual signal is shown in Figure 4.10 below.
Figure 4.10: The residual signal used in the voice-excited LPC after the DCT transform and
added noise
The randomly-excited LPC compressed signal was obtained by defining the residual signal
as a random signal, like that shown in Figure 4.11.
Figure 4.11: Random residual signal
4.4 COMPARISON BETWEEN FFT AND LPC TECHNIQUES:
The two methods for spectral analysis, the fast Fourier transform (FFT) and linear
prediction (LPC), were implemented. FFT finds the energy distribution in the actual speech
sound, whereas LPC estimates the vocal tract filter that shaped that speech.
The advantage of FFT is its easier setup; the disadvantage is the difficulty of identifying
formants for speakers with higher-pitched voices. LPC has better success with high-pitched
voices, but its settings need to be carefully tuned for each speaker. Hence LPC is generally
more successful than FFT at analyzing the spectral properties of female or child speech, since it
is not sensitive to higher voice fundamental frequencies.
The FFT is suitable for steady-state signals with long waveform data lengths; the LPC,
on the other hand, is suitable for transient signals with short waveform data lengths. [18]
CHAPTER 5
CONCLUSION AND FUTURE WORK
5.1 CONCLUSION
The speech signal was compressed using both the FFT and LPC methods, implemented
in Matlab code presented in Appendix B. The LPC method was implemented using three types
of vocoders: the plain LPC vocoder, the voice-excited vocoder, and the residual-excited vocoder.
It was found that a tradeoff between the quality of the speech signal and the size of the
bandwidth must be made among the three types depending on the application in use: the
voice-excited LPC has the best-quality compressed sound, while the residual-excited LPC has
the smallest bandwidth. A brief comparison between the two techniques, FFT and LPC, was
also introduced in Chapter 4, showing that each of them has its own advantages and
disadvantages.
The degree of compression obtained is highly efficient, as seen in Chapter 4, and
contains two levels of compression: one in the size of the parameters and the other performed
using quantization. Unfortunately, direct quantization of the speech signal may degrade the
quality of the speech signal to an unacceptable degree.
The recovery of the original signal from the analyzed signal is not perfect due to many
factors. One of these is that, in order to perform a good analysis, the signal was divided into
frames; however, some overlapping occurred between frames, which is difficult to account for
exactly in the synthesized signal.
The SAS GUI was implemented to make it easy to deal with the code, even for those who
are not familiar with Matlab, and to obtain the analysis and synthesis results with less effort.
5.2 FUTURE WORK
The Matlab code could be extended to include other analysis and synthesis techniques,
such as homomorphic techniques, formant analysis and synthesis, and the other techniques
discussed in Chapter 2.
The GUI SAS could be improved into an advanced voice toolbox like the famous
COLEA toolbox, for example. Some additional features could also be added to the GUI SAS in
order to give the user more control. For example, the user could be allowed to choose the type
of windowing in the FFT analysis from options such as the Hamming window, the Hanning
window, or the rectangular window; the latter is the default windowing used in GUI SAS.
In the future, the results of this project could be implemented as a voice toolbox in
communication systems to improve and control the quality of speech signals and provide
more bandwidth for network operators.
REFERENCES
[1] Wai C. Chu, "Speech Coding Algorithms: Foundation and Evolution of Standardized
Coders", Mobile Media Laboratory, DoCoMo USA Labs, San Jose, California.
[2]L. R. Rabiner and R. W. Schafer, “Theory and Application of Digital Speech Processing Preliminary
Edition”.
[3] James L. Flanagan, Jont B. Allen, and Mark A. Hasegawa-Johnson, "Speech Analysis
Synthesis and Perception", Third Edition.
[4] Source–filter model of speech production, Wikipedia.org. Article at:
http://en.wikipedia.org/wiki/Source–filter_model_of_speech_production
Last seen: 10 July 2013
[5] Speech synthesis, Wikipedia.org. Article at: http://en.wikipedia.org/wiki/Speech_synthesis
Last seen: 10 July 2013
[6] B. S. Atal and Suzanne L. Hanauer, "Speech Analysis and Synthesis by Linear Prediction of
the Speech Wave", Bell Telephone Laboratories, Incorporated, Murray Hill, New Jersey 07974.
[7] Homomorphic Processing of Speech, article at:
http://isites.harvard.edu/fs/docs/icb.topic541812.files/lec11_spr09.pdf
Last seen: 10 July 2013
[8] Alan V. Oppenheim, "Speech Analysis-Synthesis System Based on Homomorphic Filtering".
[9] Speech formant frequency estimation: evaluating a nonstationary analysis method, Preeti Rao, A. Das
Barman
[10] Pitch analysis, Article at:
http://design-marketing-dictionary.blogspot.com/2009/11/voice-pitch-analysis.html
Last seen: 10 July 2013
[11] Levinson-Durbin algorithm, Wikipedia.org. Article at:
http://en.wikipedia.org/wiki/Levinson-Durbin_algorithm
Last seen: 11 July 2013
[12] Pitch, britannica.com, article at:
http://www.britannica.com/EBchecked/topic/1357164/pitch
Last seen: 11 July 2013
[13] Speech Synthesis, article at:
http://www.cs.tut.fi/courses/SGN-4010/puhesynteesi_en.pdf
Last seen: 11 July 2013
[14] X. Rodet and P. Depalle, "Spectral Envelopes and Inverse FFT Synthesis".
[15] Window function, Wikipedia.org, article at:
http://en.wikipedia.org/wiki/Window_function
Last seen: 12 July 2013
[16] Vocoder, Article at: http://wiki.radioreference.com/index.php/Vocoder
Last seen: 12 July 2013
[17] C. K. UN and D. T. Magill, "The Residual-Excited Linear Prediction Vocoder with Transmission
Rate Below 9.6 Kbits/s," IEEE Transactions on Communications, Vol. COM-23, No. 12, pp. 1466-1474,
1975.
[18] LPC SPECTRUM ANALYSIS, Article at:
http://www.soundid.net/SoundID/Papers/SpectrumAnalyzerPDF.pdf
Last seen: 12 July 2013
[19] MATLAB, mathworks.com, "Creating Graphical User Interfaces".
Last seen 16 July 2013
APPENDIX A GUI
A1
APPENDIX A
GRAPHICAL USER INTERFACE (GUI)
A graphical user interface (GUI) is a graphical display in one or more windows
containing controls, called components, that enable a user to perform interactive tasks.
The user of a GUI does not have to create a script or type commands at the command
line, and need not understand the details of the code. [19]
MATLAB GUIs can be built in two ways:
1. Using GUIDE (GUI Development Environment), an interactive GUI construction kit.
2. Creating code files that generate GUIs as functions or scripts (programmatic GUI
construction).
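A minimal example of the second, programmatic approach (an illustration only, not part of GUI SAS) creates the figure and its components directly from code:

```matlab
% Minimal programmatic GUI: a figure with one button that prints a message.
fig = figure('Name','Programmatic GUI demo','NumberTitle','off');
uicontrol(fig, 'Style','pushbutton', ...
               'String','Press me', ...
               'Position',[20 20 100 30], ...
               'Callback',@(src,evt) disp('Button pressed'));
```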
The code in Appendix B was written using GUIDE. When a GUI is saved,
GUIDE creates two files: a FIG-file and a code file. The FIG-file, with extension .fig, is a
binary file that contains a description of the layout. The code file, with extension .m,
contains the MATLAB functions that control the GUI.
The Layout Editor, shown in Figure A.1, is used to select components from
the component palette, which resides at the left side of the Layout Editor, and to arrange
them in the layout area. Some of the tools shown in Figure A.1 are listed in
Table A.1 along with their uses. [19]
Tool                 Its use
Figure Resize Tab    Sets the size at which the GUI is initially displayed when it is run.
Menu Editor          Creates menus and context menus.
Align Objects        Aligns and distributes groups of components. Grids and rulers also
                     help align components on a grid, with an optional snap-to-grid
                     capability.
Property Inspector   Sets the properties of the components in the GUI layout. It provides
                     a list of all the properties that can be set and displays their
                     current values.
Object Browser       Displays a hierarchical list of the objects in the GUI.
Run                  Saves and runs the current GUI.
Table A.1: GUI tools summary
Figure A.1: GUIDE layout editor
The component palette at the left side of the Layout Editor contains the
components that can be added to the GUI. Those used in the GUI SAS are
described in Table A.2. [19]
Component     Description
Push button   Generates an action when clicked.
Edit text     Fields that enable users to enter or modify text strings.
Static text   Controls that display lines of text.
Popup menu    Opens to display a list of choices when the user clicks the arrow.
Table A.2: GUIDE components used in GUI SAS, with a short description of each
The GUIDE code controls how the GUI responds to events. Events include button
clicks, slider movements, menu item selections, and the creation and deletion of
components. This programming takes the form of a set of functions, called callbacks, one for
each component and for the GUI figure itself. [19]
A callback is a function that is written and associated with a specific GUI
component or with the GUI figure. [19]
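For a GUIDE-generated GUI such as SAS, each callback is a subfunction whose name and signature GUIDE writes automatically. A sketch of the pattern follows (the component name mybutton is hypothetical):

```matlab
% GUIDE-style callback skeleton (illustrative only).
% h       - handle of the component that fired the event
% handles - structure of handles to all components in the GUI
function varargout = mybutton_Callback(h, eventdata, handles, varargin)
handles.count = 1;          % store data to be shared between callbacks
guidata(h, handles);        % save the modified handles structure
```

This is the same pattern followed by the callbacks in Appendix B, e.g. record_Callback and play_Callback.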
APPENDIX B THE MATLAB CODE
B1
APPENDIX B
THE MATLAB CODE
function varargout = sas(varargin)
% SAS Application M-file for sas.fig
% FIG = SAS launch sas GUI.
% SAS('callback_name', ...) invoke the named callback.
% Last Modified by GUIDE v2.0 15-Jul-2013 04:35:49
if nargin == 0 % LAUNCH GUI
fig = openfig(mfilename,'reuse');
% Use system color scheme for figure
set(fig,'Color',get(0,'defaultUicontrolBackgroundColor'));
% Generate a structure of handles to pass to callbacks, and store it.
handles = guihandles(fig);
guidata(fig, handles);
if nargout > 0
varargout{1} = fig;
end
elseif ischar(varargin{1}) % INVOKE NAMED SUBFUNCTION OR CALLBACK
try
if (nargout)
[varargout{1:nargout}] = feval(varargin{:}); % FEVAL switchyard
else
feval(varargin{:}); % FEVAL switchyard
end
catch
disp(lasterr);
end
end
% -----------------------------( Record )---------------------------------------
function varargout = record_Callback(h, eventdata, handles, varargin)
fs = 44100;
p=handles.edit11;
y = wavrecord(p*fs,fs);
[filename, pathname] = uiputfile('*.wav', 'Pick an M-file');
cd (pathname);
wavwrite(y,fs,filename);
sound(y,fs);
handles.x = y;
handles.fs = fs;
figure(1)
time = 0:1/fs:(length(handles.x)-1)/fs;
plot(time,handles.x);
title('Original Signal ');
figure(2)
specgram(handles.x, 1024, handles.fs);
title(' Spectrogram of Original Signal ');
guidata(h, handles);
% ----------------------( Exit )----------------------------------------------
function varargout = exit_Callback(h, eventdata, handles, varargin)
cl = questdlg('Do you want to EXIT?','EXIT',...
'Yes','No','No');
switch cl
case 'Yes'
close(sas);
clear all;
return;
case 'No'
return;
end
% ----------------------------( Play )----------------------------------------
function varargout = play_Callback(h, eventdata, handles, varargin)
clc;
[FileName,PathName] = uigetfile({'*.wav'},' Load Wav File ');
[x,fs] = wavread([PathName '/' FileName]);
handles.x = x;
handles.fs = fs;
sound(x,fs);
figure(1)
time = 0:1/fs:(length(handles.x)-1)/fs;
plot(time,handles.x);
title(' Original Signal ');
figure(2)
specgram(handles.x, 1024, handles.fs);
title(' Spectrogram of Original Signal ');
guidata(h,handles);
% --------------------------( Input the number of framed )------------------------------------------
function varargout = edit10_Callback(h, eventdata, handles, varargin)
user_entry = str2double(get(h,'string'));
if isnan(user_entry)
errordlg('You must enter a numeric value','Bad Input','modal')
uicontrol(h)
return
end
handles.user_entry = user_entry;
guidata(h, handles);
% ----------------------( Input the time length of the record )--------------------------------------
function varargout = edit11_Callback(h, eventdata, handles, varargin)
p = str2double(get(h,'String'));
if isnan(p)
errordlg('You must enter a numeric value','Bad Input','modal')
uicontrol(h)
return
end
handles.edit11 =p;
guidata(h, handles);
% -------------------------( Analysis And Synthesis techniques )----------------------------------
function varargout = sas_Callback(h, eventdata, handles, varargin)
val = get(h,'Value');
switch val
case 1
%------( FFT analysis )--------------
[FileName,PathName] = uigetfile({'*.wav'},'Load Wav File');
[x,fs] = wavread([PathName '/' FileName]);
handles.x = x;
handles.fs = fs;
t=0:1/fs:(length(x)-1)/fs; % time axis from the sampling frequency
figure(1)
plot(t,x); % graph it.
legend(' The orignal speech signal ');
xlabel(' The time (t) ');
ylabel(' Signal amplitude (x) ');
wl = handles.user_entry;
guidata(h,handles);
L=length(x);
x=x';
if L<wl
z=wl-L;
x=[x,zeros(1,z)];
end
win=ones(1,wl); % Rectangular Window
hop=ceil(wl/2); % Hop size of window
if hop<1
hop=wl;
end
i=1;str=1; len=wl; X=[];
while ((len<=L) | (i<2))
if i==1
if len>L % If window size exceeds the length of the signal for the 1st time
z=len-L;
x=[x,zeros(1,z)]; % padding zeros
i=1+1;
end
x1=x(str:len);
%wl=wl*[0:(wl-1)].';
size(wl);
size(x1);
X=[X;fft(x1.*win)]; % window the frame and take its FFT
str=str+hop; len=str+wl-1; % to make window overlapping
end
end
figure ,
waterfall(abs(X)); title('Result by waterfall function')
guidata(h, handles);
%--------------( The FFT synthsis )--------------------------
fs = 44100;
ftsize = (size(X,2));
s = size(X);
cols = s(1);
% special case: rectangular window
win = ones(1,ftsize);
w = length(win);
% now can set default hop
h = floor(w/2);
xlen = ftsize + (cols-1)*h;
y = zeros(1,xlen);
for b = 0:h:(h*(cols-1))
ft = X(1+b/h,:)';
px = real(ifft(ft));
y((b+1):(b+ftsize)) = y((b+1):(b+ftsize))+px'.*win;
end;
t=0:1/fs:(length(y)-1)/fs; % time axis from the sampling frequency
figure ,
plot(t,y); % graph it.
title(' The compressed speech signal ')
guidata(h, handles);
case 2
%------------( LPC analysis and synthesis)---------
[FileName,PathName] = uigetfile({'*.wav'},'Load Wav File');
[inspeech,Fs] = wavread([PathName '/' FileName]);
handles.inspeech = inspeech;
handles.Fs = Fs;
outspeech1 = speech1(inspeech,Fs);
outspeech2 = speech2(inspeech,Fs); % Voice-excited LPC vocoder
outspeech3 = speech3(inspeech,Fs); % Randomly excited LPC vocoder
% plot results
figure(1);
subplot(4,1,1);
plot(inspeech);
legend(' The original speech signal ');
grid;
subplot(4,1,2);
plot(outspeech1);
legend(' The LPC compressed speech signal ');
grid;
subplot(4,1,3);
plot(outspeech2);
legend(' The voice-excited LPC compressed speech signal ');
grid;
subplot(4,1,4);
plot(outspeech3);
legend(' The randomly-excited LPC compressed speech signal ');
grid;
s1=sprintf(' Press any key to play the original sound file ');
set(handles.edit14,'String',s1);
pause;
soundsc(inspeech, Fs);
s2=sprintf(' Press any key to play the LPC compressed file! ');
set(handles.edit14,'String',s2);
pause;
soundsc(outspeech1, Fs);
s3=sprintf(' Press a key to play the voice-excited LPC compressed sound! ');
set(handles.edit14,'String',s3);
pause;
soundsc(outspeech2, Fs);
s4=sprintf(' Press a key to play the randomly-excited LPC compressed sound! ');
set(handles.edit14,'String',s4);
pause;
soundsc(outspeech3, Fs);
s5=sprintf(' The output text ... ');
set(handles.edit14,'String',s5);
end
% --------------------------------------------------------------------------
function [ outspeech ] = speech1( inspeech , Fs, Order)
% (coded and resynthesized)
if ( nargin < 1)
error('argument check failed');
end;
if(nargin<2)
Fs = 16000; % sampling rate in Hertz (Hz)
end
if (nargin<3)
Order = 10; % order of the model used by LPC
end
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor] =lpc_analysis(inspeech, Fs, Order);
figure,
plot(resid);
% decode/synthesize speech using LPC and impulse-trains as excitation
outspeech = lpcsyn(aCoeff, pitch, Fs, G);
%------------------------------------------------------------------------------------
function [ outspeech] = speech2( inspeech,Fs )
% (coded and resynthesized)
if ( nargin ~= 2)
error('argument check failed');
end;
Order = 10; % order of the model used by LPC
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor] = lpc_analysis(inspeech, Fs, Order);
% perform a discrete cosine transform on the residual
resid = dct(resid);
[a,b] = size(resid);
% only use the first Nr (i.e. 50) DCT-coefficients this can be done
% because most of the energy of the signal is conserved in these coeffs
Nr=180;
resid = [ resid(1:Nr,:); zeros(a-Nr,b) ];
% quantize the data using Nq (i.e. 4) bits
Nq=6;
resid = uencode(resid,Nq);
resid = udecode(resid,Nq);
% perform an inverse DCT
resid = idct(resid);
% add some noise to the signal to make it sound better
noise = [ zeros(Nr,b); 0.01*randn(a-Nr,b) ];
resid = resid + noise;
figure,
plot(resid);
title(' the Residual signal plus noise ' );
% decode/synthesize speech using LPC and the compressed residual as excitation
outspeech = lpcsyn2(aCoeff, resid, Fs, G);
%----------------------------------------------------------------------------------
function [ outspeech ] = speech3( inspeech,Fs )
% (coded and resynthesized)
if ( nargin ~= 2)
error('argument check failed');
end;
Order = 10; % order of the model used by LPC
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor] = lpc_analysis(inspeech, Fs, Order);
[a,b] = size(resid);
% use a random signal as the residual
resid= 0.01*randn(a,b);
figure,
plot(resid);
title(' random residual signal ');
% decode/synthesize speech using LPC and the compressed residual as excitation
outspeech = lpcsyn2(aCoeff, resid, Fs, G);
%--------------------------------------------------------------------------
function [aCoeff,resid,pitch,G,parcor] = lpc_analysis(data,sr,L,fr,fs,preemp)
%This function computes the LPC (linear-predictive coding) coefficients that describe a
% speech signal.
% L - The order of the analysis. There are L+1 LPC coefficients in the output
% array aCoeff for each frame of data.
% fr - Frame time increment, in ms. The LPC analysis is done starting every
% fr ms in time. Defaults to 20ms (50 LPC vectors a second)
% fs - Frame size in ms. The LPC analysis is done by windowing the speech
% data with a rectangular window that is fs ms long. Defaults to 30ms
% preemp - This variable is the epsilon in a digital one-zero filter which
% serves to preemphasize the speech signal and compensate for the 6dB
% per octave rolloff in the radiation function. Defaults to .9378.
% The output variables from this function are aCoeff - The LPC analysis results, a(i).
%One column of L numbers for each frame of data
% resid - The LPC residual, e(n). One column of sr*fs samples representing
% the excitation or residual of the LPC filter.
% pitch - A frame-by-frame estimate of the pitch of the signal, calculated
% by finding the peak in the residual's autocorrelation for each frame.
% G - The LPC gain for each frame.
% parcor - The parcor coefficients. The parcor coefficients give the ratio
% between adjacent sections in a tubular model of the speech
% articulators. There are L parcor coefficients for each frame of speech.
if (nargin<3), L = 13; end
if (nargin<4), fr = 20; end
if (nargin<5), fs = 30; end
if (nargin<6), preemp = .9378; end
[row col] = size(data);
if col==1, data=data'; end
nframe = 0;
msfr = round(sr/1000*fr); % Convert ms to samples
msfs = round(sr/1000*fs); % Convert ms to samples
duration = length(data);
speech = filter([1 -preemp], 1, data)'; % Preemphasize speech
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]'; % Compute part of window
for frameIndex=1:msfr:duration-msfs+1 % frame rate=20ms
frameData = speech(frameIndex:(frameIndex+msfs-1)); % frame size=30ms
nframe = nframe+1;
autoCor = xcorr(frameData); % Compute the cross correlation
autoCorVec = autoCor(msfs+[0:L]);
% Levinson's method
err(1) = autoCorVec(1);
k(1) = 0;
A = [];
for index=1:L
numerator = [1 A.']*autoCorVec(index+1:-1:2)';
denominator = -1*err(index);
k(index) = numerator/denominator; % PARCOR coeffs
A = [A+k(index)*flipud(A); k(index)];
err(index+1) = (1-k(index)^2)*err(index);
end
aCoeff(:,nframe) = [1; A];
parcor(:,nframe) = k';
if 0
gain=0;
cft=0:(1/255):1;
for index=1:L
gain = gain + aCoeff(index,nframe)*exp(-i*2*pi*cft).^index;
end
gain = abs(1./gain);
spec(:,nframe) = 20*log10(gain(1:128))';
figure(3);
plot(20*log10(gain))
title(nframe)
drawnow;
end
% Calculate the filter response from the filter's impulse response (to check above).
if 0
impulseResponse = filter(1, aCoeff(:,nframe), [1 zeros(1,255)]);
freqResp = 20*log10(abs(fft(impulseResponse)));
plot(freqResp);
end
errSig = filter([1 A'],1,frameData); % find excitation noise
G(nframe) = sqrt(err(L+1)); % gain
autoCorErr = xcorr(errSig); % calculate pitch & voicing information
[B,I] = sort(autoCorErr);
num = length(I);
if B(num-1) > .01*B(num)
pitch(nframe) = abs(I(num) - I(num-1));
else
pitch(nframe) = 0;
end
% calculate additional info to improve the compressed sound quality
resid(:,nframe) = errSig/G(nframe);
if(frameIndex==1) % add residual frames using a trapezoidal window
stream = resid(1:msfr,nframe);
else
stream = [stream;
overlap+resid(1:msoverlap,nframe).*ramp; resid(msoverlap+1:msfr,nframe)];
end
if(frameIndex+msfr+msfs-1 > duration)
stream = [stream; resid(msfr+1:msfs,nframe)];
else
overlap = resid(msfr+1:msfs,nframe).*flipud(ramp);
end
end
stream = filter(1, [1 -preemp], stream)';
%---------------------------------------------------------------------------------------
function synWave = lpcsyn(aCoeff,pitch,sr,G,fr,fs,preemp)
% This function synthesizes a (speech) signal based on a LPC (linear-
% predictive coding) model of the signal.
% LPC synthesis produces a monaural sound vector (synWave) which is
% sampled at a sampling rate of "sr". The following parameters are mandatory
% aCoeff - The LPC analysis results, a(i). One column of L+1 numbers for each
% frame of data. The number of rows of aCoeff determines L.
% G - The LPC gain for each frame.
% pitch - A frame-by-frame estimate of the pitch of the signal, calculated
% by finding the peak in the residual's autocorrelation for each frame.
% The following parameters are optional and default to the indicated values.
% fr - Frame time increment, in ms. The LPC analysis is done starting every
% fr ms in time. Defaults to 20ms (50 LPC vectors a second)
% fs - Frame size in ms. The LPC analysis is done by windowing the speech
% data with a rectangular window that is fs ms long. Defaults to 30ms
% preemp - This variable is the epsilon in a digital one-zero filter which
% serves to preemphasize the speech signal and compensate for the 6dB
% per octave rolloff in the radiation function. Defaults to .9378.
if (nargin < 5), fr = 20; end;
if (nargin < 6), fs = 30; end;
if (nargin < 7), preemp = .9378; end;
msfs = round(sr*fs/1000); % framesize in samples
msfr = round(sr*fr/1000); % framerate in samples
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]';
[L1 nframe] = size(aCoeff); % L1 = 1+number of LPC coeffs
for frameIndex=1:nframe
A = aCoeff(:,frameIndex);
% first check if it is voiced or unvoiced sound:
if ( pitch(frameIndex) ~= 0 )
t = 0 : 1/sr : fs*10^(-3); % sr sample freq. for fr ms
d = 0 : 1/pitch(frameIndex) : 1; % 1/pitchfreq. repetition freq.
residFrame = 0.1*(pulstran(t, d, 'tripuls',1/pitch(frameIndex)))';
% train of triangular pulses repeated at the pitch period
residFrame = residFrame + 0.01*randn(msfs+1,1);
else
residFrame = [];
for m = 1:msfs
residFrame = [residFrame; randn];
end % for
end;
synFrame = filter(G(frameIndex), A', residFrame); % synthesize speech from LPC coeffs
if(frameIndex==1) % add synthesized frames using a trapezoidal window
synWave = synFrame(1:msfr);
else
synWave = [synWave;
overlap+synFrame(1:msoverlap).*ramp;synFrame(msoverlap+1:msfr)];
end
if(frameIndex==nframe)
synWave = [synWave; synFrame(msfr+1:msfs)];
else
overlap = synFrame(msfr+1:msfs).*flipud(ramp);
end
end;
%-------------------------------------------------------------------------------------------
function synWave = lpcsyn2(aCoeff,resid,sr,G,fr,fs,preemp)
% This function synthesizes a (speech) signal based on a LPC (linear-
% predictive coding) model of the signal. Used with the voice excited and randomly
%excited LPC.
% LPC synthesis produces a monaural sound vector (synWave) which is
% sampled at a sampling rate of "sr". The following parameters are mandatory
% aCoeff - The LPC analysis results, a(i). One column of L+1 numbers for each
% frame of data. The number of rows of aCoeff determines L.
% G - The LPC gain for each frame.
% pitch - A frame-by-frame estimate of the pitch of the signal, calculated
% by finding the peak in the residual's autocorrelation for each frame.
% The following parameters are optional and default to the indicated values.
% fr - Frame time increment, in ms. The LPC analysis is done starting every
% fr ms in time. Defaults to 20ms (50 LPC vectors a second)
% fs - Frame size in ms. The LPC analysis is done by windowing the speech
% data with a rectangular window that is fs ms long. Defaults to 30ms
% preemp - This variable is the epsilon in a digital one-zero filter which
% serves to preemphasize the speech signal and compensate for the 6dB
% per octave rolloff in the radiation function. Defaults to .9378.
if (nargin < 5), fr = 20; end;
if (nargin < 6), fs = 30; end;
if (nargin < 7), preemp = .9378; end;
msfs = round(sr*fs/1000); % framesize in samples
msfr = round(sr*fr/1000); % framerate in samples
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]';
[L1 nframe] = size(aCoeff); % L1 = 1+number of LPC coeffs
for frameIndex=1:nframe
A = aCoeff(:,frameIndex);
residFrame = resid(:,frameIndex);
synFrame = filter(G(frameIndex), A', residFrame); % synthesize speech from LPC coeffs
if(frameIndex==1) % add synthesized frames using a trapezoidal window
synWave = synFrame(1:msfr);
else
synWave = [synWave;
overlap+synFrame(1:msoverlap).*ramp;synFrame(msoverlap+1:msfr)];
end
if(frameIndex==nframe)
synWave = [synWave; synFrame(msfr+1:msfs)];
else
overlap = synFrame(msfr+1:msfs).*flipud(ramp);
end
end;
% --------------------------------------------------------------------