SPEECH ANALYSIS AND SYNTHESIS
By
Faiga Mohammed Mohammed Ahmed Alawd
INDEX NO. 084057
Supervisor
Ostaz Mohammed Jaafr Elnourani
A thesis submitted in partial fulfillment for the degree of
B.Sc. (HON)
To the Department of Electrical and Electronic Engineering
(COMMUNICATION ENGINEERING)
Faculty of Engineering
University of Khartoum
July 2013
DECLARATION OF ORIGINALITY
I declare that this report entitled “SPEECH ANALYSIS AND SYNTHESIS” is my
own work except as cited in the references. The report has not been accepted for any degree and
is not being submitted concurrently in candidature for any degree or other award.
Signature : _________________________
Name : _________________________
Date : _________________________
ABSTRACT
Speech signals are important signals in communication systems. They must be analyzed in order to obtain their important parameters, and compressed to make the maximum use of the available bandwidth.
Speech analysis and synthesis comprises many techniques, each with its own advantages and disadvantages. This project investigates some speech analysis and synthesis techniques and compares them.
In this project, spectral analysis and synthesis of two types, the FFT technique and the LPC technique, were implemented using Matlab code, and a Matlab GUI was built to display the results of the analysis and synthesis. The GUI was named SAS, an abbreviation of Speech Analysis and Synthesis.
المستخلص
Speech signals are among the most important signals used in communication systems; they must therefore be analyzed in order to obtain their important parameters, and compressed to make optimal use of the available bandwidth.
Speech analysis and synthesis has many techniques, each with its own merits and drawbacks. This project seeks to explore some types of speech analysis and synthesis and to compare these different types.
In this project, spectral speech analysis and synthesis of both kinds, the Fourier transform and linear prediction coefficients, were studied using Matlab, and the results were then displayed using the graphical user interface available in Matlab. This interface was named after the initials of the project's overall topic, Speech Analysis and Synthesis.
ACKNOWLEDGEMENT
Unlimited praise to Allah, as numerous as His creatures, as pleases Him, as weighty as His Throne, and as extensive as His words.
I would like to thank the people who have contributed to the completion of this thesis. Firstly, I would like to thank my advisor, Ostaz Mohammed Jaafer Alnourani, for his constant guidance and motivation. He was always available to discuss my thesis, edit my manuscript, and offer encouragement. This thesis would not have been possible without him. I would also like to thank my parents and my extended family for their support and motivation while I was writing this thesis.
I also owe special thanks to my friend and partner Randa Abd Almonem Sabir; we were destined to meet and become friends. And to my sister Wefag Mohammed, who is always there for me when I need her.
It was a great pleasure to have the opportunity to work with and learn from such high-quality teachers in the University of Khartoum Faculty of Engineering. They taught me that in order to be a good engineer you have to be a good person in the first place.
DEDICATION
To the person who taught me the true meaning of love, the meaning of giving endlessly,
without limits,
To my mother
Suad Al-Shekh Taha
TABLE OF CONTENTS
TITLE
DECLARATION OF ORIGINALITY
ABSTRACT
المستخلص
ACKNOWLEDGEMENT
DEDICATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS AND TERMINOLOGIES
CHAPTER 1: INTRODUCTION
1.1. Introduction
1.2. Motivation
1.3. Problem statement
1.4. Objectives
1.5. Methodology
1.6. Thesis layout
CHAPTER 2: LITERATURE REVIEW
2.1. Introduction
2.2. Speech signals
2.2.1. Origin of speech signals
2.2.2. Classification of speech signals
2.3. Speech communication
2.4. Speech analysis
2.4.1. Spectral analysis of speech
2.4.2. Homomorphic analysis
2.4.3. Formant analysis
2.4.4. Analysis of voice pitch
2.5. Speech synthesis
2.5.1. Synthesizer technologies
2.5.1.1. Concatenative synthesis
2.5.1.2. Formant synthesis
2.5.2. Text to speech
2.6. Similar papers and researches
2.6.1. Speech Analysis and Synthesis by Linear Prediction of the Speech Wave
2.6.2. Spectral Envelopes and Inverse FFT Synthesis
CHAPTER 3: METHODOLOGY
3.1. Introduction
3.2. FFT method
3.2.1. Short-time frequency analysis
3.2.2. Choice of the weighting function
3.3. LPC method
3.3.1. The Levinson–Durbin algorithm
3.3.2. Linear predictive coding (LPC) vocoder
3.3.3. Voice-excited linear prediction (VELP)
3.3.4. Residual-excited linear predictive (RELP) vocoder
3.4. Matlab implementation
CHAPTER 4: IMPLEMENTATION AND RESULTS
4.1. Introduction
4.2. Matlab implementation using GUI
4.3. Speech analysis and synthesis techniques
4.3.1. Short-time Fourier transform
4.3.2. Linear predictive coding
4.4. Comparison between the FFT and LPC techniques
CHAPTER 5: CONCLUSION AND FUTURE WORK
5.1. Conclusion
5.2. Future work
REFERENCES
APPENDIX A
APPENDIX B
LIST OF FIGURES
Figure 2.1  Diagram of the human speech production system
Figure 2.2  Source-filter model of speech
Figure 2.3  Homomorphic analyzer
Figure 2.4  Overview of a typical TTS system
Figure 2.5  Block diagram of the speech synthesizer by linear prediction
Figure 3.1  The rectangular window
Figure 3.2  The Hamming window
Figure 3.3  The Hanning window
Figure 3.4  Simple speech production
Figure 3.5  Encoder and decoder for the LPC vocoder
Figure 3.6  LPC analyzer
Figure 3.7  Block diagram of voice-excited LPC vocoders
Figure 3.8  Flow chart of the FFT analysis
Figure 3.9  Flow chart of the FFT synthesis
Figure 3.10  Flow chart of the LPC analysis
Figure 3.11  Flow chart of the LPC synthesis
Figure 4.1  GUI interface in Matlab "SAS"
Figure 4.2  The waveform of the recorded signal
Figure 4.3  The spectrogram of the speech signal under consideration
Figure 4.4  GUI interface in Matlab "SAS" with a popup menu
Figure 4.5  The original speech signal "warn4.wav"
Figure 4.6  3D representation of the short-time Fourier transform for each frame
Figure 4.7  Compressed speech signal using the FFT
Figure 4.8  The residual signal generated from the LPC analyzer
Figure 4.9  The resulting LPC compressed signals along with the original signal
Figure 4.10  The residual signal used in the voice-excited LPC after it had a DCT transform and noise was added to it
Figure 4.11  Random residual signal
Figure A.1  GUIDE layout editor
LIST OF TABLES
Table 4.1  The matrix size of the different LPC parameters in the case of the "Hello.wav" signal
Table A.1  GUI tools summary
Table A.2  GUIDE components used, with a short description of each
LIST OF ABBREVIATIONS
LPC  Linear predictive coding
FFT  Fast Fourier transform
DSP  Digital signal processing
TTS  Text to speech
FFT⁻¹  Inverse fast Fourier transform
IIR  Infinite impulse response
LP  Linear prediction
AR  Autoregressive
VELP  Voice-excited linear prediction
DCT  Discrete cosine transform
RELP  Residual-excited linear prediction
SAS  Speech analysis and synthesis
GUI  Graphical user interface
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
This chapter engages the reader with the ideas behind this thesis. It consists of the project background, problem statement, objectives, methodology, and project scope. In addition, an overview of the report layout is presented.
1.2 MOTIVATION
Accompanying the explosive growth of the Internet is the growing need for audio
compression, or data compression in general. The major goal in audio compression is to
compress the audio signal, either for reducing transmission bandwidth requirements or for
reducing memory storage requirements, without sacrificing quality.
Digital cellular phones, for example, use some type of compression algorithm to compress the voice signal in real time over general switched telephone networks. Audio compression can also be used off-line to reduce storage requirements for mail forwarding of voice messages or for storing voice messages in a digital answering machine.
Speech and other audio signals represented by sampled data are often compressed by being quantized to a low bit rate during data transmission in order to obtain faster data transfer rates. Speech compression allows smaller bandwidth, higher data rates, or a combination of these attributes. It can also be used to store speech-like data in a compact form. To compress the speech signal, speech analysis techniques must be applied first.
1.3 PROBLEM STATEMENT
The core objective of this project is to understand different audio signal processing techniques, mainly the LPC and FFT analysis and synthesis techniques; to gain a profound knowledge of how to implement them in Matlab; to observe their characteristics with different parameters; and to compare the results obtained. In general, the goal is to understand the objectives and basic techniques of audio signal processing.
The main objective of audio coding or compression is to avoid the bandwidth and storage issues associated with audio recording, transmission, and storage. This can be achieved by representing the signal with a minimum number of bits while achieving transparent signal reproduction.
1.4 OBJECTIVES
This project aims to implement Matlab code that analyzes and synthesizes the speech signal using the following techniques:
FFT analysis
FFT synthesis
LPC analysis
LPC synthesis
It also aims to design a Matlab GUI that makes it easy to work with the Matlab code.
Note that the quality of the compressed signal that results from both methods is a bottleneck in this implementation. The compressed speech must be as close as possible to the source speech.
1.5 METHODOLOGY
A Matlab GUI was implemented as a visual tool for speech analysis and synthesis. The speech signal was drawn in the time and frequency domains to emphasize its characteristics and then compressed using the FFT and LPC techniques.
1.6 THESIS LAYOUT
This section presents brief information about the remaining thesis chapters, including the appendices.
In Chapter 2 (Literature Review), the speech signal is introduced in conjunction with its relationship to communication systems, and different analysis and synthesis techniques are introduced. Finally, some of the previous work and research papers related to speech analysis and synthesis are discussed briefly.
In Chapter 3 (Methodology), the FFT and LPC methods are introduced as the base techniques for the Matlab code.
In Chapter 4 (Implementation and Results), the detailed design of the Matlab GUI code is provided, and graphs of the resulting compressed speech signals are shown in order to compare them.
In Chapter 5 (Conclusion and Future Work), the results are discussed in relation to the objectives, and future work and recommendations are presented.
The References section contains the citations used, indexed by number. It includes various books, websites, and papers.
Appendix A contains a brief description of the Matlab GUI. In Appendix B, the Matlab code used in this project is presented.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
This chapter introduces speech signals, from their origins to their uses in communication. Hence it begins with a brief summary of the speech signal, a short explanation of the origins of speech signals, and their classifications.
The sections that follow describe the main types of speech analysis and synthesis. The chapter then ends with previous work related to the analysis and synthesis of speech.
2.2 Speech Signals
2.2.1 Origin of speech signals
The speech waveform is a sound pressure wave originating from controlled movements
of anatomical structures making up the human speech production system. A simplified structural
view is shown in Figure 2.1. Speech is basically generated as an acoustic wave that is radiated
from the nostrils and the mouth when air is expelled from the lungs with the resulting flow of air
perturbed by the constrictions inside the body. It is useful to interpret speech production in terms
of acoustic filtering. The three main cavities of the speech production system are nasal, oral, and
pharyngeal forming the main acoustic filter. The filter is excited by the air from the lungs and is
loaded at its main output by radiation impedance associated with the lips. [1]
Figure 2.1: Diagram of the human speech production system
The vocal tract refers to the pharyngeal and oral cavities grouped together. The form and
shape of the vocal and nasal tracts change continuously with time, creating an acoustic filter with
time-varying frequency response. As air from the lungs travels through the tracts, the frequency
spectrum is shaped by the frequency selectivity of these tracts. The resonance frequencies of the
vocal tract tube are called formant frequencies or simply formants, which depend on the shape
and dimensions of the vocal tract. [1]
Inside the larynx is one of the most important components of the speech production system: the vocal cords. The cords are located at the height of the "Adam's apple". Vocal cords are a pair of elastic bands of muscle and mucous membrane that open and close rapidly during speech production. The speed at which the cords open and close is unique to each individual and defines the features and personality of the particular voice. [1]
The speech signal is created at the vocal cords, travels through the vocal tract, and is produced at the speaker's mouth; it then reaches the listener's ear as a pressure wave. The speech waveform is a representation of the amplitude variations in the signal as a function of time. The speech signal can take the form of a sinusoidal wave or a wave with harmonics.
2.2.2 Classification of Speech Signals
Roughly speaking, a speech signal can be classified as voiced or unvoiced. Voiced
sounds are generated when the vocal cords vibrate in such a way that the flow of air from the
lungs is interrupted periodically, creating a sequence of pulses to excite the vocal tract. With the
vocal cords stationary, the turbulence created by the flow of air passing through a constriction of
the vocal tract generates unvoiced sounds. In time domain, voiced sound is characterized by
strong periodicity present in the signal, with the fundamental frequency referred to as the pitch
frequency, or simply pitch. For men, pitch ranges from 50 to 250 Hz, while for women the range
usually falls somewhere in the interval of 120 to 500 Hz. Unvoiced sounds, on the other hand, do
not display any type of periodicity and are essentially random in nature. [1]
For the voiced frame, there is clear periodicity in time domain, where the signal repeats
itself in a quasiperiodic pattern; and also in frequency domain, where a harmonic structure is
observed. Hence the spectrum indicates dominant low-frequency contents, due mainly to the
relatively low value of the pitch frequency. For the unvoiced frame, however, the signal is
essentially random. In the spectrum there is a significant amount of high-frequency components,
corresponding to rapidly changing signals. It is necessary to indicate that the voiced / unvoiced
classification might not be absolutely clear for all frames, since during transitions (voiced to unvoiced or vice versa) there will be randomness and quasiperiodicity that is difficult to judge as
strictly voiced or strictly unvoiced. [1]
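The voiced/unvoiced distinction described above can be illustrated with a small sketch. The thesis's own implementation is in Matlab; the Python fragment below is an assumed, simplified heuristic (zero-crossing rate with an arbitrary threshold), not the project's method: quasi-periodic voiced frames cross zero far less often than noise-like unvoiced frames.

```python
import numpy as np

def classify_frame(frame, zcr_threshold=0.25):
    """Label a frame 'voiced' or 'unvoiced' by its zero-crossing rate.
    Illustrative heuristic only; the threshold is an assumption."""
    signs = np.sign(frame)
    signs[signs == 0] = 1
    zcr = np.mean(signs[1:] != signs[:-1])  # fraction of sign changes
    return "unvoiced" if zcr > zcr_threshold else "voiced"

fs = 8000
t = np.arange(160) / fs                    # one 20 ms frame at 8 kHz
voiced_like = np.sin(2 * np.pi * 120 * t)  # 120 Hz tone, in the pitch range
rng = np.random.default_rng(0)
unvoiced_like = rng.standard_normal(160)   # white noise, frication-like
```

Real coders combine several such features (energy, periodicity, spectral tilt), since, as noted above, transition frames are hard to judge from any single cue.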
For most speech coders, the signal is processed on a frame-by-frame basis, where a frame
consists of a finite number of samples. The length of the frame is selected in such a way that the
statistics of the signal remain almost constant within the interval. This length is typically
between 20 and 30 ms, or 160 and 240 samples for 8-kHz sampling. [1]
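The frame-by-frame processing just described can be sketched as follows. This is a generic Python illustration (the thesis uses Matlab); the 50% overlap is an assumed, common choice not stated in the text.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames of frame_len samples,
    advancing by hop samples, as in frame-by-frame speech coding."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

fs = 8000                     # 8 kHz sampling, as in the text
frame_len = int(0.020 * fs)   # 20 ms -> 160 samples
hop = frame_len // 2          # 50% overlap (assumed choice)
x = np.arange(8000, dtype=float)  # one second of dummy samples
frames = split_into_frames(x, frame_len, hop)
```

Each row of `frames` is then short enough that the signal statistics are approximately constant within it, which is the assumption the coder relies on.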
The source-filter model of speech production, shown in Figure 2.2, models speech as a combination of a sound source, such as the vocal cords, and a linear acoustic filter, the vocal tract (and radiation characteristic). [4]
While only an approximation, the model is widely used in a number of applications because of its relative simplicity. It is used in both speech synthesis and speech analysis, and is related to linear prediction. [4]
In implementations of the source-filter model of speech production, the sound source, or excitation signal, is often modeled as a periodic impulse train for voiced speech, or white noise for unvoiced speech. The vocal tract filter is, in the simplest case, approximated by an all-pole filter, where the coefficients are obtained by performing linear prediction to minimize the mean-squared error in the speech signal to be reproduced. Convolution of the excitation signal with the filter response then produces the synthesized speech. [4]
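The synthesis step of the source-filter model can be sketched directly from this description: an impulse-train excitation driven through an all-pole filter. The sketch below is in Python for brevity (the thesis uses Matlab), and the 100 Hz pitch and single-pole "vocal tract" are toy assumptions, not estimated values.

```python
import numpy as np

def all_pole_filter(excitation, a):
    """Run the excitation through an all-pole filter
    1 / (1 - sum_k a_k z^-k): the synthesis step of the source-filter
    model (excitation convolved with the vocal-tract response)."""
    p = len(a)
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * y[n - k]
        y[n] = acc
    return y

fs = 8000
pitch_period = fs // 100            # 100 Hz pitch (assumed)
excitation = np.zeros(400)
excitation[::pitch_period] = 1.0    # periodic impulse train (voiced source)
a = np.array([0.5])                 # toy single-pole "vocal tract"
speech = all_pole_filter(excitation, a)
```

For unvoiced sounds the impulse train would simply be replaced by white noise; the filter stays the same.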
Figure 2.2: Source-Filter Model of Speech

2.3 Speech communication
The purpose of speech is communication, i.e., the transmission of messages. There are
several ways of characterizing the communications potential of speech. One highly quantitative
approach is in terms of information theory ideas, as introduced by Shannon.
According to information theory, speech can be represented in terms of its message
content, or information. An alternative way of characterizing speech is in terms of the signal
carrying the message information, i.e., the acoustic waveform. [2]
The message is first represented in some abstract form in the brain of the speaker; the information in the message is then ultimately converted to an acoustic signal.
The information that is communicated through speech is intrinsically of a discrete nature;
i.e., it can be represented by a concatenation of elements from a finite set of symbols. The
symbols from which every sound can be classified are called phonemes. Each language has its
own distinctive set of phonemes, typically numbering between 20 and 50. For example, English
can be represented by a set of around 40 phonemes. [2]
A central concern of information theory is the rate at which information is conveyed. For
speech a crude estimate of the information rate can be obtained by noting that physical
limitations on the rate of motion of the articulators require that humans produce speech at an
average rate of about 10 phonemes per second. [2]
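The crude estimate mentioned above can be worked through numerically. Assuming (simplistically) equiprobable, independent phonemes, each of the roughly 40 English phonemes carries log2(40) bits, so 10 phonemes per second give an information rate on the order of tens of bits per second:

```python
import math

# Crude information-rate estimate from the text: ~10 phonemes/second drawn
# from ~40 English phonemes. Assuming equiprobable, independent phonemes
# (a deliberate simplification), each carries log2(40) bits.
bits_per_phoneme = math.log2(40)            # about 5.3 bits
rate_bits_per_s = 10 * bits_per_phoneme     # about 53 bits/second
```

This tiny figure, compared with the tens of kilobits per second of raw sampled speech, is exactly what motivates the compression discussed in Chapter 1.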
In speech communication systems, the speech signal is transmitted, stored, and processed
in many ways. Technical concerns lead to a wide variety of representations of the speech signal.
In general, there are two major concerns in any system:
1. Preservation of the message content in the speech signal.
2. Representation of the speech signal in a form that is convenient for transmission or
storage, or in a form that is flexible so that modifications can be made to the speech signal (e.g.,
enhancing the sound) without seriously degrading the message content. [2]
The representation of the speech signal must be such that the information content can
easily be extracted by human listeners, or automatically by machine.
2.4 Speech analysis
Speech analysis considers speech sounds in terms of their method of production and the level of processing between the digitized acoustic waveform and the acoustic feature vectors, i.e., the extraction of "interesting" information in the form of an acoustic vector.
2.4.1 Spectral Analysis of Speech
Frequency-domain representation of speech information appears advantageous from two
standpoints. First, the acoustic analysis of the vocal mechanism shows that the normal mode or natural frequency concept permits concise description of speech sounds. Second, clear evidence
exists that the ear makes a crude frequency analysis at an early stage in its processing. Any
spectral measure applicable to the speech signal should therefore reflect temporal features of
perceptual significance as well as spectral features. [3]
The purpose of spectral analysis is to find out how acoustic energy is distributed across
frequency. Typical uses in phonetics are discovering the spectral properties of the vowels and
consonants of a language, comparing the productions of different speakers, or finding
characteristics that point forward to speech perception or back to articulation. [3]
Formerly, calculation was time-consuming so it was more practical to work on the lab bench
using bandpass filters and then measure the filter output at a range of frequencies. From the
1950s onward, this was done by the spectrograph that burnt a spectrogram onto paper as a
permanent record. Nowadays, a suitable computer program will calculate speech spectra in
seconds. [3]
There are two main methods for spectral analysis: the fast Fourier transform (FFT) and linear prediction (LPC). FFT finds the energy distribution in the actual speech sound, whereas LPC estimates the vocal tract filter that shaped that speech. The advantage of FFT is easier setup; the disadvantage is difficulty identifying formants for speakers with higher-pitched voices. LPC has better success with high-pitched voices, but the settings need to be carefully tuned for each speaker. [3]
2.4.2 Homomorphic Analysis
The approach is based on the observation that the mouth output pressure is approximately the
linear convolution of the vocal excitation signal and the impulse response of the vocal tract.
Homomorphic filtering is applied to deconvolve the components and provide for their individual
processing and description. [3]
The analyzer is based on a computation of the cepstrum, considered as the inverse Fourier transform of the log magnitude of the Fourier transform, as shown in Figure 2.3. The transmitted parameters represent pitch and voiced-unvoiced information and the low-time portion of the cepstrum, representing an approximation to the cepstrum of the vocal-tract impulse response. [8]
In the synthesis, the low‐time cepstral information is transformed to an impulse response
function, which is then convolved with a train of impulses during voiced portions or a noise
waveform during unvoiced portions to reconstruct the speech. [8]
Since no phase information is retained in the analysis, phase must be regenerated during
synthesis. Either a zero‐phase or minimum‐phase characteristic can be obtained by simple
weighting of the cepstrum before transformation. The analysis consists of a measurement of the
cepstrum and a characterization of the excitation function by means of a voiced-unvoiced decision
and measurement of the pitch during voicing. The parameters used to characterize the spectral
envelope are samples of the cepstrum. [8]
Since the excitation function introduces into the cepstrum sharp peaks at multiples of a pitch
period, we would generally choose the cutoff time to be less than the smallest expected pitch
period. The cepstrum is obtained by weighting the input speech with a suitable window. [8]
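The cepstrum definition given above, and the property that makes homomorphic deconvolution work, can be sketched in a few lines. This is a generic Python illustration (the thesis itself uses Matlab), with random toy signals standing in for the excitation and vocal-tract response:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum exactly as described in the text: the inverse Fourier
    transform of the log magnitude of the Fourier transform. (In practice
    the frame is first weighted with a suitable window.)"""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)  # floor avoids log(0)
    return np.real(np.fft.ifft(log_mag))

# Homomorphic property: convolution of excitation and vocal-tract response
# becomes ADDITION in the cepstral domain, which is what lets the two
# components be separated by liftering (cutting at a chosen quefrency).
# For circular convolution the additivity is exact:
rng = np.random.default_rng(0)
excitation = rng.standard_normal(64)
tract = rng.standard_normal(64)
speech = np.real(np.fft.ifft(np.fft.fft(excitation) * np.fft.fft(tract)))
```

For a voiced frame, the excitation contributes sharp cepstral peaks at multiples of the pitch period, while the vocal-tract contribution is concentrated at low quefrency; hence the cutoff-time rule stated above.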
A key difference between the Homomorphic filtering of voiced and unvoiced speech is that
in the latter the source and filter components overlap in the low-quefrency region. [7]
Figure 2.3: Homomorphic analyzer
Properties of Homomorphic Filtering [7]
1. It is a non-parametric (transform-based) deconvolution technique.
2. It allows both poles and zeros to be represented.
3. It has wider spurious resonances consistent with the spectral smoothing.
4. It admits a natural interpretation in terms of cepstral liftering.
5. It can provide a minimum-phase or a mixed-phase estimate of the vocal tract impulse response by using the complex cepstrum.
6. Though more "natural" than its counterpart in linear prediction, the resultant sound is sometimes characterized as "muffled".
2.4.3 Formant analysis
Formants are defined by Gunnar Fant as the spectral peaks of the sound spectrum |P (f)|, of
the voice. In speech science and phonetics, formant is also used to mean an acoustic resonance of
the human vocal tract. It is often measured as an amplitude peak in the frequency spectrum of the
sound, using a spectrogram or a spectrum analyzer, though in vowels spoken with a high
fundamental frequency, as in a female or child voice, the frequency of the resonance may lie
between the widely-spread harmonics and hence no peak is visible. In acoustics, it refers to a
peak in the sound envelope and/or to a resonance in sound sources, notably musical instruments,
as well as that of sound chambers. [3]
In the analysis of speech signals, formant parameters are commonly used to characterize the
vocal tract. Due to the movement of articulators during speech production, the formant
parameters vary with time. These variations are usually slow except in the case of certain speech sounds of a highly dynamic nature. The relatively more rapid variations in vocal tract
characteristics due to vocal fold oscillations can be understood by considering the widely used
source filter model for voiced speech. [9]
2.4.4 Analysis of Voice Pitch
Pitch in speech is the relative highness or lowness of a tone as perceived by the ear,
which depends on the number of vibrations per second produced by the vocal cords. Pitch is the
main acoustic correlate of tone and intonation. [12]
Voice pitch analysis is a technique that examines changes in the relative vibration
frequency of the voice to measure emotional response to stimuli. Such analysis can be used to
determine which verbal responses reflect an emotional commitment and which are merely low
involvement responses. Such emotional reactions are measured with audio adapted computer
equipment. [10]
Pitch defines two parameters:
a) Low pitch
b) High Pitch
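A common way to measure pitch, sketched below, is to locate the autocorrelation peak within the plausible range of pitch periods given in Section 2.2.2 (50-500 Hz). The text defines pitch but does not prescribe an estimator, so this Python fragment is an assumed, illustrative method (the thesis's tooling is Matlab):

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate pitch from the autocorrelation peak within the plausible
    lag range (a common estimator, shown for illustration only)."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo = int(fs / fmax)                      # shortest candidate period
    hi = int(fs / fmin)                      # longest candidate period
    lag = lo + int(np.argmax(ac[lo : hi + 1]))
    return fs / lag

fs = 8000
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 200 * t)   # 200 Hz tone: a 40-sample period
```

Tracking this estimate frame by frame yields the pitch contour whose relative rises and falls carry the tone and intonation information discussed above.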
2.5 Speech synthesis
Speech synthesis is the artificial production of human speech. A computer system used for
this purpose is called a speech synthesizer, and can be implemented in software or hardware
products. The quality of a speech synthesizer is judged by its similarity to the human voice and
by its ability to be understood. [5]
Synthesis is the inverse operation of analysis; it is used to obtain the original speech after it has been modified by any of the analysis methods. There are many techniques used for synthesis.
Synthesis can be defined as the process in which a speech decoder generates the speech signal based on the parameters it has received through the transmission line, or it can be a procedure performed by a computer to estimate some kind of representation of the speech signal given a text input. [13]
2.5.1 Synthesizer technologies
The two primary technologies generating synthetic speech waveforms are concatenative
synthesis and formant synthesis. Each technology has strengths and weaknesses, and the
intended uses of a synthesis system will typically determine which approach is used. [5]
2.5.1.1 Concatenative synthesis
Concatenative synthesis is based on the concatenation (or stringing together) of segments
of recorded speech. Generally, concatenative synthesis produces the most natural-sounding
synthesized speech. However, differences between natural variations in speech and the nature of
the automated techniques for segmenting the waveforms sometimes result in audible glitches in
the output. There are three main sub-types of concatenative synthesis. [5]
i. Unit selection synthesis
Unit selection synthesis uses large databases of recorded speech. During database
creation, each recorded utterance is segmented into some or all of the following: individual
phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically,
the division into segments is done using a specially modified speech recognizer set to a "forced
alignment" mode with some manual correction afterward, using visual representations such as
the waveform and spectrogram. An index of the units in the speech database is then created
based on the segmentation and acoustic parameters like the fundamental frequency (pitch),
duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit
selection). This process is typically achieved using a specially weighted decision tree. [5]
Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system, defined in Section 2.5.2, has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g., minor words become unclear) even when a better choice exists in the database. [5]
ii. Diphone synthesis
Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-
sound transitions) occurring in a language. The number of diphones depends on the phonotactics
of the language. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and
the robotic-sounding nature of formant synthesis, and has few of the advantages of either
approach other than small size. As such, its use in commercial applications is declining, although
it continues to be used in research because there are a number of freely available software
implementations. [5]
iii. Domain-specific synthesis
Domain-specific synthesis concatenates prerecorded words and phrases to create
complete utterances. It is used in applications where the variety of texts the system will output is
limited to a particular domain, like transit schedule announcements or weather reports. The
technology is very simple to implement, and has been in commercial use for a long time, in
devices like talking clocks and calculators. The level of naturalness of these systems can be very
high because the variety of sentence types is limited, and they closely match the prosody and
intonation of the original recordings. [5]
2.5.1.2 Formant synthesis
This is the oldest method for speech synthesis, and it dominated the synthesis
implementations for a long time. It is based on the well-known source-filter model which means
to generate periodic and non-periodic source signals and to feed them through a resonator circuit
or a filter that models the vocal tract. The principles are thus very simple, which makes formant synthesis flexible and relatively easy to implement. Also, formant synthesis can be used to
produce any sounds. On the other hand, the simplifications made in the modeling of the source
signal and vocal tract inevitably lead to somewhat unnatural sounding result. [13]
In a crudely simplified implementation, the source signal can be an impulse train or a sawtooth wave, together with a random noise component. To improve the speech quality and to gain better control of the signal, it is naturally advisable to use as accurate a model as possible.
Typically the adjustable parameters include at least the fundamental frequency, the relative
intensities of the voiced and unvoiced source signals, and the degree of voicing. The vocal tract
model usually describes each formant by a pair of filter poles so that both the frequency and the
bandwidth of the formant can be determined. To make intelligible speech, at least three lowest
formants should be taken into account, but including more formants usually improves the speech
quality. The parameters controlling the frequency response of the vocal tract filter and those
controlling the source signal are updated at each phoneme. The vocal tract model can be
implemented by connecting the resonators either in cascade or parallel form. [13]
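The cascade source-filter arrangement just described can be sketched in a few lines. The following is an illustrative Python/NumPy version (the thesis itself works in Matlab); the sampling rate, the 100 Hz fundamental, and the three formant frequency/bandwidth pairs are assumed example values, not taken from the text:

```python
import numpy as np

def resonator_coeffs(freq_hz, bw_hz, fs):
    # One conjugate pole pair: the bandwidth sets the pole radius and
    # the formant frequency sets the pole angle.
    r = np.exp(-np.pi * bw_hz / fs)
    theta = 2.0 * np.pi * freq_hz / fs
    return -2.0 * r * np.cos(theta), r * r   # a1, a2 of 1 + a1*z^-1 + a2*z^-2

def all_pole_filter(x, a1, a2):
    # Direct-form recursion y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

fs = 8000                                      # sampling rate (assumed)
n = np.arange(int(0.05 * fs))                  # 50 ms of signal
source = (n % (fs // 100) == 0).astype(float)  # 100 Hz impulse train (voiced source)

# Cascade three resonators; the formant values are illustrative only.
speech = source
for f, bw in [(730.0, 60.0), (1090.0, 90.0), (2440.0, 120.0)]:
    a1, a2 = resonator_coeffs(f, bw, fs)
    speech = all_pole_filter(speech, a1, a2)
```

Each resonator keeps its pole pair inside the unit circle (radius below one), which is what makes the cascade stable.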
In addition to the resonators that model the formants, the synthesizer can contain filters
that model the shape of the glottal waveform and the lip radiation, and also an anti-resonator to
better model nasalized sounds. [13]
2.5.2 Text to speech
A text-to-speech (TTS) system converts normal language text into speech. Synthesized
speech can be created by concatenating pieces of recorded speech that are stored in a database.
Systems differ in the size of the stored speech units; a system that stores phones or diphones
provides the largest output range, but may lack clarity. For specific usage domains, the storage of
entire words or sentences allows for high-quality output. Alternatively, a synthesizer can
incorporate a model of the vocal tract and other human voice characteristics to create a
completely "synthetic" voice output. [5]
Figure 2.4: Overview of a typical TTS system
2.6 Similar papers and researches
2.6.1 Speech Analysis and Synthesis by Linear Prediction of the Speech Wave [6]
B. S. Atal and Suzanne L. Hanauer presented a method for automatic analysis
and synthesis of speech signals by representing them in terms of time-varying parameters related
to the transfer function of the vocal tract and the characteristics of the excitation.
An important property of the speech wave, namely, its linear predictability, forms the
basis of both the analysis and synthesis procedures. Unlike speech analysis methods based on
Fourier analysis, this method derives the speech parameters from a direct analysis of the speech
wave.
In this paper they describe a parametric model for representing the speech signal in the
time domain, and they discuss methods for analyzing the speech wave to obtain these parameters
and for synthesizing the speech wave from them.
The speech signal is synthesized by a single recursive filter. The synthesizer, thus, does
not require any information about the individual formants and the formants need not be
determined explicitly during analysis.
Moreover, the synthesizer makes use of the formant bandwidths of real speech, in
contrast to formant synthesizers, which use fixed bandwidths for each formant.
Figure 2.5: Block diagram of the speech synthesizer by Linear Prediction
2.6.2 SPECTRAL ENVELOPES AND INVERSE FFT SYNTHESIS [14]
X. Rodet and P. Depalle presented a new additive synthesis method based on spectral
envelopes and the inverse Fast Fourier Transform (FFT⁻¹). User control is facilitated by the use of
spectral envelopes to describe the characteristics of the short term spectrum of the sound in terms
of sinusoidal and noise components. Such characteristics can be given by users or obtained
automatically from natural sounds. Use of the inverse FFT reduces the computation cost by a
factor on the order of 15 compared to oscillators. They propose a low cost real-time synthesizer
design allowing processing of recorded and live sounds, synthesis of instruments and synthesis
of speech and the singing voice.
In usual implementations of additive synthesis, the frequency f_j and the amplitude a_j
of the partials are obtained at each sample by linear interpolation of breakpoint functions of time
which describe the evolution of f_j and a_j versus time. When the number of partials is large,
control by the user of each individual breakpoint function becomes impossible in practice.
Another argument against such breakpoint functions is as follows. In the case of the voice and of
certain instruments, a source filter model is an adequate representation of some of the behavior
of the partials. Then the amplitude of a component is a function of its frequency, i.e. the transfer
function of the filter. That is, the amplitude a_j can be obtained automatically by evaluating some
spectral function.
CHAPTER 3
METHODOLOGY
3.1 INTRODUCTION
In this chapter the spectral analysis with its two types, FFT and LPC, is described
briefly. Some types of voice vocoders are also presented in order to understand their role in
speech quality improvement.
The Matlab implementation is represented as flow chart diagrams in order to
make it easier to capture the idea behind the code; this is illustrated in section 3.5.
3.2 FFT METHOD
3.2.1 Short-Time Frequency Analysis
The conventional mathematical link between an aperiodic time function f(t) and its
complex amplitude density spectrum F(ω) is the Fourier transform pair:
F(ω) = ∫₋∞^∞ f(t) e^(−jωt) dt
f(t) = (1/2π) ∫₋∞^∞ F(ω) e^(jωt) dω
For the transform to exist, ∫₋∞^∞ |f(t)| dt must be finite. Generally, a continuous speech signal
neither satisfies the existence condition nor is known over all time. The signal must consequently
be modified so that its transform exists for integration over known past values. Further, to reflect
significant temporal changes, the integration should extend only over times appropriate to the
quasi-steady elements of the speech signal. Essentially a running spectrum is desired, with real-
time as an independent variable, and in which the spectral computation is made on weighted past
values of the signal. [3]
Such a result can be obtained by analyzing a portion of the signal through a specified
time window, or weighting function. The window is chosen to insure that the product of signal
and window is Fourier transformable. [3]
3.2.2 Choice of the Weighting Function, h(t)
A window function is a mathematical function that has a value only inside a chosen
interval and is zero-valued otherwise. [15]
In speech applications, it usually is desirable for the short-time analysis to discriminate
vocal properties such as voiced and unvoiced excitation, fundamental frequency, and formant
structure. The choice of the analyzing time window h(t) determines the compromise made
between temporal and frequency resolution. [3]
The rectangular window is the simplest window; it is shown in Figure 3.1. It is equivalent
to replacing all but N values of a data sequence by zeros, making it appear as though
the waveform suddenly turns on and off. [15]
It is defined as:
w(n) = 1, 0 ≤ n < N
w(n) = 0, otherwise
Figure 3.1: The rectangular window
Despite its simplicity, the rectangular window has sharp edges, which do not occur in
practice and which lead to discontinuities at the ends.
One way to avoid discontinuities at the ends is to taper the signal to zero or near zero and
hence reduce the mismatch. This can be done using a Hamming window, the most common
window in speech analysis, defined as:
w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n < N
w(n) = 0, otherwise
This is simply a raised cosine and is plotted in Figure 3.2.
Figure 3.2: The hamming window
The Hamming window is an optimized version of the Hanning window, shown in Figure 3.3,
tuned to minimize the maximum (nearest) side lobe, giving that side lobe a height of about
one-fifth that of the Hann window. The Hann raised-cosine window is defined by:
w(n) = 0.5 (1 − cos(2πn/(N − 1))), 0 ≤ n < N
Figure 3.3: The Hanning window
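The three windows compared here are easy to generate and examine numerically. The sketch below (Python/NumPy rather than the thesis' Matlab) builds each window from its defining formula and measures the peak side-lobe level of its spectrum, which is the property the side-lobe discussion is about; the window length and FFT size are assumed values:

```python
import numpy as np

N = 256
n = np.arange(N)
rect = np.ones(N)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
hann = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))

def peak_sidelobe_db(w):
    # Magnitude spectrum, zero-padded, normalized so the main lobe is 0 dB.
    W = np.abs(np.fft.rfft(w, 8192))
    W = 20 * np.log10(W / W.max() + 1e-12)
    # Walk down the main lobe to its first null, then take the largest
    # value among the remaining (side) lobes.
    i = 1
    while i < len(W) and W[i] <= W[i - 1]:
        i += 1
    return W[i:].max()
```

For these definitions the rectangular window's nearest side lobe sits around −13 dB, the Hann window's around −31 dB, and the Hamming window's around −43 dB, consistent with the "about one-fifth" relation quoted above.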
3.3 LPC METHOD
LPC (Linear Predictive Coding) is a method to represent and analyze human speech. The
idea of coding human speech is to change the representation of the speech: with LPC, the speech
is represented by LPC coefficients and an error signal instead of the original speech signal. The
LPC coefficients are found by LPC estimation, which describes the inverse transfer function of
the human vocal tract.
This method is used to successfully estimate basic speech parameters like pitch, formants
and spectra. The principle behind the use of LPC is to minimize the sum of the squared
differences between the original speech signal and the estimated speech signal over a finite
duration. This could be used to give a unique set of predictor coefficients. These predictor
coefficients are normally estimated every frame.
Both speech analysis and synthesis are based on modeling the vocal tract as a linear
all-pole (IIR) filter having the system transfer function:
H(z) = G / (1 − Σ_{k=1}^{p} a_k z^(−k))
where G is the filter gain and the a_k are the parameters that determine the poles. [1]
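Given the coefficients a_k and the gain G, the magnitude response of this all-pole model can be evaluated directly on the unit circle. The following Python/NumPy sketch is illustrative (the thesis works in Matlab); the single pole pair at radius 0.95 and angle π/4 is a hand-picked assumption so the resonance is easy to verify:

```python
import numpy as np

def all_pole_response(a, G, n_points=512):
    # |H(e^{jw})| for H(z) = G / (1 - sum_k a_k z^{-k}), with a = [a_1 ... a_p].
    w = np.linspace(0.0, np.pi, n_points)
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-1j * np.outer(w, k)) @ np.asarray(a, dtype=float)
    return w, G / np.abs(A)

# One resonance: conjugate pole pair at radius 0.95, angle pi/4 (assumed).
r, theta = 0.95, np.pi / 4
a = [2.0 * r * np.cos(theta), -(r * r)]   # denominator 1 - a1*z^-1 - a2*z^-2
w, H = all_pole_response(a, G=1.0)
```

The magnitude peaks near ω = π/4, the pole angle, which is how each formant shows up as a peak in the LPC spectrum.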
The two most commonly used methods to compute the coefficients are the covariance
method and the autocorrelation formulation. The autocorrelation formulation was used in this
project because it is superior to the covariance method in the sense that the roots of the
polynomial in the denominator of the above equation are always guaranteed to be inside the unit
circle, hence guaranteeing the stability of the system H(z). The Levinson-Durbin recursion was
utilized to compute the required parameters for the autocorrelation method. The block diagram
of a simplified model for speech production can be seen in Figure 3.4. [1]
Figure 3.4: Simplified model of speech production
Two mutually exclusive excitation functions are used to model voiced and unvoiced
speech sounds. On a short-time basis, voiced speech is considered periodic with a fundamental
frequency F₀ and a pitch period 1/F₀, which depends on the speaker. Hence, voiced speech is
generated by exciting the all-pole filter model with a periodic impulse train. On the other hand,
unvoiced sounds are generated by exciting the all-pole filter with the output of a random noise
generator. [1]
The parameters of the all-pole filter model are determined from the speech samples by
means of linear prediction. Specifically, the output of the linear prediction filter is:
ŝ(n) = − Σ_{k=1}^{p} a_p(k) s(n − k)
and the corresponding error between the observed sample s(n) and the predicted value ŝ(n) is:
e(n) = s(n) − ŝ(n)
The pole parameters a_p(k) of the model are determined by minimizing the sum of the
squared errors, which leads to the normal equations:
Σ_{k=1}^{p} a_p(k) r_ss(m − k) = −r_ss(m),  m = 1, 2, …, p
where r_ss(m) represents the autocorrelation of the sequence s(n), defined as:
r_ss(m) = Σ_{n=0}^{N} s(n) s(n + m)
The equation above can be expressed in matrix form as:
R_ss a = −r_ss
where R_ss is the autocorrelation matrix, r_ss is the autocorrelation vector, and a is the vector of
model parameters. [1]
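These normal equations can be set up and solved numerically. The Python/NumPy sketch below follows the sign convention used here (prediction ŝ(n) = −Σ a_k s(n−k), so R a = −r) and also evaluates the residual energy as the squared gain; the second-order test signal is an assumed example, not taken from the thesis:

```python
import numpy as np

def autocorr(s, max_lag):
    s = np.asarray(s, dtype=float)
    return np.array([np.dot(s[:len(s) - m], s[m:]) for m in range(max_lag + 1)])

def lpc_normal_equations(s, p):
    # Solve R_ss a = -r_ss (Toeplitz system built from the autocorrelation).
    r = autocorr(s, p)
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:p + 1])
    # Residual energy G^2 = r(0) + sum_k a_k r(k) under this convention.
    g2 = r[0] + float(np.dot(a, r[1:p + 1]))
    return a, g2

# Synthetic AR(2) test signal: s[n] = 0.9 s[n-1] - 0.2 s[n-2] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
s = np.zeros(4000)
for n in range(2, 4000):
    s[n] = 0.9 * s[n - 1] - 0.2 * s[n - 2] + e[n]

a, g2 = lpc_normal_equations(s, p=2)
```

With this convention the recovered coefficients are approximately a ≈ (−0.9, 0.2), the negatives of the AR coefficients, and G² comes out close to the innovation energy, i.e. much smaller than the signal energy r(0).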
These equations can be solved in MATLAB by using the Levinson-Durbin algorithm. The gain
parameter of the filter can be obtained from the input-output relationship:
s(n) = − Σ_{k=1}^{p} a_p(k) s(n − k) + G x(n)
where x(n) represents the input sequence.
Rearranging this equation, the scaled input can be written in terms of the error sequence:
G x(n) = s(n) + Σ_{k=1}^{p} a_p(k) s(n − k) = e(n)
Then,
G² Σ_{n=0}^{N−1} x²(n) = Σ_{n=0}^{N−1} e²(n)
If the input excitation is normalized to unit energy by design, then
G² = Σ_{n=0}^{N−1} e²(n) = Σ_{k=1}^{p} a_p(k) r_ss(k) + r_ss(0)
where G² is set equal to the residual energy resulting from the least-squares optimization.
Once the LPC coefficients are computed, the input speech frame is classified as voiced or
unvoiced, and if it is indeed voiced, the pitch is determined. The pitch is determined by
computing the following sequence in Matlab:
r_e(n) = Σ_{k=1}^{p} r_a(k) r_ss(n − k)
where r_a(k) is defined as follows:
r_a(k) = Σ_{i=1}^{p−k} a_p(i) a_p(i + k)
This is the autocorrelation sequence of the prediction coefficients. The pitch is detected by
finding the peak of the normalized sequence r_e(n)/r_e(0) in the time interval corresponding to
3 to 15 ms in the 20 ms sampling frame. If the value of this peak is at least 0.25, the frame of
speech is considered voiced, with a pitch period equal to the value n = N_p at which
r_e(N_p)/r_e(0) is a maximum. If the peak value is less than 0.25, then the speech frame is
considered unvoiced and the pitch is set equal to zero. [1]
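The voiced/unvoiced decision and the 0.25 threshold can be sketched directly. For simplicity this Python/NumPy version (the thesis uses Matlab) applies the normalized autocorrelation to the frame itself rather than to the r_e(n) sequence built from the prediction coefficients; the sampling rate and test signals are assumptions for illustration:

```python
import numpy as np

def detect_pitch(frame, fs, lo_ms=3.0, hi_ms=15.0, threshold=0.25):
    # Normalized autocorrelation of the frame, searched over the lag
    # range corresponding to 3-15 ms; a peak >= 0.25 means voiced.
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    r = r / r[0]
    lo = int(lo_ms * fs / 1000.0)
    hi = min(int(hi_ms * fs / 1000.0), len(r) - 1)
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    return lag if r[lag] >= threshold else 0   # 0 means unvoiced

fs = 8000
t = np.arange(int(0.02 * fs)) / fs               # one 20 ms frame
voiced = np.sign(np.sin(2 * np.pi * 125.0 * t))  # 125 Hz -> 64-sample period
rng = np.random.default_rng(1)
unvoiced = rng.standard_normal(480)              # white-noise frame
```

In a quick check, the square-wave frame is detected as voiced with a 64-sample pitch period, while the white-noise frame falls below the threshold and is reported as unvoiced.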
The value of the LPC coefficients, the pitch period, and the type of excitation are then
transmitted to the receiver. The decoder synthesizes the speech signal by passing the proper
excitation through the all pole filter model of the vocal tract.
Typically the pitch period requires 6 bits, the gain parameter is represented in 5 bits
after its dynamic range is compressed logarithmically, and the prediction coefficients normally
require 8-10 bits each for accuracy reasons. This accuracy is very important in LPC because
small changes in the prediction coefficients result in large changes in the pole positions of the
filter model, which can cause instability in the model. [1]
If the speech frame is decided to be voiced, an impulse train is employed to represent it,
with nonzero taps occurring every pitch period. A pitch-detection algorithm is used in order to
determine the correct pitch period/frequency; the autocorrelation function is used to estimate
the pitch period. However, if the frame is unvoiced, then white noise is used to represent it and a
pitch period of T = 0 is transmitted. Therefore, either white noise or an impulse train becomes
the excitation of the LPC synthesis filter.
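The excitation choice reduces to a tiny helper. This is an illustrative Python/NumPy sketch (the frame length, the unit impulse amplitude, and the fixed noise seed are assumed details):

```python
import numpy as np

def make_excitation(n_samples, pitch_period, rng=None):
    # pitch_period > 0: voiced -> impulse train with a tap every period.
    # pitch_period == 0: unvoiced -> white noise.
    if pitch_period > 0:
        e = np.zeros(n_samples)
        e[::pitch_period] = 1.0
        return e
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.standard_normal(n_samples)

voiced_exc = make_excitation(160, 64)   # 20 ms frame at 8 kHz, 125 Hz pitch
unvoiced_exc = make_excitation(160, 0)
```

Either sequence is then fed through the all-pole synthesis filter to produce the output frame.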
3.3.1 The Levinson–Durbin Algorithm
The Levinson-Durbin recursion is a procedure in linear algebra that recursively calculates
the solution to an equation involving a Toeplitz matrix. It solves Ax = b, in which A is a Toeplitz
matrix, symmetric and positive definite, and b is an arbitrary vector. Durbin published a slightly
more efficient algorithm, known as the Levinson-Durbin recursive algorithm, which requires a
special form of b in which b consists of elements of A. [11]
Let a_k(m) be the kth coefficient for a particular frame in the mth iteration. The Levinson-
Durbin algorithm solves the following set of ordered equations recursively, for m = 1, 2, …, p:
k(m) = [R(m) − Σ_{k=1}^{m−1} a_k(m−1) R(m−k)] / E(m−1)
a_m(m) = k(m)
a_k(m) = a_k(m−1) − k(m) a_{m−k}(m−1),  1 ≤ k < m
E(m) = (1 − k(m)²) E(m−1)
where initially E(0) = R(0) and a(0) = 0. At each iteration, the coefficients a_k(m) for
k = 1, 2, …, m describe the optimal mth-order linear predictor, and the minimum error E(m) is
reduced by the factor (1 − k(m)²). Since E(m), a squared error, is never negative, |k(m)| ≤ 1.
This condition on the reflection coefficient k(m) also guarantees that the roots of A(z) will be
inside the unit circle, so the LP synthesis filter H(z) (where H(z) = 1/A(z)) will be stable.
Therefore, the autocorrelation method guarantees the stability of the filter. [1]
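The recursion can be written down almost verbatim. This Python/NumPy sketch follows the standard Durbin form (predictor convention ŝ(n) = Σ a_k s(n−k), so the system solved is R a = r); the test autocorrelation values come from a simple first-order model and are an assumption for illustration:

```python
import numpy as np

def levinson_durbin(R, p):
    # R = [R(0), R(1), ..., R(p)]. Returns the predictor coefficients
    # a_1..a_p, the reflection coefficients k(m), and the final error E(p).
    a = np.zeros(p + 1)        # a[0] unused so indices match the math
    E = float(R[0])
    ks = []
    for m in range(1, p + 1):
        acc = R[m] - np.dot(a[1:m], R[m - 1:0:-1])  # R(m) - sum a_k R(m-k)
        k = acc / E
        ks.append(k)
        a_new = a.copy()
        a_new[m] = k
        a_new[1:m] = a[1:m] - k * a[m - 1:0:-1]     # order-update step
        a = a_new
        E *= (1.0 - k * k)
    return a[1:], np.array(ks), E

# Autocorrelation of the AR(1) process s[n] = 0.5 s[n-1] + e[n]:
# R(m) = 0.5^m / (1 - 0.25)   (unit-variance innovation, assumed example)
R = np.array([1.0, 0.5, 0.25]) / 0.75
a, ks, E = levinson_durbin(R, p=2)
```

The recursion recovers a ≈ (0.5, 0), every reflection coefficient satisfies |k(m)| ≤ 1, and the final error E equals the unit innovation energy, consistent with the stability argument above.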
3.3.2 Linear Predictive Coding (LPC) Vocoder
The LPC vocoder consists of two parts, an encoder and a decoder, as shown in Figure 3.5.
At the encoder, a speech signal is divided into short-time segments, and each segment is
analyzed. The filter coefficients a_p and the gain G are determined by the LPC analyzer shown
in Figure 3.6. The pitch detector detects whether the sound is voiced or unvoiced. If the
sound is voiced, the pitch period is determined by the pitch detector. Then, all parameters are
encoded into a binary sequence. [1]
At the decoder, the transmitted data is decoded and the signal generators generate
excitation signals, periodic pulses or white noise, depending on the voiced or unvoiced decision.
This excitation signal goes through the autoregressive (AR) model H(z) with a_p and G as the
filter parameters, and then a synthesized speech signal is produced at the output of the filter. [1]
Figure 3.5: Encoder and decoder for the LPC vocoder
Figure 3.6: LPC analyzer
3.3.3 VOICE EXCITED LINEAR PREDICTION (VELP)
The main idea behind the voice-excitation is to avoid the imprecise detection of the pitch
and the use of an impulse train while synthesizing the speech. One should rather try to come up
with a better estimate of the excitation signal. Thus the input speech signal in each frame is
filtered with the estimated transfer function of the LPC analyzer. This filtered signal is called the
residual. If this signal is transmitted to the receiver, one can achieve very good quality. The
tradeoff, however, is a higher bit rate, although there is no longer a need to transmit the
pitch frequency and the voiced/unvoiced information. [1]
Figure 3.7: Block diagram of voice-excited LPC Vocoders
To achieve a high compression rate, the discrete cosine transform (DCT) of the residual
signal can be employed. The DCT concentrates most of the energy of the signal in the first
few coefficients. Thus one way to compress the signal is to transfer only the coefficients that
contain most of the energy. [1]
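The idea of keeping only the first few DCT coefficients of the residual can be sketched as follows. This Python/NumPy version builds an orthonormal DCT-II matrix by hand rather than calling a toolbox routine; the smooth test signal and the number of retained coefficients are assumptions for illustration:

```python
import numpy as np

def dct_matrix(N):
    # Orthonormal DCT-II basis: rows are cosine vectors, so C @ C.T = I.
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    C[0] /= np.sqrt(2.0)
    return C

def compress_residual(residual, keep):
    # Transform, zero all but the first `keep` coefficients, invert.
    C = dct_matrix(len(residual))
    X = C @ residual
    X[keep:] = 0.0
    return C.T @ X          # inverse transform (C is orthogonal)

t = np.linspace(0.0, 1.0, 128)
residual = np.exp(-5.0 * t) * np.cos(2.0 * np.pi * 3.0 * t)  # smooth example
approx = compress_residual(residual, keep=32)
```

Because the DCT packs the energy of a smooth signal into its leading coefficients, the 32-coefficient reconstruction stays close to the original, while keep equal to the full length reproduces it exactly (up to rounding).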
3.3.4 RESIDUAL-EXCITED LINEAR PREDICTIVE (RELP) VOCODER
The RELP vocoder uses LPC analysis for vocal tract modeling. Linear prediction error
(residual) signals are used for the excitation. There is no voiced/unvoiced detection or pitch
detection required. The RELP vocoder, which Un et al. proposed, encodes speech between 6 and
9.6 kbps depending on the quality of the synthesized speech desired. [1]
Using the residual signals as the excitation improves the quality of the synthesized speech
and makes it more natural than the basic LPC vocoders, because there are no miscalculations of
voiced/unvoiced sounds or of pitch. The excitation signals of the RELP vocoder are very close
to the ones the vocal tract produces. In contrast, the excitation signals (periodic pulses) of the
basic LPC vocoder are completely artificial. However, the total encoding rate of the RELP
vocoder is larger than that of most other LPC-based vocoder systems. The RELP vocoder needs
to encode a sequence of residual samples per segment, which is a large volume of data, whereas
only several bits are needed to encode the voiced/unvoiced decision, pitch, and gain in the other
LPC systems.
3.4 MATLAB IMPLEMENTATION
The spectral analysis and synthesis were implemented using Matlab GUI code to perform
both FFT and LPC analysis and synthesis. Figures 3.8, 3.9, 3.10, and 3.11 show flow charts
which give a brief representation of these techniques.
Figure 3.8: Flow chart of the FFT analysis
CHAPTER 4
IMPLEMENTATION AND RESULTS
4.1 INTRODUCTION
This chapter shows how the Matlab code was implemented using the GUI (built with
GUIDE, Matlab's graphical user interface development environment). A brief description of the
GUI is given in Appendix A, and the code is presented in Appendix B.
The results obtained from the Matlab code are introduced and discussed. A brief
comparison between the FFT and LPC techniques is also given in this chapter.
4.2 MATLAB IMPLEMENTATION USING GUI:
The analysis and synthesis were implemented in Matlab GUI code; the GUI was named
SAS (Speech Analysis and Synthesis) and is shown in Figure 4.1.
Figure 4.1: GUI interface in Matlab "SAS"
Once the SAS interface is opened, a speech signal can be recorded by clicking on the
"Record" button; it is then automatically saved as a wave file (.wav) to a folder specified by the
user, who also types a name for the speech signal. SAS allows specifying the time length of the
recorded signal by entering the desired length in the edit text named "Enter your recording time
period _it is by default 5 seconds"; if the user does not specify the time length, it defaults to
5 seconds.
As an example, a signal 2 seconds long was recorded. The speech is the voice of a woman
saying "Hello", saved on the computer as "Hello.wav". The sampling frequency is 44100 Hz.
The signal and its spectrogram, which shows how the spectral density of the signal varies with
time, are shown below in Figure 4.2 and Figure 4.3 respectively.
Figure 4.2: The wave form of the recorded signal
In the spectrogram the areas of high energy concentration are shown in red. The color
intensity shows the magnitude of the short-time Fourier transform.
As can be seen from the signal plot, there is silence and background noise at the start and
end of the speech. The actual speech starts at around 0.09 seconds and ends at around
1.2 seconds. However, every frame was analyzed in order to make the system flexible for any
other speech signal.
The "play" button allows retrieving any saved wave file; once the file is chosen, its
waveform and spectrogram are displayed. The "exit" button is a push button that is pressed in
order to close the SAS window.
Figure 4.3: The spectrogram of speech signal under consideration
4.3 SPEECH ANALYSIS AND SYNTHESIS TECHNIQUES:
The spectral analysis was implemented in the SAS code with its two types, FFT and
LPC. A popup menu was used to list the two types and to allow the user to choose one of them,
as shown in Figure 4.4.
Figure 4.4: GUI interface in Matlab "SAS" with a popup menu
4.3.1 Short-time Fourier transform:
Once the entry "Short time Fourier transforms" is chosen from the popup menu,
another window appears asking the user to select a wave file to be analyzed and synthesized
with the short-time Fourier transform. For example, if the signal "warn4.wav", already saved on
the computer and containing a woman's voice, is selected, then the length of the frames must be
specified; say the number 20 is entered in the edit text field named "Enter the frame size
length", meaning the signal will be divided into equal-size frames, each 20 ms long.
SAS displays the original signal, as shown in Figure 4.5, and then uses the waterfall
function to draw a 3D graph of the short-time Fourier transform of each frame of the signal, as
shown in Figure 4.6.
Figure 4.5: The original speech signal "warn4.wav"
Figure 4.6: 3D representation of the short time Fourier transform for each frame
Then it synthesizes the analyzed signal and displays the signal after it has been compressed,
as shown in Figure 4.7.
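The frame-by-frame FFT analysis and synthesis just described reduces to a few lines. This Python/NumPy sketch (the thesis uses Matlab) uses non-overlapping 20 ms rectangular frames and an assumed sine test signal; when no coefficients are discarded the round trip is exact:

```python
import numpy as np

def analyze_frames(signal, frame_len):
    # Split into non-overlapping rectangular-window frames, FFT each one.
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames, axis=1)

def synthesize_frames(spectra, frame_len):
    # Inverse FFT per frame, then concatenate the frames again.
    return np.fft.irfft(spectra, n=frame_len, axis=1).ravel()

fs = 8000
frame_len = int(0.020 * fs)            # 20 ms -> 160 samples
t = np.arange(fs) / fs                 # one second of signal
signal = np.sin(2.0 * np.pi * 440.0 * t)
spectra = analyze_frames(signal, frame_len)
restored = synthesize_frames(spectra, frame_len)
```

Compression then amounts to keeping only part of each frame's spectrum before the inverse transform, at the cost of some reconstruction error.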
Different types of windows could be used. The rectangular window's side lobe peaks do
not decrease very much from the peak of the main lobe when compared with the other window
types under consideration, and the side lobes show a somewhat constant level. Therefore, it has
more gain, because it does not attenuate the signals other than the ones at the center of the
analysis window; i.e., the signals outside of the main lobe are not attenuated very much.
The Hamming window's side lobe peaks are lower in magnitude relative to its main lobe
peak than those of the Hanning and rectangular windows, but they remain somewhat constant
after that. Therefore, they do not increasingly attenuate the signals outside the main lobe, and
the Hamming window has the second largest gain.
The Hanning window's first side lobes are greater in magnitude than the first side lobes
of the Hamming window, but this holds for the beginning of the side lobes only. After that, the
side lobes tend to decrease in magnitude, attenuating the signals outside and farther from the
main lobe more and more. This characteristic of the Hanning window causes its gain to be lower
than that of the rectangular and Hamming windows. Here, in the SAS code, the rectangular
window was used due to its simplicity and its high gain.
Figure 4.7: Compressed speech signal using the FFT
4.3.2 Linear predictive coding:
Linear predictive coding can be chosen from the popup menu of SAS, and then a wave
file is selected; "warn4.wav" is assumed selected, as before. SAS performs the LPC analysis
according to the following specifications:
Frame size: 30 milliseconds
LPC method: autocorrelation
LPC prediction order: 13
Frame time increment: 20 milliseconds (i.e. the LPC analysis is done starting every
20 ms in time).
The LPC prediction order must be larger than 10 for any voiced signal in order to ensure
a good analysis of the signal and a better synthesis from it.
The input signal is organized into N analysis frames and saved in a matrix having N
columns and window-size rows. Each frame contains window-size samples and overlaps the
next frame by the difference between the window size and the analysis step size.
For each frame, corresponding N-column matrices are derived containing:
The LPC predictor coefficients (using the Matlab function levinson).
The predictor gain G (computed from the LPC coefficients and the frame autocorrelation).
The linear prediction error (residual), shown in Figure 4.8.
Pitch (a frame-by-frame estimate of the pitch of the signal, calculated by finding the peak
in the residual's autocorrelation for each frame).
Parcor (the parcor coefficients give the ratio between adjacent sections in a tubular
model of the speech articulators; there are L parcor coefficients for each frame of speech).
Figure 4.8: The residual signal generated from LPC analyzer
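A stripped-down version of this per-frame pipeline (framing, autocorrelation, predictor coefficients, gain, residual; pitch and parcor omitted) can be sketched in Python/NumPy. The frame sizes follow the specifications above, while the noisy sine test signal and the tiny regularization term are assumptions for illustration:

```python
import numpy as np

def lpc_frame_analysis(x, frame_len, hop, p):
    # Convention here: s_hat[n] = sum_k a_k s[n-k], so the residual is
    # e[n] = s[n] - sum_k a_k s[n-k], i.e. filtering by [1, -a_1, ..., -a_p].
    n_frames = 1 + (len(x) - frame_len) // hop
    coeffs = np.zeros((n_frames, p))
    gains = np.zeros(n_frames)
    residual = np.zeros((n_frames, frame_len))
    for i in range(n_frames):
        f = np.asarray(x[i * hop:i * hop + frame_len], dtype=float)
        r = np.array([np.dot(f[:frame_len - m], f[m:]) for m in range(p + 1)])
        R = np.array([[r[abs(j - k)] for k in range(p)] for j in range(p)])
        a = np.linalg.solve(R + 1e-9 * np.eye(p), r[1:])   # regularized solve
        e = np.convolve(f, np.concatenate(([1.0], -a)))[:frame_len]
        coeffs[i], residual[i] = a, e
        gains[i] = np.sqrt(np.dot(e, e))   # per-frame residual energy
    return coeffs, gains, residual

fs = 8000
rng = np.random.default_rng(2)
t = np.arange(4000) / fs
x = np.sin(2.0 * np.pi * 200.0 * t) + 0.05 * rng.standard_normal(len(t))
coeffs, gains, residual = lpc_frame_analysis(x, int(0.03 * fs), int(0.02 * fs), p=13)
```

Each 30 ms frame yields 13 predictor coefficients, a gain, and a residual frame; for this nearly periodic input the residual energy is a small fraction of the frame energy, which is exactly the redundancy LPC removes.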
Short time signal processing is usually done using windowing. Frames are windowed to
improve the frequency domain representation. The LPC predictor coefficients are found from the
windowed signal using the Matlab function levinson. Then LPC is used for transmitting
information of the spectral envelope. In LPC analysis, sound is assumed to be a result of an all
pole filter applied to a source with flat spectrum.
The predictor gain is found using the LPC coefficients and the autocorrelation, i.e. the
cross-correlation of a signal with itself, which describes the redundancy in the signal. The
autocorrelation method is not very accurate, but it guarantees stability.
The predictor gain G increases as the length of the analysis window increases. The length
of the analysis window specifies how much signal is used for the calculation at each step; it
affects both the gain value and the time needed to obtain that value. Hence, using a large
window size, we can get a large gain and an accurate reproduction of the given input, but it also
introduces delay and computational complexity.
For the "warn4.wav" voice file assumed selected, three types of LPC vocoder were
implemented to compress the speech signal and enhance the quality of the compressed speech;
the output of the three vocoders is shown below in Figure 4.9, along with the waveform of the
original signal.
The "warn4.wav" file is saved as a vector of size 88200 × 1, but after LPC analysis only
its LPC parameters are transmitted, and these have a much smaller size than the original signal,
as illustrated in Table 4.1.
Parameter | Size
The LPC predictor coefficients | 14 × 99
PARCOR | 13 × 99
Pitch | 1 × 99
The residual signal | 1323 × 99
The gain | 1 × 99
Table 4.1: The matrix sizes of the different LPC parameters in the case of the "Hello.wav" signal
In addition to the compression obtained from the reduction in coefficient size, the
voice-excited and residual voice-excited methods reduce the number of bits using a quantization
method applied after the analysis to reduce the number of bits in the residual signal.
Figure 4.9: The resulting LPC compressed signals along with the original signal
SAS allows listening to the resulting voices in order to compare them; it shows the
following messages in the edit text named "the output text":
"Press any key to play the original sound file"
"Press any key to play the LPC compressed file!"
"Press a key to play the voice-excited LPC compressed sound!"
"Press a key to play the randomly-excited LPC compressed sound"
The corresponding sound is then played when a key is pressed. The voice-excited LPC
compressed sound had the best quality of the three, and the plain LPC compressed sound had
better quality than the randomly-excited compressed speech. The residual-excited LPC, on the
other hand, has the lowest bandwidth, because it generates the residual signal internally and
does not need to determine whether the speech signal is voiced or unvoiced.
Nonetheless, the voice-excited LPC used here gives understandable results even though it
is not optimized. The tradeoffs between quality on one side and bandwidth and complexity on
the other clearly appear here, since the voice-excited LPC gives fairly good results within the
limitations of this project.
The difference between the three types lies in the nature of the residual signal used to
synthesize the compressed speech. The plain LPC uses the residual signal generated by the
LPC analysis function, shown in Figure 4.8. The voice-excited LPC takes the discrete cosine
transform (DCT) of the residual signal and then adds some noise to it before using it in the
synthesis part; the new residual signal is shown in Figure 4.10 below.
Figure 4.10: The residual signal used in the voice-excited LPC after the DCT transform and
added noise
The randomly-excited LPC compressed signal was obtained by defining the residual signal
as a random signal, like that shown in Figure 4.11.
Figure 4.11: Random residual signal
4.4 COMPARISON BETWEEN FFT AND LPC TECHNIQUES:
The two methods for spectral analysis, the fast Fourier transform (FFT) and linear
prediction (LPC), were implemented. FFT finds the energy distribution in the actual speech
sound, whereas LPC estimates the vocal tract filter that shaped that speech.
The advantage of FFT is its easier setup; the disadvantage is the difficulty of identifying
formants for speakers with higher-pitched voices. LPC has better success with high-pitched
voices, but its settings need to be carefully tuned for each speaker. Hence LPC is generally
more successful than FFT at analyzing the spectral properties of female or child speech, since it
is not sensitive to higher voice fundamental frequencies.
The FFT is suitable for steady-state signals with long waveform data lengths; the LPC,
on the other hand, is suitable for transient signals with short waveform data lengths. [18]
CHAPTER 5
CONCLUSION AND FUTURE WORK
5.1 CONCLUSION
The speech signal was compressed using both the FFT and LPC methods, implemented
in Matlab code presented in Appendix B. The LPC method was implemented using three types
of vocoders: the plain LPC vocoder, the voice-excited vocoder, and the residual-excited vocoder.
It was found that a tradeoff between the quality of the speech signal and the size of the
bandwidth must be made among the three types depending on the application in use: the
voice-excited LPC has the best-quality compressed sound, while the residual-excited LPC has
the smallest bandwidth. A brief comparison between the two techniques, FFT and LPC, was
also introduced in Chapter 4, showing that each of them has its own advantages and
disadvantages.
The degree of compression obtained is highly efficient, as seen in Chapter 4, and
contains two levels of compression: one in the size of the parameters and the other performed
using quantization. Unfortunately, direct quantization of the speech signal may degrade the
quality of the speech signal to an unacceptable degree.
The recovery of the original signal from the analyzed signal is not perfect due to many
factors. One of these is that, in order to perform a good analysis, the signal was divided into
frames; however, some overlapping occurred between frames, which is difficult to account for
exactly in the synthesized signal.
The SAS GUI was implemented to make it easy to deal with the code, even for those who
are not familiar with Matlab, and to obtain the analysis and synthesis results with less effort.
5.2 FUTURE WORK
The Matlab code could be extended to include other analysis and synthesis techniques,
such as homomorphic techniques, formant analysis and synthesis, and the other techniques
discussed in Chapter 2.
The GUI SAS could be improved into an advanced voice toolbox like the famous
COLEA toolbox, for example. Some additional features could also be added to the GUI SAS in
order to give the user more control. For example, the user could be allowed to choose the type
of windowing in the FFT analysis from options such as the Hamming window, the Hanning
window, or the rectangular window; the latter is the default windowing used in GUI SAS.
In the future, the results of this project could be implemented as a voice toolbox in
communication systems to improve and control the quality of speech signals and provide
more bandwidth for network operators.
REFERENCES
[1] Wai C. Chu, "Speech Coding Algorithms: Foundation and Evolution of Standardized
Coders", Mobile Media Laboratory, DoCoMo USA Labs, San Jose, California.
[2]L. R. Rabiner and R. W. Schafer, “Theory and Application of Digital Speech Processing Preliminary
Edition”.
[3] James L. Flanagan, Jont B. Allen, and Mark A. Hasegawa-Johnson, "Speech Analysis
Synthesis and Perception", Third Edition.
[4] Source–filter model of speech production, Wikipedia.org. Article at:
http://en.wikipedia.org/wiki/Source–filter_model_of_speech_production
Last seen: 10 July 2013
[5] Speech synthesis, Wikipedia.org. Article at: http://en.wikipedia.org/wiki/Speech_synthesis
Last seen: 10 July 2013
[6] B. S. Atal and Suzanne L. Hanauer, "Speech Analysis and Synthesis by Linear Prediction of
the Speech Wave", Bell Telephone Laboratories, Incorporated, Murray Hill, New Jersey 07974.
[7] Homomorphic Processing of Speech, article at:
http://isites.harvard.edu/fs/docs/icb.topic541812.files/lec11_spr09.pdf
Last seen: 10 July 2013
[8] Alan V. Oppenheim, "Speech Analysis-Synthesis System Based on Homomorphic Filtering".
[9] Speech formant frequency estimation: evaluating a nonstationary analysis method, Preeti Rao, A. Das
Barman
[10] Pitch analysis, Article at:
http://design-marketing-dictionary.blogspot.com/2009/11/voice-pitch-analysis.html
Last seen: 10 July 2013
[11] Levinson-Durbin algorithm, Wikipedia.org. Article at:
http://en.wikipedia.org/wiki/Levinson-Durbin_algorithm
Last seen: 11 July 2013
[12] Pitch, britannica.com, article at:
http://www.britannica.com/EBchecked/topic/1357164/pitch
Last seen: 11 July 2013
[13] Speech Synthesis, article at:
http://www.cs.tut.fi/courses/SGN-4010/puhesynteesi_en.pdf
Last seen: 11 July 2013
[14] X. Rodet and P. Depalle, "Spectral Envelopes and Inverse FFT Synthesis".
[15] Window function, Wikipedia.org, article at:
http://en.wikipedia.org/wiki/Window_function
Last seen: 12 July 2013
[16] Vocoder, Article at: http://wiki.radioreference.com/index.php/Vocoder
Last seen: 12 July 2013
[17] C. K. UN and D. T. Magill, "The Residual-Excited Linear Prediction Vocoder with Transmission
Rate Below 9.6 Kbits/s," IEEE Transactions on Communications, Vol. COM-23, No. 12, pp. 1466-1474,
1975.
[18] LPC SPECTRUM ANALYSIS, Article at:
http://www.soundid.net/SoundID/Papers/SpectrumAnalyzerPDF.pdf
Last seen: 12 July 2013
[19] MATLAB, mathworks.com, "Creating Graphical User Interfaces".
Last seen 16 July 2013
APPENDIX A GUI
A1
APPENDIX A
GRAPHICAL USER INTERFACE (GUI)
A graphical user interface (GUI) is a graphical display in one or more windows
containing controls, called components, that enable a user to perform interactive tasks.
The user of a GUI does not have to create a script or type commands at the command
line, and need not understand the details of the code. [19]
MATLAB GUIs can be built in two ways:
1. Using GUIDE (GUI Development Environment), an interactive GUI construction kit.
2. Creating code files that generate GUIs as functions or scripts (programmatic GUI
construction).
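A minimal example of the second, programmatic approach (an illustration only, not part of GUI SAS) creates the figure and its components directly from code:

```matlab
% Minimal programmatic GUI: a figure with one button that prints a message.
fig = figure('Name','Programmatic GUI demo','NumberTitle','off');
uicontrol(fig, 'Style','pushbutton', ...
               'String','Press me', ...
               'Position',[20 20 100 30], ...
               'Callback',@(src,evt) disp('Button pressed'));
```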
The code in Appendix B was written using GUIDE. When a GUI is saved,
GUIDE creates two files: a FIG-file and a code file. The FIG-file, with extension .fig, is a
binary file that contains a description of the layout. The code file, with extension .m,
contains the MATLAB functions that control the GUI.
The Layout Editor, shown in Figure A.1, is used to select components from
the component palette, which resides at the left side of the Layout Editor, and to arrange
them in the layout area. Some of the tools shown in Figure A.1 are listed in
Table A.1 along with their uses. [19]
Tool                 Its use
Figure Resize Tab    Sets the size at which the GUI is initially displayed when it is run.
Menu Editor          Creates menus and context menus.
Align Objects        Aligns and distributes groups of components. Grids and rulers also
                     help align components on a grid, with an optional snap-to-grid
                     capability.
Property Inspector   Sets the properties of the components in the GUI layout. It provides
                     a list of all the properties that can be set and displays their
                     current values.
Object Browser       Displays a hierarchical list of the objects in the GUI.
Run                  Saves and runs the current GUI.
Table A.1: GUI tools summary
Figure A.1: GUIDE layout editor
The component palette at the left side of the Layout Editor contains the
components that can be added to the GUI. Those used in the GUI SAS are
described in Table A.2. [19]
Component     Description
Push button   Generates an action when clicked.
Edit text     Fields that enable users to enter or modify text strings.
Static text   Controls that display lines of text.
Popup menu    Opens to display a list of choices when the user clicks the arrow.
Table A.2: GUIDE components used in GUI SAS, with a short description of each
The GUIDE code controls how the GUI responds to events. Events include button
clicks, slider movements, menu item selections, and the creation and deletion of
components. This programming takes the form of a set of functions, called callbacks, one for
each component and for the GUI figure itself. [19]
A callback is a function that is written and associated with a specific GUI
component or with the GUI figure. [19]
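For a GUIDE-generated GUI such as SAS, each callback is a subfunction whose name and signature GUIDE writes automatically. A sketch of the pattern follows (the component name mybutton is hypothetical):

```matlab
% GUIDE-style callback skeleton (illustrative only).
% h       - handle of the component that fired the event
% handles - structure of handles to all components in the GUI
function varargout = mybutton_Callback(h, eventdata, handles, varargin)
handles.count = 1;          % store data to be shared between callbacks
guidata(h, handles);        % save the modified handles structure
```

This is the same pattern followed by the callbacks in Appendix B, e.g. record_Callback and play_Callback.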
APPENDIX B THE MATLAB CODE
B1
APPENDIX B
THE MATLAB CODE
function varargout = sas(varargin)
% SAS Application M-file for sas.fig
% FIG = SAS launch sas GUI.
% SAS('callback_name', ...) invoke the named callback.
% Last Modified by GUIDE v2.0 15-Jul-2013 04:35:49
if nargin == 0 % LAUNCH GUI
fig = openfig(mfilename,'reuse');
% Use system color scheme for figure
set(fig,'Color',get(0,'defaultUicontrolBackgroundColor'));
% Generate a structure of handles to pass to callbacks, and store it.
handles = guihandles(fig);
guidata(fig, handles);
if nargout > 0
varargout{1} = fig;
end
elseif ischar(varargin{1}) % INVOKE NAMED SUBFUNCTION OR CALLBACK
try
if (nargout)
[varargout{1:nargout}] = feval(varargin{:}); % FEVAL switchyard
else
feval(varargin{:}); % FEVAL switchyard
end
catch
disp(lasterr);
end
end
% -----------------------------( Record )---------------------------------------
function varargout = record_Callback(h, eventdata, handles, varargin)
fs = 44100;
p=handles.edit11;
y = wavrecord(p*fs,fs);
[filename, pathname] = uiputfile('*.wav', 'Pick an M-file');
cd (pathname);
wavwrite(y,fs,filename);
sound(y,fs);
handles.x = y;
handles.fs = fs;
figure(1)
time = 0:1/fs:(length(handles.x)-1)/fs;
plot(time,handles.x);
title('Original Signal ');
figure(2)
specgram(handles.x, 1024, handles.fs);
title(' Spectrogram of Original Signal ');
guidata(h, handles);
% ----------------------( Exit )----------------------------------------------
function varargout = exit_Callback(h, eventdata, handles, varargin)
cl = questdlg('Do you want to EXIT?','EXIT',...
'Yes','No','No');
switch cl
case 'Yes'
close(sas);
clear all;
return;
case 'No'
return;
end
% ----------------------------( Play )----------------------------------------
function varargout = play_Callback(h, eventdata, handles, varargin)
clc;
[FileName,PathName] = uigetfile({'*.wav'},' Load Wav File ');
[x,fs] = wavread([PathName '/' FileName]);
handles.x = x;
handles.fs = fs;
sound(x,fs);
figure(1)
time = 0:1/fs:(length(handles.x)-1)/fs;
plot(time,handles.x);
title(' Original Signal ');
figure(2)
specgram(handles.x, 1024, handles.fs);
title(' Spectrogram of Original Signal ');
guidata(h,handles);
% --------------------------( Input the number of framed )------------------------------------------
function varargout = edit10_Callback(h, eventdata, handles, varargin)
user_entry = str2double(get(h,'string'));
if isnan(user_entry)
errordlg('You must enter a numeric value','Bad Input','modal')
uicontrol(h)
return
end
handles.user_entry = user_entry;
guidata(h, handles);
% ----------------------( Input the time length of the record )--------------------------------------
function varargout = edit11_Callback(h, eventdata, handles, varargin)
p = str2double(get(h,'String'));
if isnan(p)
errordlg('You must enter a numeric value','Bad Input','modal')
uicontrol(h)
return
end
handles.edit11 =p;
guidata(h, handles);
% -------------------------( Analysis And Synthesis techniques )----------------------------------
function varargout = sas_Callback(h, eventdata, handles, varargin)
val = get(h,'Value');
switch val
case 1
%------( FFT analysis )--------------
[FileName,PathName] = uigetfile({'*.wav'},'Load Wav File');
[x,fs] = wavread([PathName '/' FileName]);
handles.x = x;
handles.fs = fs;
t=0:1/fs:(length(x)-1)/fs; % time axis from the sampling frequency
figure(1)
plot(t,x); % graph it.
legend(' The orignal speech signal ');
xlabel(' The time (t) ');
ylabel(' Signal amplitude (x) ');
wl = handles.user_entry;
guidata(h,handles);
L=length(x);
x=x';
if L<wl
z=wl-L;
x=[x,zeros(1,z)];
end
win=ones(1,wl); % Rectangular Window
hop=ceil(wl/2); % Hop size of window
if hop<1
hop=wl;
end
i=1;str=1; len=wl; X=[];
while ((len<=L) | (i<2))
if i==1
if len>L % If window size exceeds the length of the signal for the 1st time
z=len-L;
x=[x,zeros(1,z)]; % padding zeros
i=1+1;
end
x1=x(str:len);
%wl=wl*[0:(wl-1)].';
size(wl);
size(x1);
X=[X;fft(x1.*win)]; % window the frame and take its FFT
str=str+hop; len=str+wl-1; % to make window overlapping
end
end
figure ,
waterfall(abs(X)); title('Result by waterfall function')
guidata(h, handles);
%--------------( The FFT synthsis )--------------------------
fs = 44100;
ftsize = (size(X,2));
s = size(X);
cols = s(1);
% special case: rectangular window
win = ones(1,ftsize);
w = length(win);
% now can set default hop
h = floor(w/2);
xlen = ftsize + (cols-1)*h;
y = zeros(1,xlen);
for b = 0:h:(h*(cols-1))
ft = X(1+b/h,:)';
px = real(ifft(ft));
y((b+1):(b+ftsize)) = y((b+1):(b+ftsize))+px'.*win;
end;
t=0:1/fs:(length(y)-1)/fs; % time axis from the sampling frequency
figure ,
plot(t,y); % graph it.
title(' The compressed speech signal ')
guidata(h, handles);
case 2
%------------( LPC analysis and synthesis)---------
[FileName,PathName] = uigetfile({'*.wav'},'Load Wav File');
[inspeech,Fs] = wavread([PathName '/' FileName]);
handles.inspeech = inspeech;
handles.Fs = Fs;
outspeech1 = speech1(inspeech,Fs);
outspeech2 = speech2(inspeech,Fs); % Voice-excited LPC vocoder
outspeech3 = speech3(inspeech,Fs); % Randomly excited LPC vocoder
% plot results
figure(1);
subplot(4,1,1);
plot(inspeech);
legend(' The original speech signal ');
grid;
subplot(4,1,2);
plot(outspeech1);
legend(' The LPC compressed speech signal ');
grid;
subplot(4,1,3);
plot(outspeech2);
legend(' The voice-excited LPC compressed speech signal ');
grid;
subplot(4,1,4);
plot(outspeech3);
legend(' The randomly-excited LPC compressed speech signal ');
grid;
s1=sprintf(' Press any key to play the original sound file ');
set(handles.edit14,'String',s1);
pause;
soundsc(inspeech, Fs);
s2=sprintf(' Press any key to play the LPC compressed file! ');
set(handles.edit14,'String',s2);
pause;
soundsc(outspeech1, Fs);
s3=sprintf(' Press a key to play the voice-excited LPC compressed sound! ');
set(handles.edit14,'String',s3);
pause;
soundsc(outspeech2, Fs);
s4=sprintf(' Press a key to play the randomly-excited LPC compressed sound! ');
set(handles.edit14,'String',s4);
pause;
soundsc(outspeech3, Fs);
s5=sprintf(' The output text ... ');
set(handles.edit14,'String',s5);
end
% --------------------------------------------------------------------------
function [ outspeech ] = speech1( inspeech , Fs, Order)
% (coded and resynthesized)
if ( nargin < 1)
error('argument check failed');
end;
if(nargin<2)
Fs = 16000; % sampling rate in Hertz (Hz)
end
if (nargin<3)
Order = 10; % order of the model used by LPC
end
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor] =lpc_analysis(inspeech, Fs, Order);
figure,
plot(resid);
% decode/synthesize speech using LPC and impulse-trains as excitation
outspeech = lpcsyn(aCoeff, pitch, Fs, G);
%------------------------------------------------------------------------------------
function [ outspeech] = speech2( inspeech,Fs )
% (coded and resynthesized)
if ( nargin ~= 2)
error('argument check failed');
end;
Order = 10; % order of the model used by LPC
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor] = lpc_analysis(inspeech, Fs, Order);
% perform a discrete cosine transform on the residual
resid = dct(resid);
[a,b] = size(resid);
% only use the first Nr (i.e. 50) DCT-coefficients this can be done
% because most of the energy of the signal is conserved in these coeffs
Nr=180;
resid = [ resid(1:Nr,:); zeros(a-Nr,b) ];
% quantize the data using Nq (i.e. 4) bits
Nq=6;
resid = uencode(resid,Nq);
resid = udecode(resid,Nq);
% perform an inverse DCT
resid = idct(resid);
% add some noise to the signal to make it sound better
noise = [ zeros(Nr,b); 0.01*randn(a-Nr,b) ];
resid = resid + noise;
figure,
plot(resid);
title(' the Residual signal plus noise ' );
% decode/synthesize speech using LPC and the compressed residual as excitation
outspeech = lpcsyn2(aCoeff, resid, Fs, G);
%----------------------------------------------------------------------------------
function [ outspeech ] = speech3( inspeech,Fs )
% (coded and resynthesized)
if ( nargin ~= 2)
error('argument check failed');
end;
Order = 10; % order of the model used by LPC
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor] = lpc_analysis(inspeech, Fs, Order);
[a,b] = size(resid);
% use a random signal as the residual
resid= 0.01*randn(a,b);
figure,
plot(resid);
title(' random residual signal ');
% decode/synthesize speech using LPC and the compressed residual as excitation
outspeech = lpcsyn2(aCoeff, resid, Fs, G);
%--------------------------------------------------------------------------
function [aCoeff,resid,pitch,G,parcor] = lpc_analysis(data,sr,L,fr,fs,preemp)
%This function computes the LPC (linear-predictive coding) coefficients that describe a
% speech signal.
% L - The order of the analysis. There are L+1 LPC coefficients in the output
% array aCoeff for each frame of data.
% fr - Frame time increment, in ms. The LPC analysis is done starting every
% fr ms in time. Defaults to 20ms (50 LPC vectors a second)
% fs - Frame size in ms. The LPC analysis is done by windowing the speech
% data with a rectangular window that is fs ms long. Defaults to 30ms
% preemp - This variable is the epsilon in a digital one-zero filter which
% serves to preemphasize the speech signal and compensate for the 6dB
% per octave rolloff in the radiation function. Defaults to .9378.
% The output variables from this function are aCoeff - The LPC analysis results, a(i).
%One column of L numbers for each frame of data
% resid - The LPC residual, e(n). One column of sr*fs samples representing
% the excitation or residual of the LPC filter.
% pitch - A frame-by-frame estimate of the pitch of the signal, calculated
% by finding the peak in the residual's autocorrelation for each frame.
% G - The LPC gain for each frame.
% parcor - The parcor coefficients. The parcor coefficients give the ratio
% between adjacent sections in a tubular model of the speech
% articulators. There are L parcor coefficients for each frame of speech.
if (nargin<3), L = 13; end
if (nargin<4), fr = 20; end
if (nargin<5), fs = 30; end
if (nargin<6), preemp = .9378; end
[row col] = size(data);
if col==1, data=data'; end
nframe = 0;
msfr = round(sr/1000*fr); % Convert ms to samples
msfs = round(sr/1000*fs); % Convert ms to samples
duration = length(data);
speech = filter([1 -preemp], 1, data)'; % Preemphasize speech
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]'; % Compute part of window
for frameIndex=1:msfr:duration-msfs+1 % frame rate=20ms
frameData = speech(frameIndex:(frameIndex+msfs-1)); % frame size=30ms
nframe = nframe+1;
autoCor = xcorr(frameData); % Compute the cross correlation
autoCorVec = autoCor(msfs+[0:L]);
% Levinson's method
err(1) = autoCorVec(1);
k(1) = 0;
A = [];
for index=1:L
numerator = [1 A.']*autoCorVec(index+1:-1:2)';
denominator = -1*err(index);
k(index) = numerator/denominator; % PARCOR coeffs
A = [A+k(index)*flipud(A); k(index)];
err(index+1) = (1-k(index)^2)*err(index);
end
aCoeff(:,nframe) = [1; A];
parcor(:,nframe) = k';
if 0
gain=0;
cft=0:(1/255):1;
for index=1:L
gain = gain + aCoeff(index,nframe)*exp(-i*2*pi*cft).^index;
end
gain = abs(1./gain);
spec(:,nframe) = 20*log10(gain(1:128))';
figure(3);
plot(20*log10(gain))
title(nframe)
drawnow;
end
% Calculate the filter response from the filter's impulse response (to check above).
if 0
impulseResponse = filter(1, aCoeff(:,nframe), [1 zeros(1,255)]);
freqResp = 20*log10(abs(fft(impulseResponse)));
plot(freqResp);
end
errSig = filter([1 A'],1,frameData); % find excitation noise
G(nframe) = sqrt(err(L+1)); % gain
autoCorErr = xcorr(errSig); % calculate pitch & voicing information
[B,I] = sort(autoCorErr);
num = length(I);
if B(num-1) > .01*B(num)
pitch(nframe) = abs(I(num) - I(num-1));
else
pitch(nframe) = 0;
end
% calculate additional info to improve the compressed sound quality
resid(:,nframe) = errSig/G(nframe);
if(frameIndex==1) % add residual frames using a trapezoidal window
stream = resid(1:msfr,nframe);
else
stream = [stream;
overlap+resid(1:msoverlap,nframe).*ramp; resid(msoverlap+1:msfr,nframe)];
end
if(frameIndex+msfr+msfs-1 > duration)
stream = [stream; resid(msfr+1:msfs,nframe)];
else
overlap = resid(msfr+1:msfs,nframe).*flipud(ramp);
end
end
stream = filter(1, [1 -preemp], stream)';
%---------------------------------------------------------------------------------------
function synWave = lpcsyn(aCoeff,pitch,sr,G,fr,fs,preemp)
% This function synthesizes a (speech) signal based on a LPC (linear-
% predictive coding) model of the signal.
% LPC synthesis produces a monaural sound vector (synWave) which is
% sampled at a sampling rate of "sr". The following parameters are mandatory
% aCoeff - The LPC analysis results, a(i). One column of L+1 numbers for each
% frame of data. The number of rows of aCoeff determines L.
% G - The LPC gain for each frame.
% pitch - A frame-by-frame estimate of the pitch of the signal, calculated
% by finding the peak in the residual's autocorrelation for each frame.
% The following parameters are optional and default to the indicated values.
% fr - Frame time increment, in ms. The LPC analysis is done starting every
% fr ms in time. Defaults to 20ms (50 LPC vectors a second)
% fs - Frame size in ms. The LPC analysis is done by windowing the speech
% data with a rectangular window that is fs ms long. Defaults to 30ms
% preemp - This variable is the epsilon in a digital one-zero filter which
% serves to preemphasize the speech signal and compensate for the 6dB
% per octave rolloff in the radiation function. Defaults to .9378.
if (nargin < 5), fr = 20; end;
if (nargin < 6), fs = 30; end;
if (nargin < 7), preemp = .9378; end;
msfs = round(sr*fs/1000); % framesize in samples
msfr = round(sr*fr/1000); % framerate in samples
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]';
[L1 nframe] = size(aCoeff); % L1 = 1+number of LPC coeffs
for frameIndex=1:nframe
A = aCoeff(:,frameIndex);
% first check if it is voiced or unvoiced sound:
if ( pitch(frameIndex) ~= 0 )
t = 0 : 1/sr : fs*10^(-3); % sr sample freq. for fr ms
d = 0 : 1/pitch(frameIndex) : 1; % 1/pitchfreq. repetition freq.
residFrame = 0.1*(pulstran(t, d, 'tripuls',1/pitch(frameIndex)))';
% train of triangular pulses repeated at the pitch period
residFrame = residFrame + 0.01*randn(msfs+1,1);
else
residFrame = [];
for m = 1:msfs
residFrame = [residFrame; randn];
end % for
end;
synFrame = filter(G(frameIndex), A', residFrame); % synthesize speech from LPC coeffs
if(frameIndex==1) % add synthesized frames using a trapezoidal window
synWave = synFrame(1:msfr);
else
synWave = [synWave;
overlap+synFrame(1:msoverlap).*ramp;synFrame(msoverlap+1:msfr)];
end
if(frameIndex==nframe)
synWave = [synWave; synFrame(msfr+1:msfs)];
else
overlap = synFrame(msfr+1:msfs).*flipud(ramp);
end
end;
%-------------------------------------------------------------------------------------------
function synWave = lpcsyn2(aCoeff,resid,sr,G,fr,fs,preemp)
% This function synthesizes a (speech) signal based on a LPC (linear-
% predictive coding) model of the signal. Used with the voice excited and randomly
%excited LPC.
% LPC synthesis produces a monaural sound vector (synWave) which is
% sampled at a sampling rate of "sr". The following parameters are mandatory
% aCoeff - The LPC analysis results, a(i). One column of L+1 numbers for each
% frame of data. The number of rows of aCoeff determines L.
% G - The LPC gain for each frame.
% pitch - A frame-by-frame estimate of the pitch of the signal, calculated
% by finding the peak in the residual's autocorrelation for each frame.
% The following parameters are optional and default to the indicated values.
% fr - Frame time increment, in ms. The LPC analysis is done starting every
% fr ms in time. Defaults to 20ms (50 LPC vectors a second)
% fs - Frame size in ms. The LPC analysis is done by windowing the speech
% data with a rectangular window that is fs ms long. Defaults to 30ms
% preemp - This variable is the epsilon in a digital one-zero filter which
% serves to preemphasize the speech signal and compensate for the 6dB
% per octave rolloff in the radiation function. Defaults to .9378.
if (nargin < 5), fr = 20; end;
if (nargin < 6), fs = 30; end;
if (nargin < 7), preemp = .9378; end;
msfs = round(sr*fs/1000); % framesize in samples
msfr = round(sr*fr/1000); % framerate in samples
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]';
[L1 nframe] = size(aCoeff); % L1 = 1+number of LPC coeffs
for frameIndex=1:nframe
A = aCoeff(:,frameIndex);
residFrame = resid(:,frameIndex);
synFrame = filter(G(frameIndex), A', residFrame); % synthesize speech from LPC coeffs
if(frameIndex==1) % add synthesized frames using a trapezoidal window
synWave = synFrame(1:msfr);
else
synWave = [synWave;
overlap+synFrame(1:msoverlap).*ramp;synFrame(msoverlap+1:msfr)];
end
if(frameIndex==nframe)
synWave = [synWave; synFrame(msfr+1:msfs)];
else
overlap = synFrame(msfr+1:msfs).*flipud(ramp);
end
end;
% --------------------------------------------------------------------