National Institute for Higher Education, Dublin
School of Electronic Engineering
Thesis Submitted for Degree of Masters of Engineering
VOWEL CODING USING AN ARTICULATORY MODEL
By
Mary Murphy B.E. (Elec)
Submitted to
Dr. Sean Marlow B.Sc. PhD.
September 1988
The research contained herein was completed by me, the undersigned.
TABLE OF CONTENTS
Abstract
Acknowledgements
1. Introduction
1.1 Coding of Speech at Low Bit Rates
1.2 Vocoders
1.3 Articulatory Vocoders
1.4 Thesis Overview
2. The Mechanism of Speech Production
2.1 Introduction
2.2 Human Speech Production
2.2.1 Voice Source Generation
2.2.2 Articulation
2.3 Models for Speech Production
2.3.1 Source Models
2.3.2 Articulatory Models
2.4 The ASY Synthesiser
2.4.1 Mermelstein’s Model
2.4.2 Source Excitation for ASY
3. Sound Propagation in the Vocal Tract
3.1 Introduction
3.2 Sound Propagation
3.2.1 Transfer Function in the Sampled Time Domain
3.3 Transfer Function of the ASY Synthesiser
4. Linear Predictive Coding of Speech
4.1 Introduction
4.2 LPC Model for Speech Production
4.3 Solution of LPC
4.3.1 Prony’s Method
4.3.2 Autocorrelation Method
4.3.2.1 Solution of the Autocorrelation Method
4.3.2.2 Choice of Window
4.3.3 Covariance Method
4.3.3.1 Solution of the Covariance Method
4.3.4 PARCOR Analysis (Lattice Method)
4.4 Relationship between PARCOR Analysis and the Acoustic Tube Model
5. Estimation of the Vocal Tract Transfer Function
5.1 Introduction
5.2 A New Model of Speech Production
5.3 Methods for Extracting the Vocal Tract Transfer Function
5.3.1 Inverse Filtering Methods
5.3.1.1 Experimental Procedures
5.3.2 Covariance over the Closed Glottis Interval
5.3.2.1 Existing Methods for Obtaining Glottal Closure
5.3.2.2 Algorithm for Extracting Optimum Location
5.3.2.3 Experimental Procedures
5.4 Extraction of Glottal Parameters
5.5 Experimental Results and Conclusions
5.5.1 Comparison of Preemphasis Methods
5.5.2 Comparison of CGI and First-Order Preemphasis Methods
5.5.3 Explanation of Results
5.5.3.1 Autocorrelation vs. Covariance Methods
5.5.3.2 Source-Tract Interaction
6. Articulatory Speech Coding
6.1 Introduction
6.2 General Speech Coding System
6.3 Quantization
6.3.1 Vector Quantization
6.4 Distortion Measures
6.4.1 Distortion Measures Based on the Mean Squared Error
6.4.2 Distortion Measures Based on the Weighted Mean Squared Error
6.4.3 Itakura-Saito Distortion Measure
6.5 Articulatory Speech Coding System
6.6 Generation of the Articulatory Codebook
6.6.1 Sensitivity Analysis Method
6.6.2 Limitations of the Sensitivity Analysis Method
6.6.3 Training Set Method
6.7 Linked Codebook Generation
6.8 Evaluation of Distortion Measures
6.9 Glottal Codebook Design
7. Estimation of Articulatory Parameters from the Speech Wave
7.1 Introduction
7.2 Inverse Problem of the Vocal Tract
7.2.1 Regression Analysis
7.2.2 Constrained Optimization
7.3 Shirai’s Method
7.4 Application of Shirai’s Method
7.4.1 Minimization Algorithm
7.4.1.1 Adaptation of Mermelstein’s Model
7.4.1.2 Choice of Initial Estimate
7.4.1.3 Choice of Weighting Matrices
7.4.1.4 Computation of Derivative of h(x)
7.4.1.5 Choice of Acoustic Parameters
7.4.1.6 Convergence Criteria
7.5 Analysis Results and Possible Improvements
8. Discussion, Improvements and Conclusions
8.1 Convergence Techniques
8.2 Bit Rates
8.3 Resources and Computation
8.4 Recording Conditions
8.5 Sampling Rate
8.6 Limitations of CGI / Alternatives
8.7 Completion of Articulatory Vocoder
8.8 Conclusions
References
ABSTRACT
An investigation into articulatory vocoding for vowels, as a means of achieving high quality coding at low bit rates, is carried out in this thesis. Methods of estimating the vocal tract transfer function from the speech wave are compared, and an algorithm for closed glottis interval (CGI) analysis is developed. CGI analysis is chosen over autocorrelation-based inverse filtering methods.
Various distortion measures for use in Vector Quantization are evaluated, and a new covariance distortion measure is proposed. It is shown that this measure yields close matches from an acoustic codebook.
An articulatory coding system is designed, including a linked codebook of articulatory shapes, based on synthetic speech. A method of generating a similar codebook from real speech is proposed, and an investigation into estimating articulatory parameters from the speech wave is carried out to this end.
ACKNOWLEDGEMENTS
Much thanks to my supervisor, Dr. Sean Marlow, for his advice and encouragement during the last two years. I would also like to thank Dr. Ronan Scaife of UCG for providing the ASY synthesiser, and Dr. Frank Owens and Sean Murphy of the University of Ulster, Jordanstown for helping me obtain speech data. Thanks to all my fellow Post-Grads for the crack, coffee, pool and crosswords, and finally to my parents, without whom this would never have been written.
CHAPTER 1 INTRODUCTION
1.1 Coding of Speech at Low Bit Rates
Although high bandwidth channels and networks are becoming more
viable, coding speech at low bit rates has retained its importance. Specific
applications include:
(i) Digital encryption, i.e. situations where high security is required over
low data rate channels such as radio links.
(ii) Memory-efficient systems for voice storage, e.g. voice mail.
(iii) Mobile telephony, where more users can be accommodated on
cellular radio or satellite links.
Developers of digital speech coders strive to optimize the interplay of four
parameters: bit rate, quality, complexity and delay time. As bit rate is reduced,
quality naturally drops off, unless complexity is increased. At high bit rates, e.g.
64 kb/s, used in pulse code modulation, quality is not a problem, but it is
believed that high quality coding may eventually be practical at rates as low as
2 kb/s [4].
There are two main types of coders:
(i) Waveform coders, which attempt to send an approximation of the
speech signal,
(ii) Vocoders (Voice Coders), which attempt to model the speech
production mechanism directly, and send parameters which accurately
describe the speech production process.
Vocoders result in a drastic reduction in bit rate, and are of primary importance
for speech coding.
1.2 Vocoders
In the basic vocoder synthesiser, it is assumed that speech is generated by a vocal
tract filter being excited by either a regular pulse source or random noise.
Spectral coefficients specifying the vocal tract filter response define the speech
formants, while pitch and voicing are defined by the pitch value and a binary
voiced / unvoiced decision to select the source of excitation.
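The excitation side of this basic synthesiser can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `vocoder_excitation`, the unit pulse amplitude, the Gaussian noise source and the frame length are assumptions made here, not details taken from any particular vocoder.

```python
import random

def vocoder_excitation(n_samples, voiced, pitch_period, seed=0):
    """Return one frame of excitation for a basic vocoder synthesiser.

    voiced=True  -> regular unit pulses, one every `pitch_period` samples
    voiced=False -> zero-mean random noise (Gaussian, an assumption here)
    """
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

# The binary voicing decision selects exactly one of the two sources per frame.
voiced_frame = vocoder_excitation(200, voiced=True, pitch_period=80)
unvoiced_frame = vocoder_excitation(200, voiced=False, pitch_period=80)
```

The sketch makes the two restrictions discussed later in this section visible: the voiced source is a perfectly regular pulse train, and a frame is either wholly voiced or wholly unvoiced.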
There are two main types of vocoders:
(i) Channel Vocoders: In this type, typified by Holmes [1], there are
typically 15-20 channels, each being a spectrum analyser consisting of
a bandpass filter, a rectifier and a low pass filter. These are used to
determine the spectral shape.
(ii) LP (Linear Predictive) Vocoders: This type is based on linear
predictive coding (LPC), a speech analysis method first introduced by
Atal et al. [2]. Coefficients of an N-pole digital filter, determined
from LPC analysis, are used to describe the speech.
In both cases, a pitch value and voicing parameter are extracted simultaneously.
In general, speech quality for the two types is comparable, with the signal
processing required being somewhat greater for the channel type [3].
Features of the above basic vocoder types impose fundamental limitations on the
speech quality obtainable. The main restricting features are:
(i) Regular pulses for the voiced excitation
(ii) Binary voicing decision - the synthesised speech can only be purely
voiced or unvoiced.
Both the above lead to an artificial quality. It is generally agreed that LPC is
based on a clearly oversimplified model of the voice source [5], although this
simplification gives the advantage that a direct and efficient analysis can be used.
1.3 Articulatory Vocoders
An alternative to the general vocoder design is to use
articulatory parameters for coding speech. As well as providing an economical
description of speech, an articulatory vocoder has the following advantages over
traditional vocoders:
(i) Articulatory parameters model speech production directly, thus inherently
incorporating physiological constraints that exist in the human vocal
tract. For example, transitional effects due to tongue and jaw inertia
may be modelled directly. An articulatory synthesiser has the potential
to produce natural sounding speech at bit rates below 4800b/s.
(ii) The coding (including excitation) parameters have a physiological base
and vary slowly. A parametric model of voiced excitation i.e. a
glottal source model is usually incorporated in an articulatory vocoder.
This overcomes the disadvantages of a binary voicing decision.
(iii) Interpolation between parameters (shapes) results in physically realisable
intermediate shapes, which is not always the case for LPC parameters.
Slightly erroneous parameters do not usually result in unnatural speech.
Flanagan [3] has extolled the possibilities of an articulatory vocoder, and
recommended it above other types. However, the success of an articulatory
vocoder is dependent on how accurately articulatory data may be obtained from
the speech signal. Much research has been done into this problem; however,
results have mainly been used for speech recognition, and surprisingly little
knowledge has been applied to articulatory vocoding. The simplest type of
articulatory vocoder uses area functions obtained from direct speech analysis (LPC)
as the parameters, which offers no significant advantage over traditional vocoder
types. At the other end of the scale is a vocoder recently developed by Sondhi
et al., based on their extremely complex speech synthesiser [4], the parameters of
which are difficult to obtain from the speech signal. One of the main problems
of this type is that its voicing parameters have a physiological base, the detail of
which introduces many problems for speech analysis.
To extract the source excitation (glottal signal), the vocal tract model estimated
from the speech wave must be very accurate. Thus a method which extracts the
true glottal waveform would simultaneously extract excellent parameters to
represent the vocal tract transfer function. From these vocal tract parameters an
accurate representation of the vocal tract shape, and hence positions of the
articulatory organs may be estimated.
1.4 Thesis Overview
In this thesis, a compromise between the two extreme articulatory vocoder types is
proposed, and a quality articulatory vocoder for vowels sounds is designed. The
glottal waveform is extracted from the speech wave by a technique generally
known as glottal inverse filtering. Specifically, this involves a modified type of
linear predictive analysis. Conventional LPC methods are first detailed, and from
these, methods for extracting an accurate vocal tract shape for vowels are
developed and compared. Closed glottal interval covariance analysis is
investigated, and a new improved algorithm is presented for the method. This
method is compared to pitch synchronous and asynchronous analyses which use the
autocorrelation method with various types of preemphasis.
The application of vector quantization to articulatory coding is then discussed, and
a comparison of suitable distortion measures undertaken. A distortion measure,
based on one developed for the autocorrelation method of LPC, but modified for
the covariance method, is then proposed. The results of the comparisons are later
taken into account in determining the best acoustic match for constructing a
codebook.
The construction of an articulatory codebook, and methods for quantizing
articulatory parameters are discussed. The idea of a linked codebook of acoustic
and articulatory parameters is presented, and one is generated, based on synthetic
speech. A natural follow-on, using real speech, is proposed, and methods for
constructing such a codebook discussed. This prompts a discussion on methods of
obtaining the articulatory parameters directly from the speech wave, and one of
these methods is investigated in detail.
Finally, methods for improving the existing set-up are proposed, and possibilities
of its extension to other types of speech sounds are outlined. Directions for
future research are proposed.
2. THE MECHANISM OF SPEECH PRODUCTION
2.1 Introduction
In this chapter, the human physiological speech production process, in relation to
vowels, is presented. The generation of voiced source excitation and the
articulation process are discussed. Articulatory models, which attempt to model
this process to reproduce the acoustic speech waveform, are reviewed. Finally, an
introduction to ASY, the articulatory synthesiser used in this research, is presented.
This chapter forms the background to Chapter 3, which examines the acoustic
process of speech production.
2.2 Human Speech Production
Voiced speech waveforms are generated by a speech production process consisting
of two main parts:
(i) Voice source generation
(ii) Articulation
The machinery involved is shown in Fig 2.1.
Fig. 2.1 The human speech production mechanism [3]
2.2.1 Voice Source Generation
The energy source for speech production is the respiratory system pushing air out
of the lungs. The air passes through the trachea and vocal cords of the larynx
into the pharynx (throat cavity) and mouth. The voiced sounds of speech are
produced by the vibratory action (i.e. phonation) of the vocal cords. The larynx
is also known as the voice box, as its purpose is to hold the vocal cords in the
correct position and tension for phonation. The orifice between the cords is
known as the glottis. The vocal cords are suspended within a cage of cartilage,
and by using a set of muscles attached to this cartilage, they can be moved as
required. The action proceeds as follows:
Assume initially that the cords are together. The subglottal pressure increases,
forcing them apart. As the air flow through the cords increases, the local
pressure drops, according to the Bernoulli effect, and this results in the cords
being sucked together again. Thus quasi-periodic pulses of air excite the vocal
tract for voiced sounds.
The pitch (frequency of oscillation) depends on both the vocal cord tension and
their mass per unit length. The volume of air through the glottis as a function
of time is roughly proportional to the area of glottal opening. The waveforms are
approximately triangular in shape, and typical duty cycles (i.e. ratio of open time
to total period) are of the order of 0.3 to 0.7. The glottal waveform shape
varies greatly for a given individual, depending on sound pitch and intensity. The
pitch normally ranges from 50 - 200 Hz for men, with women and children of the
order of an octave higher.
2.2.2 Articulation
The vocal tract is a nonuniform acoustic tube formed by the articulatory organs.
It begins at the glottis and ends at the mouth. It is connected to the nasal tract,
which stretches from the velum to the nostrils. The velum controls the acoustic
coupling between the tracts, i.e. when it is open the tracts are coupled
acoustically, and nasalized sounds are produced. The tract accentuates certain
frequencies by resonance, producing each sound with an individual quality. This
process is called articulation.
From observation, vowel sounds are dependent on the vocal tract shape as a
whole, and may be characterised by three parameters:
(i) the minimum cross-section area, usually at the tongue hump
(ii) the distance of (i) from the glottis, and
(iii) the magnitude of the lip opening.
Fig 2.2 Corresponding positions in the tract for vowels in the words: (1)
"heed", (2) "hid", (3) "head", (4) "had", (5) "hod", (6) "hawed", (7)
"hood", (8) "who’d" [5]
These characteristic shapes are produced by the movement of a combination of
articulators, i.e. the tongue, jaw, lips, and to a lesser extent the velum. The
position of the tongue separates vowels into front/back and high/low classifications.
Klatt [6] also used a lip classification, i.e. rounded/unrounded. For nonnasalized
voiced sounds, the velum is closed. The physiological basis for these
classifications may be seen in Fig 2.2. A rapid transition from one vowel to
another is known as a diphthong.
Following articulation, the speech is radiated at the mouth. The acoustic
consequences of lip radiation are discussed in Chapter 3.
2.3 Models for Speech Production
All speech utterances, however varied, have one unifying factor - their origin, the
human speech production process. For this reason, the advantages of mimicking
this process are many - such problems as speaker differences, accents and
coarticulation effects may be overcome by accurate modelling. Hence speech
production modelling is a very active area of speech research, contributing to more
natural sounding speech synthesis, better recognition rates, and improved coding
quality. Models for speech production consist of two parts, the excitation of the
vocal cords, and the articulators of the tract.
2.3.1 Source Models
Source models vary greatly in detail and accuracy. The most realistic
physiologically based model of the vocal cords is Ishizaka and Flanagan’s two
mass model [7], shown in Fig 2.3.
Fig 2.3 Flanagan’s Two-Mass Model [7] (block diagram: lungs, trachea, vocal cords, vocal tract, mouth)
The model is a non-linear system, dependent on the supraglottal pressure in the
vocal tract. Thus, it accounts for the interaction between the glottal volume
velocity and the input impedance of the vocal tract. Each vocal cord is described
by two masses, with associated stiffnesses and losses. For voiced sounds
$$ R_{tot}\,U_g + L_{tot}\,\frac{dU_g}{dt} = P_s - P_1 \qquad (2.1) $$
where Ps is the lung (subglottal) pressure, P1 is the supraglottal (vocal tract)
pressure, and Ug is the volume velocity. Rtot and Ltot are the total
quasi-stationary resistance and inductance representing the expansion and contraction
of the vocal cords and are dependent on both the glottal area and the area of the
first section of the vocal tract.
The model parameters are the lung pressure, vocal cord tension, and glottal
opening area. Both the pitch and glottal waveform are dependent on the lung
pressure and glottal rest area. The pitch is controlled by the vocal cord tension.
The effect of the acoustic properties of the trachea and lungs has been shown to
be minor by Wakita and Fant [8], and is ignored. Experiments with one-mass
models found that the source tract interaction was very dependent on the assumed
intraglottal pressure distributions [3], while experiments with multi-mass models [10]
found they were no better than the two-mass model, in fact they overemphasised
source tract interaction.
While this model produces very natural sounding speech, and is the most accurate
developed, limited knowledge of the voice anatomy and the difficulties of
obtaining the model parameters from the speech wave have meant that more
simplified glottal models are often used. These model the glottal waveform, rather
than its physiological base [10].
2.3.2 Articulatory Models
The design of articulatory models i.e. those which attempt to model the movement
of articulators directly, has always been a prominent area of speech research. The
first articulatory model of significance was developed by Stevens and House [11],
who presented the three parameter model described earlier, representing the vocal
tract shape for English vowels. Using this, Fant [12] attempted to reconstruct
speech spectra based on X-ray data for Russian vowels.
Initially, tract models for speech synthesis [13] used area functions as input.
Following the success of Stevens and House, however, models controlled by
articulators have been developed. This approach supports the view that the value
of an articulatory model lies in the extent to which it can produce significant
detail in its output from simple inputs. Articulator movements try to match the vocal tract
shape rather than resolve individual muscles. For the generation of most
articulatory shapes, a model with seven to ten degrees of freedom should suffice.
Coker’s model [14] uses independent and semi-independent articulators, e.g. the tongue
tip relative to the tongue body. A target approach is used where the motion of each
articulator is characterized by a time constant dependent on its weight and the
available muscular forces. Like Coker’s model, Mermelstein’s [15] attempts to
match real X-ray data. Although similar in many respects, each places a different
emphasis on speech production. Coker’s, through incorporating a dynamic
controller, stresses synthesis by rule, while the latter concentrates on interactive
and systematic control of articulatory configurations and the subsequent acoustic
and perceptual effects.
2.4 The ASY Synthesiser
ASY, the research synthesiser developed by Rubin and Baer [16] using
Mermelstein’s model, is used in this project.
2.4.1 Mermelstein’s Model
The movable articulators in this model (see Fig 2.4) are the tongue, jaw, lips,
velum and hyoid.
Fig 2.4 Mermelstein’s Model
These are surrounded by a fixed structure consisting of the rear pharyngeal wall
and the maxilla, which limits the range of the articulators for consonant
articulations. The emphasis when developing the model was manual matching
with X-ray tracings obtained from Perkell [17]. The specification of the key
articulator positions completely determines the vocal tract outline. These are
described as follows:
(i) The jaw is defined by its location, J, (in polar coordinates sj and θj)
relative to the fixed point F; sj is usually constant.
(ii) The hyoid has horizontal and vertical coordinates at point H, such that
below H the curve is a function of H alone. The hyoid does not
move much for vowels.
(iii) The tongue body outline is represented by a circle of moving centre
and fixed radius, with polar coordinates (sc and θc) referenced to FJ.
This makes its position dependent on jaw movement as well as moving
independently.
(iv) The tongue tip and blade move relative to the tongue body. The tip
appears to rotate about point B so is defined by polar coordinates (st
and θt) relative to B. The blade outline is a curve represented by a
radial coordinate. For vowels, this is simplified, where it is effectively
only a function of jaw and tongue body coordinates.
(v) The lips open and protrude relative to the jaw and maxilla. These
positions are described by the height and protrusion, respectively
and pi.
(vi) The velum opens for nasals, and may be ignored for vowel
production.
The anterior outline of the pharynx was observed to be controlled by the hyoid
and tongue body positions and this is incorporated in the model. The rigid outline
was accurately matched with X-ray tracings. By imposing a grid structure on the
resulting outline, the area function of the tract may be determined, with the
help of previously published data to closely match the vocal tract shape. Section
lengths of 0.875 cm are produced, with the number of discrete area sections (and
hence vocal tract length) dependent on the particular configuration. The acoustic
properties of the synthesiser (i.e. its transfer function) are discussed in Section 3.3.
2.4.2 Source Excitation for ASY
A time domain acoustic waveform, representing Ug, is used to excite the vocal
tract, which can be represented by time varying parameters. These parameters are
based on the Rosenberg model [18] of glottal pulse excitation, shown in Fig 2.5,
and are:
(i) pitch period T ( = 1 / fundamental frequency),
(ii) amplitude a,
(iii) duty cycle i.e. ratio of open time to pitch period, = Tp/T
(iv) speed ratio i.e. ratio of rise time to fall time, = Tp/Tn
Fig 2.5 Model for the glottal pulse [18]
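As an illustration of how these four parameters shape the excitation, the following Python sketch generates one pitch period of a glottal pulse. It uses the trigonometric pulse shape commonly attributed to Rosenberg; whether ASY uses exactly this shape is an assumption. The sketch also assumes that the open time is the sum of the rise time Tp and the fall time Tn, and the function name `rosenberg_pulse` is hypothetical.

```python
import math

def rosenberg_pulse(period, amplitude, duty_cycle, speed_ratio):
    """One pitch period of a Rosenberg-type glottal pulse, as a list of samples.

    period      : pitch period T, in samples
    amplitude   : peak amplitude a
    duty_cycle  : open time / T (open time assumed here to be Tp + Tn)
    speed_ratio : rise time / fall time = Tp / Tn
    """
    open_time = duty_cycle * period
    t_n = open_time / (1.0 + speed_ratio)  # fall (closing) time
    t_p = open_time - t_n                  # rise (opening) time
    pulse = []
    for n in range(period):
        if n <= t_p:
            # opening phase: raised cosine from 0 up to the peak
            g = 0.5 * amplitude * (1.0 - math.cos(math.pi * n / t_p))
        elif n <= t_p + t_n:
            # closing phase: quarter cosine from the peak down to 0
            g = amplitude * math.cos(0.5 * math.pi * (n - t_p) / t_n)
        else:
            g = 0.0                        # closed glottis interval
        pulse.append(g)
    return pulse

# Example values (illustrative only): T = 100 samples, duty cycle 0.6, speed ratio 2.
pulse = rosenberg_pulse(period=100, amplitude=1.0, duty_cycle=0.6, speed_ratio=2.0)
```

With these example values the pulse rises for 40 samples, falls for 20 samples, and is zero for the remaining closed interval.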
3. SOUND PROPAGATION IN THE VOCAL TRACT
3.1 Introduction
In this chapter, the acoustic properties of the vocal tract described in Chapter 2
are presented. Using various assumptions, the transfer function of the vocal tract
is derived, and reflection coefficients which describe the acoustic sound propagation
through the tube are derived in terms of the cross-sectional area of the tube. The
transfer function of the ASY synthesiser is then described.
3.2 Sound Propagation
To analyse the propagation of sound through it, the vocal tract is modelled as a
nonuniform tube of time-varying cross-section. For frequencies corresponding to
wavelengths that are long in comparison to the tract dimensions, plane wave
propagation of sound along its axis may be assumed. Assuming no viscous or
thermal conduction losses in the air or the tract walls, the sound waves in the
tube satisfy Portnoff’s equations [19]:
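$$ -\frac{\partial p}{\partial x} = \rho\,\frac{\partial (u/A)}{\partial t} \qquad (3.1a) $$
$$ -\frac{\partial u}{\partial x} = \frac{1}{\rho c^{2}}\,\frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t} \qquad (3.1b) $$
where c = velocity of sound;
p = p(x,t) = variation in sound pressure at position x and time t;
u = u(x,t) = corresponding variation in volume velocity;
ρ = density of air in the tube;
A = A(x,t) = cross-sectional area normal to the axis of the tube.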
Boundary conditions are imposed at both ends of the tube, accounting for sound
radiation at the lips and the nature of the excitation at the glottis.
Closed form solutions of eqns. (3.1) are not possible; however, numerical solutions
may be obtained. The area function A(x,t) must be known, whether from detailed
direct measurements, or from the speech wave. The solution is very
complicated, thus various assumptions are made.
The vocal tract is regarded as a series of tubes, each of constant cross-section.
As the vocal tract changes slowly, it is reasonable to assume the areas are
constant over a short space of time i.e. the analysis interval (20 - 30ms). Thus
for each section
A(x,t) = A = constant. (3.2)
Thus for the mth uniform tube, eqns. (3.1) are simplified into difference equations
to give a solution of the form:
$$ u_m(x,t) = u_m^{+}(t - x/c) - u_m^{-}(t + x/c) \qquad (3.3a) $$
$$ p_m(x,t) = p_m^{+}(t - x/c) + p_m^{-}(t + x/c) \qquad (3.3b) $$
which are interpreted as forward and backward travelling waves, with the centre of
each section defined as x = 0, as shown in Fig 3.1.
Fig 3.1 Forward and reverse volume velocity waves in section m (glottis at left, lips at right; the section spans x = -l/2 to x = l/2).
From eqn. (3.3), and using Portnoff’s equations, a relationship between pressure
and volume velocity may be derived, i.e.
$$ p_m(x,t) = \frac{\rho c}{A_m}\left[ u_m^{+}(t - x/c) + u_m^{-}(t + x/c) \right] \qquad (3.4) $$
Fig 3.2 Continuity conditions for volume velocity between section m (area Am) and
section m - 1 (area Am-1).
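Examining the continuity conditions at the boundary between sections, shown in Fig 3.2,
and defining the time taken for a wave to propagate half way along a section of length l as
$$ \tau = \frac{l}{2c} \qquad (3.5) $$
it follows that
$$ u_m^{+}(t-\tau) - u_m^{-}(t+\tau) = u_{m-1}^{+}(t+\tau) - u_{m-1}^{-}(t-\tau) $$
$$ \frac{1}{A_m}\left[ u_m^{+}(t-\tau) + u_m^{-}(t+\tau) \right] = \frac{1}{A_{m-1}}\left[ u_{m-1}^{+}(t+\tau) + u_{m-1}^{-}(t-\tau) \right] \qquad (3.6) $$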
and from this a reflection coefficient may be defined:
$$ \mu_m = \frac{A_{m-1} - A_m}{A_{m-1} + A_m} \qquad (3.7) $$
or
$$ \frac{A_m}{A_{m-1}} = \frac{1 - \mu_m}{1 + \mu_m} \qquad (3.8) $$
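The correspondence between section areas and reflection coefficients can be checked numerically. The Python sketch below assumes the definitions mu_m = (A_{m-1} - A_m) / (A_{m-1} + A_m) and A_m / A_{m-1} = (1 - mu_m) / (1 + mu_m), with section 0 taken as the section nearest the lips; the function names and the example areas are hypothetical.

```python
def reflection_coefficients(areas):
    """mu_m = (A_{m-1} - A_m) / (A_{m-1} + A_m) for m = 1 .. len(areas) - 1.

    areas[0] is assumed to be the section nearest the lips.
    """
    return [(areas[m - 1] - areas[m]) / (areas[m - 1] + areas[m])
            for m in range(1, len(areas))]

def areas_from_reflection(first_area, mus):
    """Rebuild the areas using A_m = A_{m-1} * (1 - mu_m) / (1 + mu_m)."""
    areas = [first_area]
    for mu in mus:
        areas.append(areas[-1] * (1.0 - mu) / (1.0 + mu))
    return areas

# Illustrative areas in cm^2 (not measured data).
example_areas = [4.0, 2.0, 1.0, 3.0]
mus = reflection_coefficients(example_areas)
rebuilt = areas_from_reflection(example_areas[0], mus)
```

Only area ratios are carried by the reflection coefficients, which is why one reference area has to be supplied when the area function is rebuilt.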
For the tubes at either end, boundary conditions are imposed. From Wakita [20],
the acoustic tube is assumed open at the lips, i.e. zero radiation impedance, so that
$$ \mu_0 = 1 \qquad (3.9) $$
From this, the volume velocity at the lips is
$$ u_L(t) = 2\,u_0^{+}(t - \tau) \qquad (3.10) $$
At the glottis end, assuming a volume velocity Ug(t) with source impedance Zg,
the glottal area is defined as
$$ A_M = \frac{\rho c}{Z_g} \qquad (3.11) $$
3.2.1 Transfer Function in the Sampled Time Domain
The transfer function of the vocal tract will now be developed in terms of μ.
Defining
$$ c_m = \prod_{i=1}^{m} (1 + \mu_i), \quad m > 0, \qquad c_0 = 1 \qquad (3.12a) $$
and
$$ \tau_m = 2(m + 1)\,\tau \qquad (3.12b) $$
a new variable {y} is introduced such that
(3.13a)
(3.13b)
Sampling at T = 4τ, manipulating and obtaining the Z-transforms of eqn. (3.13)
results in
(3.14a)
and
(3.14b)
These expressions will be used to establish a relationship between the acoustic
tube model presented here and the LPC model of Chapter 4.
3.3 Transfer Function of the ASY Synthesiser
The transfer function for ASY is derived in a similar fashion to the model above.
However, its boundary conditions are different, and it incorporates propagation
losses dependent on each cross-section area. The attenuation (propagation loss) α is
defined as
$$ \alpha^{1/2} = 1 - 0.007\,(A)^{1/2} \qquad (3.15) $$
Non-ideal terminations are accurately accounted for. The radiation at the lips is
represented by a non-zero radiation impedance Zr which consists of a parallel RL
circuit, i.e.
Zr [ 2/R + 0.7 (1 - z^-1) ] (3.16)
where
R = effective radius of lips = (A0 / π)^(1/2).
The glottal impedance Zg is modelled by a series RL circuit, where Rg and Lg
are dependent on the glottal area, and averaged over an interval, similar to the
glottal impedance discussed in Section 2.3.1. They are adjusted to account for
effects of yielding vocal tract walls. In the default state, Rg = 50 Ω and Lg =
1200 Ω.
4. LINEAR PREDICTIVE CODING OF SPEECH
4.1 Introduction
In this chapter, the fundamental concept of Linear Predictive Coding (LPC) is
introduced, and its suitability to speech acoustics discussed. The basic equations
of LPC are derived, and various formulations are presented for their solution. In
particular, solutions for the autocorrelation and covariance methods are derived, and
their relationship to the lattice method is established. The lattice
formulation, by showing how the area functions of the vocal tract may be
obtained from its results, then unifies the acoustic tube model of Chapter 3 and the
waveform analysis here. These solutions form the basis of the analysis of the
next chapter.
4.2 LPC Model for Speech Production
In order to efficiently analyse speech at an acoustic level, a knowledge of speech
production is essential. A suitable model of speech production is presented here
which leads to linear predictive analysis of the speech waveform.
Speech waveforms are the result of the vocal tract being acoustically excited. The
vocal tract may be represented by a slowly time varying linear filter. For most
sounds, particularly voiced, the tract changes slowly, and the speech may be
considered to be stationary over a short interval (e.g. up to 20ms). For this
reason, it may be modelled by a digital filter, whose parameters are updated at
regular intervals. The tract is excited by the volume velocity waveform
from the glottis. In the case of voiced speech, this wave is smooth and periodic,
whereas for unvoiced speech, it corresponds to random white noise. This source -
filter model of speech production, shown in Fig 4.1, leads to a simple and
effective method of speech synthesis and coding.
Fig. 4.1 Source-Filter Model of Speech Production (voiced glottal excitation or "white" noise excitation drives a time-varying digital filter, set by the vocal tract parameters, to produce s(n))
For LPC, the model may be further simplified by combining the
spectral contributions of glottal flow, the vocal tract, and radiation at the lips into
a single time varying all pole filter (see Fig 4.2). The filter is excited by either
a series of periodic pulses generated by the vocal cords (in the case of voiced
speech), or random noise (unvoiced). Thus the difficult problem of separating the
source from the speech spectrum is bypassed.
Fig. 4.2 Linear Prediction Model of Speech Production (an impulse train generator or a random noise generator, scaled by the gain G, drives a time-varying all-pole digital filter to produce s(n))
An all-pole filter model represents the vocal tract quite accurately and extra poles
compensate for zeros in the spectrum (which occur in nasals). By avoiding zeros,
the filter parameters may be readily determined.
Thus the transfer function of the all-pole filter is of the form
$$H(z) = \frac{1}{1 - \sum_{k=1}^{M} a_k z^{-k}} \tag{4.1}$$
where $\{a_k\}$ are the coefficients of the digital filter.
In the time domain, the speech samples s(n) are related to the excitation u(n) by
the simple difference equation
$$s(n) = \sum_{k=1}^{M} a_k\, s(n-k) + G\, u(n) \tag{4.2}$$
where G is the gain.
This is in the form of a linear predictor i.e. the essence of LPC is that, due to
the high correlation between adjacent speech samples, a sample s(n) may be
approximated as a linear combination of previous samples i.e.
$$\hat{s}(n) = \sum_{k=1}^{M} \alpha_k\, s(n-k) \tag{4.3}$$
Using this approximation, the prediction error is
$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{M} \alpha_k\, s(n-k) \tag{4.4}$$
If $\alpha_k = a_k$, then e(n) = G u(n) is the output of a system having transfer
function
$$A(z) = 1 - \sum_{k=1}^{M} a_k z^{-k} \tag{4.5}$$
and since e(n) = G u(n), the prediction error filter is an inverse filter for H(z):
$$H(z) = \frac{1}{A(z)} \tag{4.6}$$
$$\frac{S(z)}{E(z)} = H(z) \tag{4.7}$$
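As an illustrative sketch (not the implementation used in this work), eqns. (4.2) and (4.4) can be exercised directly: an all-pole filter is driven by a pulse train, and the prediction error filter A(z) then recovers G u(n). The second-order coefficient values, the pulse period and the function names below are assumptions chosen only for the example.

```python
def synthesize(a, excitation, gain=1.0):
    """All-pole synthesis, eqn (4.2): s(n) = sum_k a_k s(n-k) + G u(n)."""
    s = []
    for n, u in enumerate(excitation):
        acc = gain * u
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:          # samples before n = 0 are taken as zero
                acc += ak * s[n - k]
        s.append(acc)
    return s


def prediction_error(a, s):
    """Inverse filter A(z), eqn (4.4): e(n) = s(n) - sum_k a_k s(n-k)."""
    e = []
    for n in range(len(s)):
        pred = sum(ak * s[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0)
        e.append(s[n] - pred)
    return e


# Hypothetical stable second-order filter (pole radius about 0.85)
a = [1.2, -0.72]
# Idealised voiced excitation: unit pulses every 8 samples
u = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]
s = synthesize(a, u, gain=0.5)
e = prediction_error(a, s)   # should equal 0.5 * u up to rounding
```

When the predictor coefficients equal the filter coefficients, the residual is simply the scaled excitation, which is the sense in which A(z) is an inverse filter for H(z).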
Thus the basic problem of LPC is to find a set of predictor coefficients which
minimise the mean squared error over a finite interval. These coefficients are
obtained by partially differentiating
$$E = \sum_{n} e(n)^2 \tag{4.8a}$$
with respect to each $a_k$ and setting the result equal to zero, i.e.
$$\frac{\partial E}{\partial a_k} = 0, \qquad 1 \le k \le M \tag{4.8b}$$
This leads to a set of simultaneous linear equations:
$$\sum_{k=1}^{M} a_k \sum_{n} s(n-k)\, s(n-i) = \sum_{n} s(n-i)\, s(n), \qquad 1 \le i \le M \tag{4.9}$$
Defining
$$\Psi(i,k) = \sum_{n} s(n-i)\, s(n-k) \tag{4.10}$$
this can be simplified to
$$\sum_{k=1}^{M} a_k\, \Psi(i,k) = \Psi(i,0), \qquad 1 \le i \le M \tag{4.11}$$
4.3 Solution of LPC
Ideally, the mean squared error of eqn. (4.8) should be minimized over an infinite
interval, but this cannot be done in practice. The choice of the range over which
the error is minimized leads to separate approaches to LPC. Many different
methods exist for solving eqn. (4.11). Four are discussed in this research:
(i) Prony’s method [21]
(ii) Autocorrelation method [21,22]
(iii) Covariance method [23]
(iv) Lattice method [24]
4.3.1 Prony’s method
Prony’s method is very old, and is important in understanding linear prediction of
speech as it shows explicitly how the voiced speech model may be represented by
complex exponentials in the time domain.
The speech model during voicing corresponds to a sequence of unit samples
(separated by the pitch period) driving an all-pole filter 1/A(z). If transients from
preceding pitch periods are ignored, voiced speech samples during the period will
be proportional to the unit sample response of an all pole filter. Thus the sampled
speech data {s(n)} may be modelled as a linear combination of M complex
exponentials, i.e.
$$s(n) = \sum_{i=1}^{M} u_i\, z_i^{\,n} \tag{4.12}$$
where $z_i$, i = 1,...,M are the roots (zeros) of A(z):
$$A(z_i) = 0, \qquad i = 1, \ldots, M \tag{4.13}$$
and
$$S(z) = \frac{E(z)}{A(z)} = \frac{1}{A(z)} \tag{4.14}$$
If speech were precisely representable by the model of eqn. (4.12), the unknowns
$u_i$ and $z_i$ (2M in number) could be obtained by solving a set of 2M
simultaneous equations. Thus if a signal s(n) is composed of precisely M
complex exponentials, then 2M samples suffice to exactly determine the model
parameters.
As M becomes large, the solution to these equations becomes unwieldy. To
avoid solving them, another approach is to write
$$S(z)\, A(z) = P(z) \tag{4.15}$$
or
$$\sum_{i=0}^{M} a_i\, s(n-i) = \sum_{i=0}^{M-1} p_i\, \delta_{n,i} \tag{4.16}$$
$$\sum_{i=0}^{M} a_i\, s(n-i) = 0, \qquad n = M, \ldots, N-1 \tag{4.17}$$
To account for the possibility that the model may not exactly represent a single
pitch period of real speech, an error term is introduced so
$$\sum_{i=0}^{M} a_i\, s(n-i) = e(n), \qquad a_0 = 1 \tag{4.18}$$
With the driving sequence e(n) =
So $\{a_i\}$ are obtained by minimizing the squared error $\alpha$, where
$$\alpha = \sum_{n=M}^{N-1} e(n)^2 \tag{4.19}$$
which will be shown to be the same result as the covariance method. In this
case, zeros are also allowed i.e. P(z) is not necessarily equal to 1.
4.3.2 Autocorrelation method
In this method, the speech samples are assumed to be zero outside a certain
interval of length N, i.e. a windowing procedure is used:
$$s_w(n) = s(n)\, w(n) \tag{4.20a}$$
where
$$w(n) = 0, \qquad n < 0 \ \text{and} \ n > N - 1 \tag{4.20b}$$
In this case the limits of summation for E are
$$E = \sum_{n=0}^{N+M-1} e(n)^2 \tag{4.21}$$
It can be shown that $\Psi(i,k) = R(i-k)$ where R(k), the short time autocorrelation
function, is defined as
$$R(k) = \sum_{n=0}^{N+M-1} s_w(n)\, s_w(n+k) \tag{4.22}$$
Since R(i-k) = R(k-i), eqn. (4.11) is simplified to
$$\sum_{k=1}^{M} a_k\, R(i-k) = R(i), \qquad 1 \le i \le M \tag{4.23}$$
In matrix form this is
$$\begin{bmatrix} R(0) & R(1) & R(2) & \cdots & R(M-1) \\ R(1) & R(0) & R(1) & \cdots & R(M-2) \\ R(2) & R(1) & R(0) & \cdots & R(M-3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R(M-1) & R(M-2) & R(M-3) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_M \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ R(3) \\ \vdots \\ R(M) \end{bmatrix} \tag{4.24}$$
4.3.2.1 Solution of the autocorrelation method
The solution of eqn (4.23) is obtained by exploiting the fact that the
autocorrelation matrix is a Toeplitz matrix i.e. it is symmetric and all its elements
along a given diagonal are equal. Thus, an efficient algorithm may be used for
its solution. Many have been proposed [25,26], the most efficient being Durbin’s
recursive procedure [25]. This may be stated as follows:
$$E^{(0)} = R(0) \tag{4.25a}$$
$$k_i = \left[ R(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R(i-j) \right] \Big/ E^{(i-1)} \tag{4.25b}$$
$$a_i^{(i)} = k_i \tag{4.25c}$$
$$a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1 \tag{4.25d}$$
$$E^{(i)} = (1 - k_i^2)\, E^{(i-1)} \tag{4.25e}$$
Solving these equations recursively for $1 \le i \le M$, the final solution is
$$a_j = a_j^{(M)}, \qquad 1 \le j \le M \tag{4.25f}$$
The quantity $E^{(i)}$ is the mean squared prediction error for a predictor of order i.
It can be shown [21] that the quantities $k_i$ are bounded by unity i.e.
$$-1 \le k_i \le 1 \tag{4.26}$$
and this is a necessary and sufficient condition for the filter 1/A(z) to be stable
i.e. for all the roots of A(z) to be inside the unit circle.
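The recursion of eqns. (4.25a) to (4.25f) can be sketched in a few lines. This is a minimal illustration under the sign convention of eqn. (4.5), not the 'C' implementation used in this work; the function names and the example autocorrelation values are assumptions.

```python
def autocorrelation(s, max_lag):
    """Short-time autocorrelation R(0..max_lag) of an already windowed frame."""
    n = len(s)
    return [sum(s[i] * s[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def durbin(r, order):
    """Durbin's recursion, eqns (4.25a)-(4.25f).

    Returns (a, k, err): predictor coefficients a_1..a_M, reflection
    (PARCOR) coefficients k_1..k_M and the final prediction error E^(M).
    """
    a = [0.0] * (order + 1)          # a[j] holds a_j; index 0 is unused
    k = [0.0] * (order + 1)
    err = r[0]                       # (4.25a)
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k[i] = acc / err             # (4.25b)
        new_a = a[:]
        new_a[i] = k[i]              # (4.25c)
        for j in range(1, i):
            new_a[j] = a[j] - k[i] * a[i - j]   # (4.25d)
        a = new_a
        err *= 1.0 - k[i] * k[i]     # (4.25e)
    return a[1:], k[1:], err


# Hypothetical autocorrelation values for a second-order example
r = [1.0, 0.8, 0.5]
a, k, err = durbin(r, 2)
```

The returned coefficients satisfy the normal equations (4.23), and the reflection coefficients can be checked against the bound of eqn. (4.26).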
4.3.2.2 Choice of Window
Because of its assumption of zero valued samples outside the analysis interval, the
autocorrelation method needs a window. The ideal window should have a high
frequency resolution (i.e. its main lobe should be narrow and sharp) and small
spurious distortion outside of this lobe (sharp drop off). A Hamming window [27]
is normally chosen as it has good frequency resolution and side lobes of less
than -40dB. It is of the form
$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \tag{4.27}$$
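For reference, eqn. (4.27) may be evaluated as in the short sketch below. The function names and the frame length of 21 samples are assumptions made for illustration only.

```python
import math


def hamming(num_points):
    """Hamming window of eqn (4.27) for n = 0 .. N-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_points - 1))
            for n in range(num_points)]


def apply_window(frame):
    """Multiply a frame by the window, as in eqn (4.20a)."""
    w = hamming(len(frame))
    return [x * wn for x, wn in zip(frame, w)]


w = hamming(21)   # tapers from 0.08 at the edges to 1.0 at the centre
```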
4.3.3 Covariance method
In this method, an interval of length N is also taken, but no assumptions are
made outside this interval, and no windowing is used. Thus E is taken over all
except the first M samples, so that samples outside the interval are not used i.e.
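$$E = \sum_{n=M}^{N-1} e(n)^2 \tag{4.28}$$
Here $\Psi(i,k)$ becomes
$$\Psi(i,k) = \sum_{n=M}^{N-1} s(n-i)\, s(n-k) \tag{4.29}$$
$\Psi(i,k)$ is a cross-correlation function, unlike the autocorrelation function used
earlier. From eqn. (4.29), eqn. (4.11) may be written as
$$\sum_{k=1}^{M} a_k\, \Psi(i,k) = \Psi(i,0) \tag{4.30}$$
which in matrix form is
$$\begin{bmatrix} \Psi(1,1) & \Psi(1,2) & \cdots & \Psi(1,M) \\ \Psi(2,1) & \Psi(2,2) & \cdots & \Psi(2,M) \\ \Psi(3,1) & \Psi(3,2) & \cdots & \Psi(3,M) \\ \vdots & \vdots & \ddots & \vdots \\ \Psi(M,1) & \Psi(M,2) & \cdots & \Psi(M,M) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_M \end{bmatrix} = \begin{bmatrix} \Psi(1,0) \\ \Psi(2,0) \\ \Psi(3,0) \\ \vdots \\ \Psi(M,0) \end{bmatrix} \tag{4.31}$$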
4.3.3.1 Solution of the covariance method
The matrix above is symmetric (but not Toeplitz), and has the properties of a
covariance matrix, hence the name. An algorithm known as Cholesky
decomposition [28] is used here by noting that the covariance matrix $\Psi$ is a
positive definite symmetric matrix. It may be expressed in the form
$$\Psi = V D V^{T} \tag{4.32}$$
where V is a lower triangular matrix (with unit diagonal) and D a diagonal matrix.
These are readily determined from above by solving for the (i,j)th element on both
sides of eqn. (4.32), giving
$$V_{ij}\, d_j = \Psi(i,j) - \sum_{k=1}^{j-1} V_{ik}\, d_k\, V_{jk}, \qquad 1 \le j \le i-1 \tag{4.33}$$
and for the diagonal elements
$$d_i = \Psi(i,i) - \sum_{k=1}^{i-1} V_{ik}^{2}\, d_k, \qquad i \ge 2 \tag{4.34a}$$
$$d_1 = \Psi(1,1) \tag{4.34b}$$
Once V and D are determined, a two step procedure is used to solve for {a}.
Writing $\boldsymbol{\psi}$ for the right-hand-side vector with elements $\psi_i = \Psi(i,0)$,
eqn. (4.30) becomes
$$V D V^{T} \mathbf{a} = \boldsymbol{\psi} \tag{4.35a}$$
which may be written as
$$V^{T} \mathbf{a} = D^{-1} \mathbf{Y} \tag{4.35b}$$
where $V \mathbf{Y} = \boldsymbol{\psi}$. Using a simple recursion, eqns. (4.35) may be solved for Y:
$$Y_i = \psi_i - \sum_{j=1}^{i-1} V_{ij}\, Y_j, \qquad M \ge i \ge 2 \tag{4.36a}$$
with
$$Y_1 = \psi_1 \tag{4.36b}$$
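The covariance solution can be sketched as follows. This is a minimal illustration assuming the V D V^T factorisation with a unit-diagonal V, followed by forward substitution for Y and a back substitution for the predictor coefficients (the back substitution is standard, but is an addition not written out above). The function names and the test signal are hypothetical.

```python
def covariance_matrix(s, order):
    """Psi(i,k) = sum_{n=M}^{N-1} s(n-i) s(n-k) for i, k = 0..M, eqn (4.29)."""
    n_samples = len(s)
    return [[sum(s[n - i] * s[n - k] for n in range(order, n_samples))
             for k in range(order + 1)] for i in range(order + 1)]


def covariance_lpc(s, order):
    """Solve eqn (4.30) by V D V^T decomposition; returns a_1..a_M."""
    psi = covariance_matrix(s, order)
    m = order
    v = [[0.0] * (m + 1) for _ in range(m + 1)]   # 1-based, unit diagonal
    d = [0.0] * (m + 1)
    for i in range(1, m + 1):
        for j in range(1, i):                      # off-diagonal elements
            acc = psi[i][j] - sum(v[i][k] * d[k] * v[j][k] for k in range(1, j))
            v[i][j] = acc / d[j]
        d[i] = psi[i][i] - sum(v[i][k] ** 2 * d[k] for k in range(1, i))
        v[i][i] = 1.0
    y = [0.0] * (m + 1)
    for i in range(1, m + 1):                      # forward: V Y = psi(., 0)
        y[i] = psi[i][0] - sum(v[i][j] * y[j] for j in range(1, i))
    a = [0.0] * (m + 1)
    for i in range(m, 0, -1):                      # backward: V^T a = D^-1 Y
        a[i] = y[i] / d[i] - sum(v[j][i] * a[j] for j in range(i + 1, m + 1))
    return a[1:]


# Freely decaying response of a hypothetical second-order all-pole filter
true_a = [1.2, -0.72]
s = [1.0, 1.2]
for n in range(2, 12):
    s.append(true_a[0] * s[n - 1] + true_a[1] * s[n - 2])
a = covariance_lpc(s, 2)   # recovers true_a up to rounding
```

Because the frame contains no excitation after the first M samples, the covariance method recovers the filter exactly, which is the property exploited later over the closed glottis interval.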
4.3.4 PARCOR analysis (lattice method)
This method shows that an intermediate set of parameters is obtainable from the
autocorrelation and covariance methods, thus presenting a unified approach to the
solutions. Partial autocorrelation (PARCOR) analysis has found uses in many
practical applications as it is less disturbed by quantization effects, and is not
dependent on the order of analysis used.
The PARCOR formulation defines both forward and backward prediction errors.
These are defined respectively as:
$$e_{f,t} = s_t - \hat{s}_t = s_t + \sum_{i=1}^{m} a_i\, s_{t-i} \tag{4.37a}$$
$$e_{b,t-(m+1)} = s_{t-(m+1)} - \hat{s}_{t-(m+1)} \tag{4.37b}$$
$$= s_{t-(m+1)} + \sum_{j=1}^{m} b_j\, s_{t-j} \tag{4.37c}$$
$$= \sum_{j=1}^{m+1} b_j\, s_{t-j}, \qquad b_{m+1} = 1 \tag{4.37d}$$
When $\{s_t\}$ is stationary,
$$b_j = a_{m+1-j}, \qquad j = 1, \ldots, m+1 \tag{4.38}$$
with $a_0 = 1$.
PARCOR is defined as the correlation between the residual waves that remain
after subtracting the parts predictable from the data between the two
samples, i.e.
$$k_{m+1} = \frac{E[\, e_{f,t}\; e_{b,t-(m+1)} \,]}{\{E[\, e_{f,t}^{2} \,]\}^{1/2}\, \{E[\, e_{b,t-(m+1)}^{2} \,]\}^{1/2}} \tag{4.39}$$
From the above, it can be shown that the relationship between $a_i$ and $k_i$ is:
$$a_i^{(m+1)} = a_i^{(m)} - k_{m+1}\, a_{m+1-i}^{(m)} \tag{4.40}$$
and hence, using earlier formulae,
$$A_{m+1}(z) = A_m(z) - k_{m+1}\, B_m(z) \tag{4.41a}$$
$$B_{m+1}(z) = z^{-1} \left[ B_m(z) - k_{m+1}\, A_m(z) \right] \tag{4.41b}$$
These relations are used to recursively calculate $A_m(z)$. With $A_0(z) = 1$, the inverse
filter in terms of $\{B_i(z)\}$ is
$$A_m(z) = 1 - \sum_{i=1}^{m} k_i\, B_{i-1}(z) \tag{4.42}$$
Thus, the PARCOR coefficients are derived sequentially in a multi-stage lattice
circuit, as shown in Fig. 4.3, hence the name lattice method.
It can be shown by direct substitution that the parameters $k_i$ are identical to those
obtained from Durbin’s recursion. For the covariance method, they may be
determined from a step-up procedure using eqn (4.42).
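A step-up recursion of this kind can be sketched as below. The sketch assumes the PARCOR sign convention in which the inverse filter is written A_m(z) = 1 + sum a_i z^-i and each stage applies a_i^(m+1) = a_i^(m) - k_(m+1) a_(m+1-i)^(m) with a_0 = 1; if the convention intended in the text differs, the signs change accordingly. The function name and the example values are hypothetical.

```python
def step_up(k):
    """Build the inverse filter coefficients [1, a_1, ..., a_M] from k_1..k_M.

    Assumed convention: A_m(z) = 1 + sum_i a_i z^-i, and at each stage
    a_i^(m+1) = a_i^(m) - k_(m+1) * a_(m+1-i)^(m), with a_0 = 1.
    """
    a = [1.0]
    for km in k:
        m = len(a) - 1
        ext = a + [0.0]                       # a_(m+1)^(m) is zero
        a = [1.0] + [ext[i] - km * ext[m + 1 - i] for i in range(1, m + 2)]
    return a


# Reflection coefficients from a hypothetical second-order analysis
k = [0.8, -0.14 / 0.36]
inverse_filter = step_up(k)    # coefficients of 1, z^-1 and z^-2 in A(z)
```

With these values the result agrees with Durbin's recursion applied to R = [1.0, 0.8, 0.5], once the sign difference between the two conventions is taken into account.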
Fig 4.3 Inverse Filter A(z) in the PARCOR formulation
4.4 Relationship between PARCOR analysis and the acoustic tube model
The problem of extracting the vocal tract shape from the acoustic speech
waveform has been the subject of much research. From Atal [29], the areas of the
acoustic tube model presented earlier can be extracted from formant frequencies
and bandwidths, or from an all-pole transfer function. Wakita [20], using the
boundary conditions imposed in Section 3.2 in the previous chapter, showed that
the same acoustic tube model is equivalently represented by the inverse filter A(z).
This is shown by comparing eqns. (3.14) and (4.41). It can be seen that these
transfer functions are equivalent under the following conditions:
(i) $\mu_i = k_i$ (4.43)
(ii) The order of the inverse filter A(z), M, equals the number of acoustic
tube sections, M.
(iii) The sampling rate, $f_s$, must be the same for both analyses. From eqn.
(3.9), this means
$$f_s = \frac{M c}{2 L} \tag{4.44}$$
(iv) The effect of glottal and radiation characteristics must be removed from
the speech waveform before LPC analysis is carried out. This is
illustrated in Fig. 4.4, which shows a typical glottal waveform obtained
from LPC inverse filtering with no preemphasis. For analysis purposes,
the vocal tract system is assumed linear, and ideal boundary conditions
are assumed, so these effects have to be removed separately. Methods
for removing them are discussed in Chapter 5.
Thus, the reflection coefficients which define the area ratios of the tube may be
obtained directly from the speech waveform.
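As a rough worked example, eqn. (4.44) can be evaluated numerically and a set of reflection coefficients turned into relative section areas. The speed of sound (35000 cm/s) and the tract length below are assumed illustrative values, not figures taken from this work. The area helper assumes the convention k_i = (A_i - A_(i+1)) / (A_i + A_(i+1)); the convention fixed by eqn. (3.14) may differ in sign or in the ordering of the sections.

```python
def sampling_rate(num_sections, tract_length_cm, sound_speed_cm_s=35000.0):
    """Eqn (4.44): f_s = M c / (2 L), in Hz for lengths in cm."""
    return num_sections * sound_speed_cm_s / (2.0 * tract_length_cm)


def relative_areas(k, first_area=1.0):
    """Relative tube areas from reflection coefficients.

    Assumed convention (hypothetical): k_i = (A_i - A_(i+1)) / (A_i + A_(i+1)),
    so that A_(i+1) = A_i * (1 - k_i) / (1 + k_i).
    """
    areas = [first_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return areas


fs = sampling_rate(8, 17.5)            # 8 sections, assumed 17.5 cm tract
areas = relative_areas([0.2, -0.3])    # hypothetical reflection coefficients
```

Under the same assumed speed of sound, an order of M = 8 at a sampling rate of 7.5 kHz would correspond to a tract length of about 18.7 cm.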
Fig 4.4 Glottal Waveform Obtained using no preemphasis
5. ESTIMATION OF THE VOCAL TRACT TRANSFER FUNCTION.
5.1 Introduction
In this chapter, the limitations of the linear prediction model of speech production
are discussed, and a more realistic model is introduced. Two classes of methods,
based on LPC analysis, for extracting the vocal tract transfer function are
proposed. The first type, known as inverse filtering, is based on the
autocorrelation method with preemphasis. The second is based on the covariance
method over the closed glottis interval. The inadequacies of existing procedures
for the covariance method are discussed, and a new algorithm is proposed.
Procedures for both methods are outlined, including a robust algorithm for
extracting the glottal parameters. Then results for both methods are presented, and
a qualitative comparison is made. The effects of source tract interaction on both
methods are discussed.
5.2 A New Model of Speech Production
The speech production model of Chapter 4 for LPC analysis is rather simplistic.
The system function H(z) is obtained under the assumption of a voice source with
a flat spectrum. Thus it does not directly correspond to the vocal tract transfer
function. A more accurate speech production model is shown in Fig 5.1.
Fig 5.1 Improved Linear Speech Production Model
The quantities are as follows:
E(z) <-> e(n) = glottal excitation model input
$U_G(z)$ <-> $u_G(n)$ = glottal volume velocity
$U_L(z)$ <-> $u_L(n)$ = lip volume velocity
S(z) <-> s(n) = speech pressure wave
e(n) is a mathematical input to a glottal model filter G(z) to generate $u_G(n)$. For
voiced sounds, e(n) is taken to be a periodic train of pulses, and is the usual
LPC input.
Thus
$$S(z) = G(z)\, V(z)\, R(z)\, E(z) \tag{5.1a}$$
$$= G(z)\, V(z)\, R(z) \quad \text{since } E(z) = 1 \tag{5.1b}$$
where the corresponding system functions are
G(z) <-> source generation
V(z) <-> vocal tract resonance
R(z) <-> radiation from the lips.
Comparing this with the LPC model of eqn. (4.7),
$$H(z) = G(z)\, V(z)\, R(z) \tag{5.2}$$
Thus, to obtain V(z), R(z) and G(z) have to be removed. Once V(z) is determined
(as discussed in the next section), the corresponding glottal waveform $U_G(z)$ =
G(z) may be extracted by first inverse filtering to obtain
$$\frac{H(z)}{V(z)} = U_G(z)\, R(z) \tag{5.3}$$
and then approximating R(z) as
$$R(z) = 1 - z^{-1} \tag{5.4}$$
i.e. integrate the residual to obtain the glottal waveform.
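The integration step can be illustrated with a short sketch: a running sum is the exact inverse of the first difference R(z) = 1 - z^-1. The function names and the triangular test pulse are assumptions for the example; a practical implementation might well use a slightly leaky integrator instead, which is not shown here.

```python
def differentiate(x):
    """Apply R(z) = 1 - z^-1 (first difference), taking x(-1) = 0."""
    return [x[n] - (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def integrate(x):
    """Apply 1 / (1 - z^-1), i.e. a running sum, undoing differentiate()."""
    out, acc = [], 0.0
    for value in x:
        acc += value
        out.append(acc)
    return out


# Hypothetical glottal pulse: slow rise, faster fall, then closed phase
pulse = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5, 0.0, 0.0, 0.0]
residual = differentiate(pulse)     # what remains after removing V(z)
recovered = integrate(residual)     # glottal waveform estimate
```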
5.3 Methods for extracting the vocal tract transfer function
To extract V(z), and hence the true area function, the effects of glottal and
radiation characteristics have to be removed. Two main methods exist for
estimating V(z) accurately:
(i) Inverse filtering (possibly adaptive), followed by the autocorrelation
method.
(ii) Covariance analysis over the closed glottis interval.
5.3.1 Inverse filtering methods.
The pre-processing of the speech signal to remove the effects of G(z) and R(z) is
referred to as inverse filtering. Roughly speaking, the source frequency
characteristic is -12dB/oct and the radiation is +6dB/oct. Thus H(z) has an
approximately -6dB/oct low pass filtering characteristic. To flatten the gross
spectral character, the following methods have been proposed:
(i) First order differentiation:
This involves taking a straight difference i.e.
$$y_t = x_t - x_{t-1} \tag{5.5a}$$
i.e.
$$F(z) = 1 - z^{-1} \tag{5.5b}$$
(ii) Adaptive first-order inverse filtering:
This uses a first order filter of the form
$$F(z) = 1 - k_1 z^{-1} \tag{5.6}$$
where $k_1$ is the first PARCOR coefficient. This may be improved by repeated
adaptive inverse filtering until $k_1$ becomes sufficiently small i.e.
$$F_i(z) = F_{i-1}(z) \left( 1 - k_1^{(i-1)} z^{-1} \right) \tag{5.7}$$
(iii) Adaptive multi - order inverse filtering:
A comprehensive method has been proposed by Nakajima [30]. It uses a five
stage filter, as shown in Fig 5.2. {e} are correlation coefficients determined from
the waveform at each stage. The first, second and fourth stages are second order
filters which compensate for radiation and source characteristics, while the third
(second order), and fifth (third order) stage filters, compensate for the characteristic
curvature of the spectrum envelope. Though rather empirically derived, this
technique is reported to have yielded very accurate area functions.
Fig 5.2 Adaptive Multi-Order Multi-Stage Filter.
(iv) Pitch synchronous first order inverse filtering.
In this method [31], the radiation effect is first removed by the preemphasis
method (i), i.e. straight differentiation. Then an analysis frame centred at glottal
closure is taken to determine V(z). The motivation for this is seen by looking at
Fig 5.3. Fig 5.3a shows an idealized glottal waveform. This waveform is
effectively differentiated once during the speech production process (due to lip
radiation), and once during preemphasis. Thus the source contribution to the
output speech waveform is shown in Fig. 5.3b. Large impulses occur at glottal
closure, with smaller peaks at opening. The difference in peak size is due to the
fact that the glottal waveform at closure is far steeper than at opening. If
secondary peaks are ignored, the source contribution in a frame centred at the
closure peak will have a flat spectrum (i.e. that of an impulse), so the spectrum
obtained by analysing this frame will be that of the vocal tract alone.
Fig 5.3 (a) Idealized Glottal Waveform, $U_G$; (b) $U_G$ differentiated twice
By applying a Hamming window, the peaks at opening will be attenuated even
further, reinforcing the validity of the proposal. In order to further reduce the
effects of the opening location, a frame slightly shorter than a pitch frame
may be used.
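Methods (i) and (ii) above can be sketched as follows. This is a minimal illustration in which the first PARCOR coefficient is estimated as R(1)/R(0); the stopping tolerance, the maximum number of passes, the function names and the test signal are all assumptions rather than values used in this work.

```python
def first_difference(x):
    """Method (i): F(z) = 1 - z^-1."""
    return [x[n] - (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def first_parcor(x):
    """Estimate k_1 = R(1) / R(0) for a frame."""
    r0 = sum(v * v for v in x)
    r1 = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    return r1 / r0 if r0 > 0.0 else 0.0


def adaptive_preemphasis(x, tol=0.05, max_passes=10):
    """Method (ii): repeat F(z) = 1 - k_1 z^-1 until |k_1| is below tol."""
    y = list(x)
    used = []
    for _ in range(max_passes):
        k1 = first_parcor(y)
        if abs(k1) < tol:
            break
        y = [y[n] - k1 * (y[n - 1] if n > 0 else 0.0) for n in range(len(y))]
        used.append(k1)
    return y, used


# Hypothetical signal with a strong low-pass tilt
x = [0.9 ** n for n in range(200)]
y, used = adaptive_preemphasis(x)
```

For this test signal a single pass with k_1 close to 0.9 is enough to flatten the spectrum, after which the re-estimated k_1 falls below the tolerance.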
5.3.1.1 Experimental procedures
(i) Preemphasis over a long analysis frame
The algorithm for extracting V(z) by this method is shown in Fig 5.4. This
algorithm, and all those following, were implemented in ’C’ on a MicroVax
computer. The speech in this, and all other, cases was sampled at 7.5 kHz. An
initial estimate of the pitch is obtained to determine an appropriate analysis frame
length, as well as for estimating the glottal parameters later. The chosen
preemphasis is carried out, and the frame Hamming-windowed. An overlap of
half a frame is used. The LPC predictor coefficients are extracted using Durbin’s
recursion algorithm (autocorrelation method), as described earlier in Section 4.3.2,
with a filter order of M=8. These coefficients are then used in a direct form
all-zero filter, V(z), through which the unwindowed, unpreemphasised speech is
filtered to obtain the residual signal. This is then integrated over a pitch period,
chosen so that the approximate closure point (as determined from the maximum
value of the residual) is towards the end of the interval, so as to facilitate
extraction of the glottal parameters. The corresponding formants and bandwidths
are obtained using a root solving procedure for V(z). The area function is
obtained from the reflection coefficients of Durbin’s recursion.
(ii) Pitch synchronous method.
The algorithm for extracting V(z) by this method is shown in Fig 5.5. In this
method, an initial estimate of V(z) is first obtained, as in (i). Again the
maximum excitation of the residual is taken as the closure point. This is then
used as the centre of a pitch frame. The speech is preemphasised using method
(i) and a Hamming window applied, followed by Durbin’s recursion. The speech
is then inverse filtered to obtain the residual, which in turn is integrated to obtain
the glottal waveform, and its corresponding parameters. The formants, bandwidths
and area function are then extracted.
5.3.2 Covariance over the Closed Glottis Interval
The basis of this method is that the glottis, as discussed in Chapter 2, closes for
a significant portion of each pitch period.
Hence, for the model of Fig 5.1
$$u_G(n) = 0 \tag{5.8}$$
over the closed glottis interval (CGI).
Defining the effective driving function Q(z) as
$$Q(z) = U_G(z)\, R(z) \tag{5.9}$$
the speech production model is of the form:
$$s(n) = \sum_{k=1}^{M} a_k\, s(n-k) + q(n) \tag{5.10}$$
When the glottis closes, $u_G(n) = 0$, hence q(n) = 0, and
$$s(n) = \sum_{k=1}^{M} a_k\, s(n-k) \tag{5.11}$$
Thus one sample after closure, the waveform becomes a freely decaying oscillation
(as in Prony’s method). In practice there is an error term e(n), with the total
squared error defined as in the covariance method:
$$\alpha_M(n) = \sum_{j=n}^{n+N-M-1} e(j)^2 \tag{5.12}$$
where e(n) and $\alpha_M(n)$ are theoretically zero for $n \ge L_c + 1$ and $n + N - M < L_o$
($L_c$ and $L_o$ being the glottal closure and opening locations).
Usually, the normalized mean squared error $\eta(n)$ (NMSE) is used i.e.
$$\eta(n) = \frac{\alpha_M(n)}{\alpha_0(n)} \tag{5.13}$$
where $\alpha_0(n)$ is the input signal energy. The NMSE for the vowel /a/ is shown
in Fig 5.6.
Fig 5.6 Normalized Mean Squared Error for the vowel /a/
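The behaviour of the NMSE can be illustrated on a synthetic signal. The sketch below solves the covariance normal equations for each window position by plain Gaussian elimination (chosen here for brevity, in place of the Cholesky approach of Section 4.3.3.1) and takes the input energy over the same summation range as the error. The filter, pulse period, window length and function names are all assumptions.

```python
def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i][n] - sum(aug[i][c] * x[c] for c in range(i + 1, n))) / aug[i][i]
    return x


def nmse(s, start, frame_len, order):
    """eta(n) of eqn (5.13) for errors e(j), j = start .. start+N-M-1."""
    lo, hi = start, start + frame_len - order
    psi = [[sum(s[j - i] * s[j - k] for j in range(lo, hi))
            for k in range(order + 1)] for i in range(order + 1)]
    a = solve([row[1:] for row in psi[1:]], [row[0] for row in psi[1:]])
    alpha_m = psi[0][0] - sum(a[k - 1] * psi[0][k] for k in range(1, order + 1))
    return alpha_m / psi[0][0]


# Hypothetical voiced signal: unit pulses every 40 samples into a 2-pole filter
true_a = [1.2, -0.72]
s = []
for n in range(120):
    u = 1.0 if n % 40 == 0 else 0.0
    s.append(u + sum(ak * s[n - k] for k, ak in enumerate(true_a, start=1) if n - k >= 0))

eta_closed = nmse(s, 45, frame_len=12, order=2)   # window inside free decay
eta_pulse = nmse(s, 35, frame_len=12, order=2)    # window spanning a pulse
```

The error is essentially zero when the window lies wholly in the freely decaying region and rises sharply when it spans an excitation pulse, which is the behaviour exploited by the closure detection methods that follow.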
5.3.2.1 Existing Methods for obtaining glottal closure
Two methods have been postulated which extract the instant of glottal closure, and
hence the vocal tract filter over the CGI, using the NMSE. These are:
(i) Wong, Markel & Gray’s method [32]
(ii) Strube’s determinant method [33].
Both methods use the covariance method of analysis, performed sequentially over
the analysis frame. However, they differ in their interpretation of the NMSE.
(i) If the glottis is assumed closed for
$$L_c + 1 \le n < L_o \tag{5.14}$$
then q(n) = 0 over this interval, with initial conditions taken from $s(L_c)$. In this
method, the point of glottal closure is found by noting the first sample $n_1$ such
that $\eta_M(n_1) = 0$, or in practice below a certain threshold, dependent on the
minimum error. At the next sample $n_2$ where non-zero (or above threshold)
error occurs, the opening location is defined as
$$L_o = n_2 + N - M - 1 \tag{5.15}$$
Normally the segment taken for obtaining V(z) is the place of minimum error in
this interval.
(ii) Here, it is assumed that the vocal tract is most strongly excited at the instant
of glottal closure. This instant should correspond to the highest increase in
amplitude of the speech waveform, as the glottis closes far more abruptly than it
opens. The prediction error will be large at this point, followed by good
predictability, based on the speech being represented by freely decaying oscillations
after closure. Thus for a segment which contains the glottal closure, the NMSE
is maximum, after which it drops rapidly. This maximum corresponds to the
maximum value of the Gram determinant [33], i.e. the determinant of the
covariance matrix over the chosen interval.
The above methods were tested for various vowels from two different speakers.
A good indication of the accuracy of the transfer function obtained by any method
is the quality of the glottal waveform obtained after filtering. If the correct
transfer function has been extracted, the waveform should be smooth, contain no
ripple due to formant remnants, and in shape and appearance generally approach
an idealized glottal waveform, such as the one shown in Fig 5.3a. While the
methods extracted good glottal waveforms in some instances, there were also cases
where the methods failed.
These are illustrated in Figs. 5.7 and 5.8. Fig 5.7a shows the NMSE graph for
the vowel /er/ (T = pitch frame length). In this case, it can be seen that the
minimum error occurs during the open phase (shown at (b)), nowhere near closure,
which actually occurs shortly after the maximum drop (shown at (a)). So, using
method (i), the corresponding glottal waveform for /er/, shown in Fig 5.7b,
obtained by taking point (b) as the start of the CGI, is totally inaccurate.
Fig 5.7 (a) NMSE and (b) corresponding glottal waveform for the vowel
/er/ obtained using Wong, Markel and Gray’s Method
Fig 5.8 (a) NMSE and (b) corresponding glottal waveform for the vowel
/u/ obtained using Strube’s Method
Fig 5.8a shows the NMSE graph for the vowel /u/. In this case, the maximum
occurs at the beginning of the open phase (shown at (b)), and there are
spurious drops which do not coincide with glottal closure
(shown approximately at (a)). So, using method (ii), the glottal waveform for /u/,
shown in Fig 5.8b, obtained by taking point (b) as the beginning of the CGI,
is totally inaccurate.
It is obvious from the above that a method is required which takes into account
both the minimum and maximum errors, and their relative positions. Neither of
the above methods addresses the instability which often occurs with the covariance
method. Methods have been proposed for stabilizing the covariance result [34].
However, it is accepted that the area function obtained after stabilization is
meaningless, so doing this would be unacceptable for this research. A new
method is proposed here, which starting with a reasonable estimate of closure
location, extracts a very accurate stable vocal tract transfer function, and hence the
area function.
5.3.2.2 Algorithm for extracting the Optimum Location
The flow chart for extracting the optimum location is shown in Fig 5.9. Because
of the positioning of the pitch analysis frame, the point of closure should be
located in the first part of the frame. Initially the error range, i.e. the maximum,
$\eta_{max}$, and minimum, $\eta_{min}$, is obtained. From this the threshold value $\eta_{th}$ is
defined as
$$\eta_{th} = 2.0\, \eta_{min} \tag{5.16}$$
First the general location of the maximum drop is obtained by finding the
maximum of $\eta(n-5) - \eta(n)$ in the first half of the frame, such that
$$\eta(n-5) \ge 1.5\, \eta_{th} \tag{5.17}$$
Fig 5.9 Algorithm for extracting glottal closure location
Once the general area is established, the location $\delta$ of the maximum drop is
searched for in the immediate neighbourhood such that
$$\delta = \arg\max_{n}\, [\, \eta(n-1) - \eta(n) \,] \tag{5.18a}$$
and
$$\eta(n) \le 1.5\, \eta_{th} \tag{5.18b}$$
just after the maximum. If this is clearly defined, the optimum location, L, is
taken as
$$L = \delta + 4 \tag{5.18c}$$
Taking four samples after is advisable, as there may be an oscillatory effect at
exact closure [35]. Otherwise the error is smoothed in order to eliminate the
effects of any spurious rises above the threshold in the CGI to be located.
Starting at the first drop location found earlier, the first point to go below the
threshold is found, according to method (i). The interval below the threshold must
be at least M samples long, so the threshold may have to be increased slightly,
or decreased if the interval seems too long e.g. corresponding to an open quotient
≥ 0.8, which would be extremely rare. The minimum error, as near to the
beginning as possible, is chosen. This is to avoid the risk of entering the open
region, which is quite possible for short closures and large analysis lengths, as are
used in method (i). Including the open region would have a very detrimental
effect on the formant frequencies extracted.
Usually such a comprehensive method results in a stable filter. However in cases
where it does not, the immediate neighbourhood is searched for appropriate filter
coefficients.
5.3.2.3 Experimental Procedure
The flow chart of the analysis is shown in Fig 5.10. A block of speech is read
in and its pitch determined using the SIFT algorithm [21]. The value of pitch is
updated every third pitch frame. The approximate closure point is initially taken
as the maximum excitation of the speech signal. By starting at twenty samples
before this, the pitch frame used for analysing the NMSE will definitely include
the full closed glottis interval (CGI). Starting at this location, the speech is
preemphasised and sequential covariance analysis, using Cholesky decomposition, is
carried out. Preemphasis is used because it makes the drop in the NMSE more
pronounced. An interval of length N=22 is taken, with the order of analysis
M=8. Computation is saved by noting that each time a new covariance matrix is
required, only one new row is added, as only one sample is being advanced at a
time. The NMSE from each sample is saved in an array. At the end of the
pitch frame, this array is analysed to extract the location from which to determine
V(z).
For the chosen location, the LPC coefficients are determined, and the
corresponding filter stability is tested by obtaining the corresponding reflection
coefficients, according to eqn. (4.41), and checking for stability according to eqn.
(4.26). An alternative location in the immediate area is chosen, if necessary.
Often, by examining the stability of each location, long regions of stability may
be found, and this may be a good alternative indication of the closure region.
However, the additional computation is not worthwhile unless
absolutely necessary.
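The stability test described above can be sketched with a step-down recursion, which inverts eqn. (4.25d) to recover the reflection coefficients from a set of predictor coefficients and then applies the bound of eqn. (4.26) in its strict form. The sketch assumes the sign convention of eqn. (4.5); the function name and the example coefficients are hypothetical.

```python
def reflection_coefficients(a):
    """Step-down recursion: predictor a_1..a_M -> (k_1..k_M, stable flag).

    Inverts eqn (4.25d): a_j^(i-1) = (a_j^(i) + k_i a_(i-j)^(i)) / (1 - k_i^2).
    Stops early and reports instability as soon as |k_i| >= 1; in that case
    the lower-order entries of k are left at zero.
    """
    cur = list(a)
    k = [0.0] * len(a)
    for i in range(len(a), 0, -1):
        ki = cur[i - 1]                 # k_i = a_i^(i), eqn (4.25c)
        k[i - 1] = ki
        if abs(ki) >= 1.0:
            return k, False
        denom = 1.0 - ki * ki
        cur = [(cur[j - 1] + ki * cur[i - j - 1]) / denom for j in range(1, i)]
    return k, True


k_ok, stable_ok = reflection_coefficients([1.2, -0.72])    # poles inside unit circle
k_bad, stable_bad = reflection_coefficients([1.2, -1.1])   # poles outside unit circle
```

A frame whose coefficients fail this test would be rejected and a neighbouring location tried instead, as described above.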
The residual (error) signal q(n) is now obtained by passing the unpreemphasised
speech through the inverse filter V(z). In order to extract the glottal waveform
and corresponding parameters, an interval of at least two pitch periods is used.
The glottal waveform over an interval of a pitch period is obtained by integrating
q(n), such that a full glottal pulse is included i.e. the interval begins just before
opening, ending after closure. This simplifies extraction of the glottal parameters.
Three frames are analysed at a time, and the frame with the most realistic
parameters is used to code the speech. To decide which frame to use, the
formants, bandwidths and area functions are compared for the three cases. Frames
with abnormally large bandwidths, or unstable filters, are discarded immediately.
Of the remaining frames, the one with the most realistic looking area function is
selected; for example, an area function may have a large range (extreme values)
due to the small bandwidths extracted. If necessary, to ensure continuity between
frames, and reasonable vocal tract shapes, the bandwidths may be systematically
damped e.g. by 50Hz, i.e. a corresponding change in the filter parameters of
$$a_j' = a_j \exp(-50\, j \pi T) \tag{5.19}$$
where T is the sampling period, as suggested by Mallawany [36]. In fact much
research in formant analysis in the past has used default values for bandwidths
[12], so this is not unreasonable. In order to ensure continuity and stability,
Mallawany suggested using a special Hamming window on the covariance analysis
frame. However, when this was tried, it resulted in a total smearing of the
formants obtained. This was to be expected, as the advantage of the covariance
method (i.e. no need for windowing) was lost, making its accuracy no better,
and probably worse, than the autocorrelation method.
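The effect of eqn. (5.19) can be checked numerically. Scaling a_j by exp(-B j pi T) moves every pole radially inwards by the factor exp(-B pi T), which widens each formant bandwidth by B Hz under the usual relation bandwidth = -(f_s / pi) ln r for a pole of radius r. The sketch below uses a hypothetical second-order filter, with function names chosen for illustration.

```python
import math


def damp_bandwidths(a, extra_bw_hz, fs_hz):
    """Eqn (5.19): a_j' = a_j exp(-B j pi T), with T = 1 / f_s."""
    t = 1.0 / fs_hz
    return [aj * math.exp(-extra_bw_hz * j * math.pi * t)
            for j, aj in enumerate(a, start=1)]


def pole_pair_bandwidth_hz(a, fs_hz):
    """Bandwidth of the complex pole pair of 1 / (1 - a_1 z^-1 - a_2 z^-2)."""
    radius = math.sqrt(-a[1])          # valid only for a complex-conjugate pair
    return -math.log(radius) * fs_hz / math.pi


fs = 7500.0                            # sampling rate used in this work
a = [1.2, -0.72]                       # hypothetical second-order filter
damped = damp_bandwidths(a, 50.0, fs)
increase = pole_pair_bandwidth_hz(damped, fs) - pole_pair_bandwidth_hz(a, fs)
```

For this example the bandwidth increase evaluates to 50 Hz, while the pole angle, and hence the formant frequency, is unchanged.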
5.4 Extraction of glottal Parameters
The algorithm for extracting the glottal parameters, as defined in Section 2.4.2, is
shown in Fig 5.11. A robust, reliable algorithm for extracting these parameters is
required, particularly in the case of the glottal waveform obtained from
autocorrelation methods, as it may contain undesirable ripple. For this reason, the
glottal waveform is smoothed before examination.
The peak value, which separates opening and closing portions, is first found, as it
is always the most reliable point in the waveform. The closure point is found by
the cessation of negative slope to the right of the peak, or else by the pitch
period end, whichever comes first. This second clause is required, particularly in
the case of an autocorrelation derived waveform, as sometimes the negative slope
continues beyond closure. The opening location, which can be more difficult to
locate, is obtained by going left of the peak in the same manner. Bumps in the
waveform (due to formant ripple) are ignored if the amplitude between the peak
location and the current location is less than the overall range of the signal
divided by 2.5. This value was found to be appropriate from visual examination
of signals.
Once these three locations are determined, the glottal parameters are easily
extracted. The closure location is used to determine the amplitude of the glottal
pulse.
5.5 Experimental Results and Conclusions
In this section, a comparison of the waveforms obtained from each
autocorrelation-based method is carried out, and then the best of these, the
first-order preemphasis method, is compared to the covariance derived analysis
method. This is followed by a general discussion and explanation of results
obtained.
5.5.1 Comparison of Preemphasis Methods
The glottal waveforms obtained for each inverse filtering method for the vowels
/a/ and /er/ are shown in Fig 5.12 and 5.13. These waveforms are consistent
with the general trend of results obtained from all the vowels analysed. The
most consistent results for the vast majority of vowels are obtained using first
order preemphasis. It appears that the other methods use too much preemphasis,
which is as bad as too little [37].
While good results for the adaptive multi-stage method are reported for Japanese
vowels, it does not work well for English vowels. For the vowel /er/, it yields
reasonable looking waveforms; however, in general the results were similar to those
for the vowel /a/. A filter whose coefficients are based on the incoming signal
should detect when less preemphasis is required; as it evidently does not, the
method is of little use in the majority of cases. Similar results are obtained for
the first order adaptive method, although in this case it works well for /a/ but
not for /er/.
Fig 5.12 Glottal Waveforms obtained for the vowel /a/ using preemphasis
methods: (a) First Order Preemphasis, (b) Adaptive First Order Preemphasis,
(c) Adaptive Multi Order Preemphasis, (d) Pitch Synchronous with First Order
Preemphasis
Fig 5.13 Glottal Waveforms obtained for the vowel /er/ using preemphasis
methods: (a) First Order Preemphasis, (b) Adaptive First Order Preemphasis,
(c) Adaptive Multi Order Preemphasis, (d) Pitch Synchronous with First Order
Preemphasis
In the case of the pitch synchronous method, the results were also inconsistent.
This may be attributed to the difficulty of defining glottal closure and opening,
the former from the residual, and the latter from the resulting glottal waveform.
Using iteration to improve the waveform, as suggested by Hedelin [38], actually
worsens the situation, due to the lack of a very accurate starting estimate. In
cases where the open quotient is high, the interval for analysis should be a lot
less than a pitch period to avoid the open region as discussed earlier, which is
not recommended for the autocorrelation method.
Thus, ordinary preemphasis, despite its discontinuities and occasional erratic
behaviour, is the preferred inverse filtering method. It will be compared to the
CGI method in the next section.
5.5.2 Comparison of CGI and first order preemphasis methods
The glottal waveforms obtained for the two methods are shown in Fig 5.14 - 5.17
for the vowels /u/, /ae/, /ah/ and /uh/. It is immediately obvious that in all cases
the waveforms obtained from CGI analysis are superior, containing far less (if
any) formant ripple. Generally the waveforms obtained from CGI analysis are of
a high standard. In some cases, reasonable waveforms are obtained from both
methods, and a comparison of the properties of the transfer function obtained from
both methods is advisable.
The area functions and corresponding bandwidths obtained for both methods for a
wide cross-section of vowels are shown in Fig 5.18 - 5.24. In most cases, the
area profiles obtained from both methods are quite similar, differing in finer
details. It is noted that formant bandwidths are in general far greater for the
autocorrelation method. First formants are reasonably close in both cases, with
the percentage error greater for the higher formants.
Fig 5.15 Glottal Waveforms obtained for the vowel /ae/: (a) First Order
Preemphasis, (b) CGI analysis
Fig 5.16 Glottal Waveforms obtained for the vowel /ah/: (a) First Order
Preemphasis, (b) CGI analysis
Fig 5.18 Area Functions, Formants and Bandwidths for vowel /i/
(a) CGI method
Formant     1     2     3     4
F (Hz)    254  1795  2215  3004
B (Hz)     44   108   440   153
(b) First order Preemphasis method
Formant     1     2     3     4
F (Hz)    272  1898  2542  3072
B (Hz)     18   184   622   191
Fig 5.19 Area Functions, Formants and Bandwidths for vowel /ah/
(a) CGI method
Formant     1     2     3     4
F (Hz)    497   749  2000  2869
B (Hz)    188   322   495   250
(b) First order Preemphasis method
Formant     1     2     3     4
F (Hz)    666   799  2542  3072
B (Hz)    301   258   134   232
Fig 5.20 Area Functions, Formants and Bandwidths for vowel /uh/
(a) CGI method
Formant     1     2     3     4
F (Hz)    601  1199  1953  2870
B (Hz)     70   173   345    44
(b) First order Preemphasis method
Formant     1     2     3     4
F (Hz)    646  1399  2056  2873
B (Hz)    306   600   375    83
Fig 5.21 Area Functions, Formants and Bandwidths for vowel /er/
(a) CGI method
Formant     1     2     3     4
F (Hz)    405  1441  2063  3021
B (Hz)     54   244   212    97
(b) First order Preemphasis method
Formant     1     2     3     4
F (Hz)    396  1648  2149  2905
B (Hz)    134   302   340   156
Fig 5.22 Area Functions, Formants and Bandwidths for vowel /a/
(a) CGI method
Formant     1     2     3     4
F (Hz)    616  1088  2223  3206
B (Hz)     42   312   196    79
(b) First order Preemphasis method
Formant     1     2     3     4
F (Hz)    604  1106  2195  3195
B (Hz)    105   778   344    83
Fig 5.23 Area Functions, Formants and Bandwidths for vowel M
(a) CGI method
Formant     1     2     3     4
F (Hz)    341  1216  2014  2781
B (Hz)     35   134   270   109
(b) First order Preemphasis method
Formant     1     2     3     4
F (Hz)    321  1269  1902  3109
B (Hz)    189   429   436   520
Fig 5.24 Area Functions, Formants and Bandwidths for vowel /ae/
(a) CGI method
Formant     1     2     3     4
F (Hz)    414  1607  2192  3381
B (Hz)     81   185   283   296
(b) First order Preemphasis method
Formant     1     2     3     4
F (Hz)    362  1724  2265  3146
B (Hz)    119   334   527   303
5.5.3 Explanation of Results
To understand the results, it is necessary to consider the analysis under two
separate headings, i.e. inherent properties of the LPC analysis, and the
corresponding effect of source-tract interaction.
5.5.3.1 Autocorrelation vs. Covariance Methods
For comparable frame lengths, of the order of two to three pitch periods, both the
covariance and autocorrelation methods yield similar results. In general, long
windows for the autocorrelation method provide poor time resolution, with
variation in formants smeared or averaged out. Shorter windows have bad
frequency resolution, and are not recommended. For shorter frame lengths, it is
acknowledged that the covariance method is the most accurate, although instability
problems may arise.
5.5.3.2 Source-Tract Interaction
The perceptual significance of source-tract interaction due to the supraglottal
pressure, as discussed in Section 2.3.1, is the subject of much discussion. Its
effects on asynchronous and pitch synchronous analysis are quite pronounced, with
formant shifts and significant formant damping, which in turn affect the glottal
waveform extracted. It is generally agreed that the main effect of source-tract
interaction is a widening of bandwidths, particularly of the first two formants.
This is apparent from examining the spectra obtained from the two methods being
compared.
For the above reasons, CGI analysis is preferred. However, it is claimed by
Holmes [39] that even for this method, source-tract interaction cannot be
discounted completely, and base line drift, for example, can result in significant
formant and bandwidth errors. This would explain the need for the interactive
vocal cord model described in Section 2.3.1. According to Anath et al. [40],
however, it can be discounted except in cases of vowels with a very low first
formant. This is attributed [41] to the fact that the assumption of plane wave
propagation breaks down at frequencies below about 300 Hz, and non-negligible
interaction takes place. This is borne out by the glottal waveform for the vowel
N, shown in Fig 5.25, which contains remains of formants in some cycles.
However, the CGI waveform will still be more accurate than that of the preemphasis
method. Due to this interaction, the detection of the CGI region is more difficult
for vowels with a low first formant. This is shown in Fig 5.26, which is the
NMSE waveform obtained for the vowel /u/. Here, the CGI could not be
obtained easily from manual inspection. However, the method for extracting the
CGI proposed here resulted in the ripple free waveform of Fig 5.14b, because its
general location was predefined in the algorithm.
Fig 5.25 Glottal Waveform (CGI analysis) for the vowel N
CHAPTER 6 ARTICULATORY SPEECH CODING
6.1 Introduction
Methods for using the closed glottal interval analysis described in the previous
chapter, in conjunction with the ASY synthesiser, to develop an articulatory coding
system are discussed. This includes a general discussion of speech coding, in
particular articulatory coding, and a comparison of quantization methods, including
an analysis of suitable distortion measures. A method for generating an
articulatory codebook using existing techniques is presented, along with a procedure
for extending it to a linked codebook with the acoustic domain. The limitations
of the method are then discussed, and a new proposal for generating an optimum
articulatory codebook is investigated. Finally, the design of a glottal codebook is
discussed.
6.2 General Speech Coding System
A typical speech coding system is shown in Fig 6.1. The input speech, s(n), is
analysed, and a set of parameters, x(n), are extracted.
Fig 6.1 Speech Coding System: (a) Transmitter, (b) Receiver
These continuous amplitude signals are then quantized to y(n) and encoded into
transmission parameters c(n) before being sent over a communications channel. At
the receiver, assuming a noise free channel, the signal, c(n), is decoded and the
speech is reconstructed from the resulting parameters using a speech synthesiser.
6.3 Quantization
Quantization is the process whereby continuous amplitude signals are converted to
discrete amplitude signals, i.e. from above
$$ y = Q(x) \tag{6.1} $$
where Q is the quantization transformation. There are a certain number of
allowable levels, dependent on the bit allocation for each parameter, or set of
parameters. There are two main types of quantization:
(i) Scalar: In this case each parameter from a set is quantized
independently.
(ii) Vector: Here, each set of parameters, known as a vector, is quantized
as a block, and is represented by a single symbol. This can result in
great reductions in bit rates, and is the type to be considered here.
Fig 6.2 Vector Quantization
6.3.1 Vector Quantization
The application of Vector Quantization (VQ) to speech coding was first
investigated by Gray [42]. The usual process of VQ in speech coding is shown
in Fig 6.2. There are two main steps. First a group, or codebook, of
representative vectors is generated from a large set of training vectors. This
codebook should be the best possible representation of the entire space of vectors,
and should be designed to minimize the overall quantization distortion. It is
generated by partitioning the vector space into
$$ L = 2^{B} \tag{6.2} $$
partitions, where B is the number of bits available to represent each vector. Each
partition is known as a cell, i.e. the partition,
$$ P = \{ C_i ;\ 1 \le i \le L \} \tag{6.3} $$
defines L regions (clusters), each represented at its centre by its centroid
(template). Thus each vector of the training set is allocated to a particular
cluster, which is chosen so as to minimize the overall quantization error between
the training data and templates. Many iterative clustering algorithms are available
[43,44].
Once the codebook of templates has been generated, each input vector is quantized
by searching the codebook to find its closest match, according to a suitable
distortion measure.
The main advantage of VQ is that once a suitable vector (of arbitrary length) is
chosen from the codebook, a symbol to represent this vector is transmitted from
the encoder, rather than the vector itself. Thus a significant saving in bit rates is
achieved. The vector is then reconstructed from the symbol at the receiver by
the decoder. Using lower bit rates carries the penalty of lower fidelity, so
distortion minimization is very important.
VQ is especially useful when the vector parameters have statistical
interdependence. It has been shown [45] that, in these cases particularly, it is far
more economical than scalar quantization.
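The codebook search and symbol transmission described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the toy two-dimensional codebook and the choice of mean squared error as the distortion measure are assumptions, not the thesis implementation.

```python
def mean_squared_error(x, y):
    """Mean squared error between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def encode(vector, codebook, distortion=mean_squared_error):
    """Return the symbol (index) of the template closest to the input vector."""
    return min(range(len(codebook)),
               key=lambda i: distortion(vector, codebook[i]))


def decode(symbol, codebook):
    """Reconstruct the vector at the receiver from the transmitted symbol."""
    return codebook[symbol]


# Hypothetical codebook of L = 4 templates, so B = 2 bits per vector.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [-2.0, 3.0]]
bits_per_vector = (len(codebook) - 1).bit_length()

symbol = encode([0.9, 1.2], codebook)
reconstructed = decode(symbol, codebook)
```

Only `symbol` crosses the channel; the receiver needs the same codebook to recover the template.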
6.4 Distortion Measures
Establishing a suitable distortion measure is of major importance in coding LPC
parameters, as it is used both in the codebook generation stage and in the
quantization of the speech to be transmitted.
A distortion measure used in coding should have the following properties:
(i) It must be easy to compute.
(ii) It must be easily analysed.
(iii) It should be subjectively relevant, i.e. differences in distortion values
should indicate similar variations in speech quality.
The three most common categories of distance measures are defined here, to be
enlarged on in Sections 6.4.1 - 6.4.3. In all cases, the vector dimension is
denoted as M, corresponding to the order of the prediction filter.
(i) The most mathematically obvious, and also the most common, measure
is the mean squared error. This is defined as
$$ d(x, y) = \frac{1}{M} (x - y)^{T} (x - y) \tag{6.6} $$
$$ = \frac{1}{M} \sum_{i=1}^{M} (x_i - y_i)^{2} \tag{6.7} $$
It can be applied to many coefficient transformations, as discussed
below.
(ii) A weighted mean squared error distortion measure is sometimes used,
in order to emphasise certain parameters which are more relevant.
This is generally defined as
$$ d(x, y) = (x - y)^{T} W (x - y) \tag{6.8} $$
where W is a constant weight matrix.
(iii) A third type of measure is a form of (ii) where the weighting matrix
is variable, dependent on the input vector, x. In this category is a well
known LPC distortion measure, named after its developers, Itakura and
Saito [48].
6.4.1 Distortion Measures based on the Mean Squared Error
The distortion measures considered here are based on the predictor coefficients,
and corresponding transformations.
(i) The simplest LPC distortion measure is the straightforward difference of
predictor coefficients, {a}. Thus, if {â} is an approximation to {a}, the distortion
measure is defined as:
$$ d(a, \hat{a}) = \frac{1}{M} \sum_{i=1}^{M} (a_i - \hat{a}_i)^{2} \tag{6.9} $$
(ii) Problems with instability may arise when quantizing the predictor coefficients.
Interpolation between two vectors of stable coefficients may result in unstable
filters. Also, less than perfect accuracy of transmission can lead to
instability. For this reason, the PARCOR coefficients, {k}, are preferred. Since,
when stable, they are bounded by unity, it is easy to detect instability, and
interpolation between two sets of PARCOR coefficients is guaranteed to result in
a stable filter.
For values of {k} approaching unity, the poles approach the unit circle, and small
changes in {k} can result in large changes in the spectrum. Also, while all
PARCOR coefficients are bounded by unity, there is a non-uniform distribution of
coefficients over this interval, with the distributions of all except the first two
coefficients concentrated around zero (the first two are close to unity). Thus
uniform quantization is both wasteful and ill-advised.
For this reason, PARCOR coefficients are usually transformed into another set of
coefficients that exhibit lower spectral sensitivity. A particular transformation, and
one which is often used, is that to the log area ratios (LAR), which have the
property that small changes in the LAR are approximately proportional to
corresponding changes in the log spectrum of H(z). These are defined as follows:
$$ G_i = 0.5 \log \frac{1 + k_i}{1 - k_i} \tag{6.10} $$
Thus these parameters are suitable for uniform quantization by the mean squared
error distortion measure.
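The effect of the log area ratio transformation of (6.10) can be seen in a short Python sketch. The function names and the step size of 0.01 are illustrative assumptions. The sketch compares the change in LAR caused by the same small change in a PARCOR coefficient near zero and near unity.

```python
import math


def log_area_ratios(k):
    """Log area ratios G_i = 0.5 * log((1 + k_i) / (1 - k_i)), for |k_i| < 1."""
    return [0.5 * math.log((1.0 + ki) / (1.0 - ki)) for ki in k]


def lar_step(k0, dk=0.01):
    """Change in LAR produced by changing one PARCOR coefficient from k0 to k0 + dk."""
    return log_area_ratios([k0 + dk])[0] - log_area_ratios([k0])[0]


step_near_zero = lar_step(0.0)    # coefficient in the middle of its range
step_near_unity = lar_step(0.98)  # coefficient close to the stability bound
```

The same step in k produces a far larger step in LAR near unity than near zero, so a uniform grid in the LAR domain places its levels more densely, in terms of k, near unity, where the spectrum is most sensitive.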
(iii) Another coefficient transformation used in LPC is the transformation to the
corresponding cepstrum coefficients. These are spectral parameters, defined as
$$ c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log |S(\omega)| \, e^{jn\omega} \, d\omega \tag{6.11} $$
where
$$ S(\omega) = H(z) \big|_{z = e^{j\omega}} \tag{6.12} $$
It can be shown [46] that the appropriate LPC transformation, which obtains the
cepstrum coefficients of the LPC derived spectrum envelope from the predictor
coefficients, is
$$ c_m = -a_m - \sum_{n=1}^{m-1} \frac{n}{m} \, c_n \, a_{m-n}, \qquad 1 \le m \le M \tag{6.13} $$
It has been noted by Shirai [47] that the lower order cepstrum coefficients show
the global spectral shape. For this reason, they can be omitted from the
distortion measure to emphasise the matching of the pole structure. Thus a
modified distortion measure for the cepstrum coefficients is
$$ d(x, y) = \frac{1}{M} \sum_{i=n}^{M} (x_i - y_i)^{2}, \qquad n > 3 \tag{6.14} $$
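A minimal Python sketch of the recursion (6.13) and the truncated measure (6.14) follows. It rests on several assumptions that are not stated in this section: the inverse filter is taken as A(z) = 1 + a_1 z^-1 + ... + a_M z^-M, each term of the sum is weighted by n/m (the form that follows from differentiating log H(z) under that convention), and the truncated measure keeps the 1/M normalisation of (6.7). The function names are illustrative.

```python
def lpc_to_cepstrum(a):
    """Cepstrum c_1..c_M of the all-pole model from predictor coefficients a_1..a_M.

    Assumes A(z) = 1 + sum_k a_k z^-k, so that c_1 = -a_1.
    """
    c = []
    for m in range(1, len(a) + 1):
        acc = -a[m - 1]
        for n in range(1, m):
            acc -= (n / m) * c[n - 1] * a[m - n - 1]
        c.append(acc)
    return c


def cepstral_distortion(x, y, first=1):
    """Squared error over coefficients first..M, normalised by M (assumed)."""
    m = len(x)
    return sum((x[i] - y[i]) ** 2 for i in range(first - 1, m)) / m


# First-order check: for A(z) = 1 + 0.5 z^-1 the series of -log A(z)
# gives c_1 = -0.5 and c_2 = 0.5**2 / 2.
c = lpc_to_cepstrum([0.5, 0.0])

full = cepstral_distortion([1, 2, 3, 4], [0, 2, 3, 6])
high_order_only = cepstral_distortion([1, 2, 3, 4], [0, 2, 3, 6], first=4)
```

Setting `first` above 3 drops the low order coefficients from the comparison, as suggested in the text.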
6.4.2 Distortion Measures based on the Weighted Mean Squared Error
An example in this category is a weighted measure of formant and bandwidth
differences. These are first normalized with respect to average formant values, and
extra weighting is put on matching the first three formants in particular. Another
example, using articulatory parameters, will be discussed in Section 6.6.3.
6.4.3 Itakura-Saito Distortion Measure
Since the Itakura-Saito distortion measure, which is based on the vector position in
parameter space, was first postulated [48], it has been used extensively in vector
quantization of speech. From Fig 6.1, it can be seen that there are two sources
of error introduced in a speech coding system: that introduced during LPC
analysis, and the error due to quantization. LPC analysis is designed to minimise
the residual (error) energy. Thus ideally the quantization step should also
minimize this error. Using statistical principles, Itakura showed that the log
likelihood ratio can be expressed as the log of the ratio of prediction residuals.
Given a segment of speech, X, with estimated predictor coefficients, {a}, a
distortion measure between X and a template {â}, which is a centroid from the
codebook, is sought. The log of the conditional joint probability density, known
as the log likelihood ratio, is denoted as
$$ \log [ p(X \mid \hat{a}) ] = L(X, \hat{a}) = L(a, \hat{a}) \tag{6.15} $$
It can be shown [48] that this ratio may be reduced to a powerful distortion
measure, i.e.
$$ d(a, \hat{a}) = \log \frac{\hat{a}^{T} V \hat{a}}{a^{T} V a} \tag{6.16} $$
V is the correlation matrix obtained during LPC minimization. Usually the
distortion measure is defined for the autocorrelation method of linear prediction, so
that
$$ v(i) = \frac{1}{N} \sum_{n=1}^{N-i} x(n) \, x(n+i), \qquad 1 \le i \le M \tag{6.17} $$
Numerical methods are available for simplifying this distortion measure, in order
to save on matrix multiplication. It is noted that the correlation matrix is a
by-product of LPC, and does not have to be recomputed.
The corresponding matrix for the covariance method is the covariance matrix, as
already discussed in Section 4.3.3. The Itakura-Saito distortion measure is not
symmetric, since for any two vectors, u and v, $V_u \ne V_v$. Because of this, a
symmetrical distortion measure is proposed, such that
$$ d(X, \hat{a}) = 0.5 \, ( d(a, \hat{a}) + d(\hat{a}, a) ) \tag{6.18} $$
i.e.
$$ d(X, \hat{a}) = 0.5 \left( \log \frac{\hat{a}^{T} V \hat{a}}{a^{T} V a} + \log \frac{a^{T} V' a}{\hat{a}^{T} V' \hat{a}} \right) \tag{6.19} $$
where V' is the corresponding matrix obtained from {â}. This matrix must be
stored for each codeword.
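The log likelihood ratio and its symmetrised form can be sketched in Python as follows. The sketch assumes the autocorrelation method, so that V is a Toeplitz matrix built from the lags v(0)..v(M), and it assumes that the coefficient vectors are augmented with a leading 1, i.e. [1, a_1, ..., a_M]. Both conventions, the function names and the toy first-order values are assumptions made for illustration.

```python
import math


def autocorrelation(x, order):
    """Lags v(0)..v(order) of the frame x, normalised by the frame length."""
    n_samples = len(x)
    return [sum(x[n] * x[n + i] for n in range(n_samples - i)) / n_samples
            for i in range(order + 1)]


def quadratic_form(a, v):
    """a^T V a, where V is the Toeplitz matrix with entries v(|i - j|)."""
    size = len(a)
    return sum(a[i] * a[j] * v[abs(i - j)]
               for i in range(size) for j in range(size))


def log_likelihood_ratio(a, a_hat, v):
    """Log of the ratio of residual energies: template a_hat against optimum a."""
    return math.log(quadratic_form(a_hat, v) / quadratic_form(a, v))


def symmetric_distortion(a, a_hat, v, v_hat):
    """Average of the two one-sided measures; v_hat belongs to the codeword."""
    return 0.5 * (log_likelihood_ratio(a, a_hat, v)
                  + log_likelihood_ratio(a_hat, a, v_hat))


# Toy first-order example: a is the optimum predictor for lags v,
# a_hat is the optimum predictor for lags v_hat.
v, a = [1.0, 0.5], [1.0, -0.5]
v_hat, a_hat = [1.0, 0.0], [1.0, 0.0]

one_sided = log_likelihood_ratio(a, a_hat, v)
two_sided = symmetric_distortion(a, a_hat, v, v_hat)
```

The one-sided measure is zero when the template equals the optimum predictor and positive otherwise, since {a} minimises the residual energy for V.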
6.5 Articulatory Speech Coding System
In Fig 6.1, the type of synthesiser used at the receiver determines the type of
parameters used for coding. In this research, an articulatory synthesiser is used,
hence the articulatory parameters must be obtained from the speech signal, both
for on-line coding and codebook generation. This will be discussed further in
Chapter 7; suffice it to say here that it is a non-trivial problem, which has no
simple solution. It is certainly impossible to do on-line.
To overcome this problem, the idea of a linked codebook i.e. a look-up table of
acoustic parameters matched to corresponding articulatory parameters is proposed.
The procedure for generating such a codebook is shown in Fig. 6.3.
Fig 6.3 Construction of Linked Codebook
First, the articulatory space is sampled in a representative manner to generate an
articulatory codebook. Thus centroids are obtained in the articulatory domain.
Once the articulatory codebook is generated, speech is synthesised from each
centroid, and the corresponding acoustic parameters extracted. This is the basis of
the linked codebook, or look-up table. Thus once this linked codebook is created,
speech coding is carried out, as shown in Fig 6.4. Acoustic parameters are
extracted from the incoming speech, and the closest match in the acoustic part of
the codebook is found using a suitable distortion measure. The corresponding
articulatory parameters are found in the look-up table, and these are encoded for
transmission.
Fig 6.4 Articulatory Speech Coding System
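The construction and use of the linked codebook can be sketched as a look-up table of (acoustic, articulatory) pairs. In the Python sketch below, `toy_synthesise` and `toy_analyse` are hypothetical stand-ins for the ASY synthesiser and the acoustic analysis; the shapes and the squared error measure are likewise placeholders chosen only to make the example run.

```python
def build_linked_codebook(articulatory_codebook, synthesise, analyse):
    """Pair each articulatory centroid with the acoustic parameters of its speech."""
    return [(analyse(synthesise(shape)), shape)
            for shape in articulatory_codebook]


def encode(acoustic, linked_codebook, distortion):
    """Index of the entry whose acoustic part is closest to the input."""
    return min(range(len(linked_codebook)),
               key=lambda i: distortion(acoustic, linked_codebook[i][0]))


def squared_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def toy_synthesise(shape):
    """Hypothetical stand-in for the synthesiser."""
    return [shape[0] * 10.0, shape[1] * 10.0]


def toy_analyse(speech):
    """Hypothetical stand-in for the acoustic analysis."""
    return [speech[0] + speech[1], speech[0] - speech[1]]


shapes = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.3)]
linked = build_linked_codebook(shapes, toy_synthesise, toy_analyse)

index = encode([10.2, 0.1], linked, squared_error)
articulatory = linked[index][1]  # parameters that would be encoded for transmission
```

The expensive step, synthesis and analysis of every shape, is done once off-line; on-line coding reduces to a codebook search.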
6.6 Generation of the Articulatory Codebook
The procedure for sampling the articulatory space is more difficult than for usual
codebooks, particularly as centroids obtained in the acoustic domain do not, in
general, coincide with centroids in the articulatory domain. Two methods are
proposed here, the first of which is used for evaluating distance measures.
6.6.1 Sensitivity analysis method
In this method, a form of which has been used by Schroeter et al. [49] for
articulatory coding, a training sequence is not used. Instead the positions of
certain key features (in this case vowels) are obtained in the articulatory domain,
and interpolation carried out between these positions to generate a codebook of
shapes. The procedure carried out here is based on a sensitivity analysis of the
articulatory parameters of ASY, carried out by Kuc et al. [50] for vowel
recognition applications. This analysis showed that the articulatory parameters,
in order of greatest to least sensitivity, are as follows: jaw angle, tongue body
coordinates, lip coordinates, tongue tip coordinates and hyoid, all as defined
previously in Section 2.4.1. For quantization, four extreme vocal tract shapes in
the articulatory space were identified by the vowels /a/, M, /u/ and /ae/. The
range of each parameter between these extreme shapes was established, and divided
into equal increments, the number of increments depending on the sensitivity. The
numbers of increments were as follows: jaw angle (10), tongue body, C (length = 8,
angle = 8), lips, L (height = 4, protrusion = 2), tongue tip, T (length = 0,
angle = 3), hyoid, H (0).
Those parameters which were not quantized were set at their mean values obtained
from the four extreme positions. This quantization produced 15,360 discrete vocal
tract shapes.
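The figure of 15,360 shapes is the product of the increment counts for the quantized parameters, which a short Python sketch can confirm. The parameter names and the normalised range [0, 1] used for every parameter are hypothetical placeholders; the real ranges are those between the four extreme vowel shapes.

```python
import math
from itertools import product

# Number of levels per quantized parameter, as listed in the text.
INCREMENTS = {
    "jaw_angle": 10,
    "tongue_body_length": 8,
    "tongue_body_angle": 8,
    "lip_height": 4,
    "lip_protrusion": 2,
    "tongue_tip_angle": 3,
}


def levels(lo, hi, count):
    """count equally spaced values from lo to hi inclusive (count >= 2)."""
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]


# Hypothetical normalised range for every parameter.
grids = [levels(0.0, 1.0, n) for n in INCREMENTS.values()]
shapes = list(product(*grids))

bits_needed = math.ceil(math.log2(len(shapes)))
```

Here `len(shapes)` is 10 x 8 x 8 x 4 x 2 x 3 = 15,360, which would need 14 bits to index.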
6.6.2 Limitations of Sensitivity Analysis method
While the sensitivity analysis method of generating the articulatory codebook
should sample the articulatory space quite representatively, there is no guarantee
that this codebook is the optimum one. No clustering is done as such, because
there is no training data used, so no distortion is measured. For vector
quantization the number of bits used to describe a vector is usually about B =
10. So the average codebook size is
$$ L = 2^{10} = 1024 \tag{6.20} $$
Codebook sizes greater than this are usually too large to handle, in terms of both
storage costs and particularly computational considerations e.g. codebook searches.
Attempts to reduce the number of shapes have been tried by proportionally
reducing the number of increments used. This only reduces fidelity even more.
The basic problem is that the codebook is generated purely from synthetic shapes.
For a codebook of size L, generated by clustering, the number of training data
vectors is recommended to be ≥ 50L, which for this case would be of the order of
50,000 vectors. The ideal situation would be to use real speech to generate the
codebook, which returns the problem to that of estimating the articulatory
parameters from the speech wave.
6.6.3 Training set method
A method has been proposed by Shirai et al. [47] for estimating the articulatory
parameters from the speech wave, using a non-linear iterative procedure. It has
been used successfully for Japanese vowel recognition. This procedure will be
described in Chapter 7.
Using this method, applied to the ASY synthesiser, a large training set of
articulatory parameters could be obtained directly from the speech wave. These
articulatory parameters would then be clustered in the articulatory domain, using a
suitable distortion measure. The distortion measure proposed here would be a
weighted mean squared error of articulatory parameters, with the weights
determined from the relative sensitivity of each parameter, as outlined previously.
So centroids are obtained in the articulatory domain, speech is synthesised from
these, and the corresponding acoustic parameters are obtained. These parameters
are then used to construct a linked codebook in the same manner as the
sensitivity analysis method.
6.7 Linked Codebook Generation
The generation of the linked codebook involved synthesising speech from the
15,360 shapes obtained from quantization of the articulatory space, as discussed in
Section 6.6.1. For this, ASY, which is essentially an interactive synthesiser, was
converted for use as a subroutine. ASY was written in Fortran, so this involved
interfacing C and Fortran subroutines. For each articulatory shape, a speech
segment approximately 20 ms long was generated, using default source
parameters, i.e. an open quotient of 0.5, a speed quotient of 3.0, and a pitch of
100 Hz. ASY generates speech at 20 kHz, so this speech had to be down-sampled to
7.5 kHz (the sampling rate of the incoming speech), and then analysed using the
closed glottal interval analysis of Chapter 5.
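The sample counts implied by these figures can be worked out with a few lines of Python. The constant names are illustrative, and the sketch only does the arithmetic; it does not resample anything. A conversion by the rational factor 3/8 is commonly realised by interpolating by 3, low-pass filtering and decimating by 8, although the text does not state which procedure was actually used.

```python
from fractions import Fraction

ASY_RATE_HZ = 20000             # output rate of the synthesiser
SPEECH_RATE_HZ = 7500           # rate of the incoming speech
SEGMENT_S = Fraction(20, 1000)  # 20 ms segment per articulatory shape
PITCH_HZ = 100                  # default pitch

ratio = Fraction(SPEECH_RATE_HZ, ASY_RATE_HZ)          # 3/8
samples_in = int(SEGMENT_S * ASY_RATE_HZ)              # 400 samples at 20 kHz
samples_out = int(samples_in * ratio)                  # 150 samples at 7.5 kHz
samples_per_pitch_period = SPEECH_RATE_HZ // PITCH_HZ  # 75 samples
```

Each 20 ms segment therefore spans two pitch periods of 75 samples at the analysis rate.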
Initially, it was proposed to compute the transfer functions for the look-up
table directly from the cross-sectional areas, as described in Section 3.3. However,
it was felt that the acoustic parameters of the model and of the real speech
should be extracted under identical conditions, otherwise the results would be
inconsistent. Only the first four formant frequencies could have been compared
directly, as the LPC transfer function does not correspond to the transfer function
of ASY, which as well as incorporating losses, actually extracts a much larger
number of formants, due to the number of cross-sectional areas used (greater than
twenty). By using the same method for both, limitations of the LPC analysis are
inherent in both sets of parameters, and are effectively cancelled out.
A linked codebook based on an articulatory codebook derived from a training set
of real speech data would be generated in a similar manner, with
speech being synthesised from the centroids obtained from articulatory clustering.
6.8 Evaluation of Distortion Measures
A study of the discussed distortion measures was undertaken for various vowels.
The closest match, according to the minimum distance for each type of measure,
was found from the acoustic part of the codebook described above.
An example of the chosen vectors from the codebook, and their corresponding
area functions are shown in Fig 6.5, for the vowel /a/. In general, all the
distortion measures yielded similar results. The closest matches obtained were
from the log area ratio and covariance measures, whereas those obtained from the
predictor and cepstrum coefficients were not as accurate. The high order cepstrum
coefficients produced the same results as the ordinary mean squared error cepstrum
measure. The Itakura-Saito autocorrelation measure yielded the worst results, as
would be expected, since the LPC autocorrelation method was not used. The
formant and bandwidth measures were not very consistent either. The difficulty
with this method is choosing the right weights, and in general methods which
essentially use normalized parameters, such as the predictor coefficients, are
preferred.
Fig 6.5 Area Functions obtained from various distance measures: (a) Actual match
required, (b) Itakura-Saito Method, (c) Covariance Matrix Method, (d) Log Area
Ratio Measure, (e) MSE of Cepstrum Coefficients, (f) MSE of predictor coefficients
It must be said, however, that a true evaluation could only be done in a full
speech coding system, by comparing the results for large amounts of parameters,
and by listening to the transmitted speech. However, for the reasons discussed in
Section 6.4.3, it is likely that the covariance measure, developed along the same
lines as the Itakura-Saito distortion measure, should yield the best results. The
log area ratios would also be very useful, particularly for articulatory applications,
as they are concerned with minimizing the error in vocal tract shape.
Another problem with the analysis is that, as ASY generates speech at a 20 kHz
sampling rate, down-sampling this will obviously have a detrimental effect.
ASY may be converted easily enough to output speech at 10 kHz, and sampling
rates below this are not recommended for articulatory synthesis, as its advantages
over LPC synthesis would no longer be apparent. So, it is suggested, for future
work, that the input speech should also be sampled at 10 kHz, and the filter
order for LPC analysis increased to 10. Because of the relatively low sampling
rate, higher formants (above 3500 Hz) will not be detected, and sometimes only
three formants will be extracted. This is another reason why direct comparison of
formants and bandwidths was not very successful in this analysis.
6.9 Glottal Codebook Design
The design of a glottal codebook to best represent the four source parameters, i.e.
pitch, amplitude of glottal pulse, open quotient and speed quotient, is
straightforward. The parameters are extracted from the glottal waveform obtained
from the algorithm for CGI analysis (Fig. 5.10), using the algorithm described in
Section 5.4 (Fig. 5.11). A fairly crude quantization should suffice, e.g. 3-4 bits
per parameter, if scalar quantization is used. If vector quantization is used, lower
bit rates should be possible, although the fact that the vector parameters are
essentially statistically independent of each other may mean that the extra
computation and resources required for codebook generation are not worthwhile. If
vector quantization is chosen, an example of an appropriate clustering algorithm
would be the modified K-means algorithm [42].
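A crude scalar quantization of the four source parameters might look like the following Python sketch. The parameter ranges, the example frame and the choice of a uniform quantizer with mid-cell reconstruction are all assumptions made for illustration; only the figure of 3 bits per parameter is taken from the range suggested in the text.

```python
def quantize(value, lo, hi, bits):
    """Index of the uniform cell containing value, clipped to the range [lo, hi]."""
    cells = 2 ** bits
    step = (hi - lo) / cells
    index = int((value - lo) / step)
    return max(0, min(cells - 1, index))


def reconstruct(index, lo, hi, bits):
    """Mid-point of the cell with the given index."""
    step = (hi - lo) / 2 ** bits
    return lo + (index + 0.5) * step


# Hypothetical parameter ranges, not taken from the thesis.
RANGES = {
    "pitch_hz": (50.0, 250.0),
    "amplitude": (0.0, 1.0),
    "open_quotient": (0.0, 1.0),
    "speed_quotient": (1.0, 5.0),
}
BITS = 3

frame = {"pitch_hz": 100.0, "amplitude": 0.7,
         "open_quotient": 0.5, "speed_quotient": 3.0}
codes = {name: quantize(frame[name], *RANGES[name], BITS) for name in frame}
bits_per_frame = BITS * len(codes)
```

With four parameters at 3 bits each, the source description costs 12 bits per frame under these assumptions.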
CHAPTER 7 ESTIMATION OF ARTICULATORY PARAMETERS FROM THE SPEECH WAVE
7.1 Introduction
A general discussion of the inverse problem of the vocal tract is presented here,
followed by a particular method (Shirai’s [47]), for which an algorithm is derived.
Analysis conditions are discussed at length, and possible reasons for the
disappointing results are presented. Ideas for improvement are then proposed,
including the need to try a wide range of methods.
7.2 Inverse Problem of the Vocal Tract
Let {y} be an M-dimensional vector which represents the acoustic parameters
of a speech wave, and {x} an N-dimensional vector which represents its
corresponding articulatory parameters. The acoustic parameters may be expressed
as a function of the articulatory parameters
$$ y = h(x) \tag{7.1} $$
where h(x) is the vocal tract function. The inverse problem
$$ x = h^{-1}(y) \tag{7.2} $$
of the vocal tract is thus defined as the problem of estimating the articulatory
parameters from the acoustic parameters, or from the speech wave.
Direct methods for determining the vocal tract shape from the speech wave are
available, e.g. Wakita’s method [20], discussed in previous chapters. There are
inherent problems with these methods. As discussed in Section 3.2, a lossless
tube model of the vocal tract is assumed, with ideal boundary conditions for the
glottis and lips. This approach leads to ambiguity in that two different area
functions can represent the same vocal tract transfer function, depending on the
imposed boundary conditions. In addition, no analytical solution exists for the
95
In reality, h(x) is a non-linear function of (x). There are two main sources of
this non-linearity. Firstly, the fundamental ambiguity, known as the ’ventriloquist
effect’ is apparent, where different vocal tract shapes can produce an identical
transfer function. Secondly, the articulators themselves impose natural constraints
on the vocal tract shape.
Because of the absence of analytical solutions for the lossy case, two alternatives,
based on numerical methods, are usually investigated, i.e. regression analysis and
constrained optimization.
7.2.1 Regression Analysis
The first is a direct approximation approach, i.e. representing $h^{-1}(y)$ as a
combination of simple functions, e.g. piecewise linear or polynomial. These
regression techniques have been shown by Atal [51] to give good results, provided
that enough non-linear terms are contained in the approximating function. A
training set of (x,y) data is required for this method, which is obtainable from the
model itself.
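The regression idea can be illustrated with a minimal sketch. It assumes a toy two-parameter function `toy_h` in place of the real vocal tract function (the articulatory model is not reproduced here), and a quadratic polynomial in y as the approximating function. Both are illustrative choices and not those of Atal [51].

```python
import numpy as np

def toy_h(x):
    """Hypothetical stand-in for the vocal tract function h(x); NOT the ASY model."""
    x1, x2 = x
    return np.array([x1 + 0.5 * x2, x1 * x2, x2 ** 2 + x1])

def features(y):
    """Approximating functions: a constant, plus linear and squared terms of y."""
    y = np.asarray(y, dtype=float)
    return np.concatenate([[1.0], y, y ** 2])

def fit_inverse(X, Y):
    """Least-squares fit of weights W so that x is approximated by features(y) @ W."""
    F = np.array([features(y) for y in Y])
    W, *_ = np.linalg.lstsq(F, X, rcond=None)
    return W

def predict(W, y):
    """Approximate articulatory vector for one acoustic vector y."""
    return features(y) @ W

# Training set of (x, y) pairs generated from the model itself.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(500, 2))
Y_train = np.array([toy_h(x) for x in X_train])
W = fit_inverse(X_train, Y_train)
```

Adding further non-linear terms to `features` corresponds to the requirement that enough non-linear terms be contained in the approximating function.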
7.2.2 Constrained Optimization
The second approach is a non-linear optimization of parameters, based on
the minimization of the error between the model output and the measured data.
This approach is usually preferred, as it should yield more accurate results, and is
more subjectively meaningful in relation to the non-linear estimation problem.
It is also based on real speech, rather than on the synthetic speech used in Section 7.2.1.
The inverse problem, as discussed in this research, is therefore a non-linear
optimization of parameters under a certain criterion, and must be solved iteratively.
Unfortunately, this problem is ill-posed, i.e. neither the uniqueness of the
solution nor the stability of the convergence is guaranteed. To convert it to a well-posed
problem, constraints must be imposed on the range of the articulatory parameters
and an appropriate initial estimate should be obtained. Many optimization
algorithms have been proposed [52-55], basically differing in the type of
constraints imposed and choice of initial estimate. The method chosen here was
that of Shirai et al. [47], because, as well as imposing comprehensive constraints
on the range of the articulatory parameters, it is claimed to yield excellent
results for Japanese vowel recognition.
7.3 Shirai’s Method
This method is essentially a modified version of the Newton-Raphson formula
$$ x^{i+1} = x^{i} + \frac{f(x^{i})}{f'(x^{i})} \tag{7.3a} $$
which may also be written as
$$ x^{i+1} = x^{i} + \lambda f(x^{i}) \tag{7.3b} $$
where $\lambda$ is a convergence parameter, which controls the speed of convergence.
Constraints on the direction of convergence are added to this basic formula.
Let {y} be the acoustic parameters measured accurately from the speech wave.
These parameters may be spectral parameters such as cepstrum coefficients or LPC
parameters. For each frame, the best estimate {x} of the articulatory parameters
is obtained so as to minimise the cost function
$$ J(x) = \| y - h(x) \|_{R}^{2} + \| x \|_{Q}^{2} + \| x - \hat{x}_{k-1} \|_{T}^{2} \tag{7.4} $$
where R, Q and T are weighting matrices and $\hat{x}_{k-1}$ is the articulatory estimate of the
previous frame. R is an M x M matrix, Q is an N x N matrix, and T is an
N x N matrix. In this notation, $\| v \|_{W}^{2} = v^{\top} W v$, so that each term is a scalar
distance measure.
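As a minimal sketch of how eqn (7.4) reduces to a single scalar, the following evaluates the three weighted terms for given vectors. The function and argument names are illustrative only, and `T_w` stands for the weighting matrix T (renamed to avoid confusion with a transpose).

```python
import numpy as np

def weighted_sq_norm(v, W):
    """Scalar v^T W v: the weighted squared norm used for each term of eqn (7.4)."""
    return float(v @ W @ v)

def cost(y, hx, x, x_prev, R, Q, T_w):
    """Cost J(x) for one frame.

    y: measured acoustic parameters (length M)
    hx: model acoustic parameters h(x) (length M)
    x: current articulatory estimate (length N)
    x_prev: articulatory estimate of the previous frame (length N)
    """
    acoustic_term = weighted_sq_norm(y - hx, R)        # acoustic error
    neutral_term = weighted_sq_norm(x, Q)              # deviation from neutral (x = 0)
    continuity_term = weighted_sq_norm(x - x_prev, T_w)  # deviation from previous frame
    return acoustic_term + neutral_term + continuity_term
```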
The first term is the weighted square error between the measured acoustic
parameters and those of the model. The second term restricts the deviation from
the neutral position, while the third term restricts the deviation from the estimate
of the previous frame.
It can be shown [47] that the solution minimising the function J is obtained by
the following iterative form:
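$$ x_{k}^{i+1} = x_{k}^{i} + \lambda_{i} \, \delta x_{k}^{i} \tag{7.5} $$
where
$$ \delta x_{k}^{i} = \left[ \left( \frac{\partial h(x_{k}^{i})}{\partial x} \right)^{\top} R \, \frac{\partial h(x_{k}^{i})}{\partial x} + Q + T \right]^{-1} \left\{ \left( \frac{\partial h(x_{k}^{i})}{\partial x} \right)^{\top} R \, \left( y_{k} - h(x_{k}^{i}) \right) - Q x_{k}^{i} + T \left( \hat{x}_{k-1} - x_{k}^{i} \right) \right\} \tag{7.6} $$
where i is the iteration number, k the frame number, and $\hat{x}_{k-1}$ the estimate
obtained for the previous frame. $\lambda_{i}$, the parameter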
which is used to monitor the speed of convergence, can be changed as the
iteration proceeds. It is often called the stepsize parameter or weighting
constant.
Eqns (7.5) and (7.6) together are of the form
$$ x^{i+1} = x^{i} + \lambda_{i} \{ B^{-1} A \} \tag{7.7a} $$
where A and B are matrices. In terms of scalar parameters, this can be written
in the form
$$ x^{i+1} = x^{i} + \lambda_{i} \frac{A}{B} \tag{7.7b} $$
As A contains the term (y - h(x)) and B contains the derivative of h(x), this is
in the same form as eqn. (7.3a).
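A single update of this form can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: B is taken to be the sum of the weighted Jacobian product and both regularising matrices, the names `update_step`, `H` (the M x N derivative of h) and `T_w` (the weighting matrix T) are hypothetical, and the linear system is solved directly instead of forming the inverse.

```python
import numpy as np

def update_step(x, y, hx, H, x_prev, R, Q, T_w, lam):
    """One iteration x <- x + lam * B^{-1} A, in the form of eqn (7.7a).

    x: current articulatory estimate (N)
    y: measured acoustic parameters (M)
    hx: model acoustic parameters h(x) (M)
    H: derivative of h with respect to x (M x N)
    x_prev: estimate of the previous frame (N)
    lam: convergence (stepsize) parameter
    """
    A = H.T @ R @ (y - hx) - Q @ x + T_w @ (x_prev - x)
    B = H.T @ R @ H + Q + T_w
    dx = np.linalg.solve(B, A)  # solves B dx = A without forming B^{-1}
    return x + lam * dx
```

For a linear h(x) and a stepsize of 1.0, one such step lands exactly on the minimum of J, which is a convenient check of the signs; for the non-linear vocal tract function a much smaller stepsize is needed.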
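The term $\partial h(x) / \partial x_{j}$ is a partial derivative with respect to each articulatory
parameter $x_{j}$ in {x}. Thus, since h(x) is an M x 1 matrix, its derivative is the M x N
matrix
$$ \frac{\partial h}{\partial x} = \begin{bmatrix} \frac{\partial h_{1}}{\partial x_{1}} & \cdots & \frac{\partial h_{1}}{\partial x_{N}} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_{M}}{\partial x_{1}} & \cdots & \frac{\partial h_{M}}{\partial x_{N}} \end{bmatrix} \tag{7.8} $$
The derivative of h(x) cannot be obtained analytically, so it is calculated by
making small changes around {x}, i.e. by evaluating $h(x + \Delta x)$. The weight matrices can be
varied as the iteration proceeds.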
7.4 Application of Shirai’s Method
A block diagram of the analysis procedure is shown in Fig 7.1. It comprises
three main parts: an articulatory synthesis algorithm, a spectral estimation
algorithm, and a minimization algorithm. The input speech, s(t), is analysed and
its acoustic parameters, y, extracted. From an initial estimate, x, synthesis is
carried out by the ASY synthesiser, using default glottal parameters, as was done
in Chapter 6. The synthetic speech s’(t), the approximation to s(t), is analysed to
extract its acoustic parameters, h(x), using the CGI analysis of Chapter 5. These
acoustic parameters are then compared to y using a specific error criterion, as
discussed in the previous section. From this comparison, a new estimate of the
articulatory parameters is obtained, and the iteration proceeds until the acoustic
parameters of the model are sufficiently close to the real speech. The first two
algorithms have been covered elsewhere, so the discussion here is mainly
concerned with the minimization algorithm, and analysis considerations.
[Fig 7.1: block diagram with the blocks Articulatory Synthesis (ASY), LPC (CGI) Analysis (applied both to the input speech s(t) and to the synthetic speech) and Error Minimization; inputs are s(t) and the default glottal parameters; labelled signals are x + dx, dx and h(x).]
Fig 7.1 Adaptive Estimation Algorithm
7.4.1 Minimization Algorithm
The third section, the minimization algorithm, is basically an implementation of
eqns (7.5) and (7.6), i.e. the appropriate change in the articulatory parameters is
computed, and they are then changed accordingly for the next iteration. For the
implementation of this algorithm, various factors had to be considered.
7.4.1.1 Adaptation of Mermelstein’s Model
The estimation algorithm was originally developed for Shirai’s articulatory model
[56]. This model, which is based on the statistical analysis of real data,
automatically incorporates physiological and phonological constraints, making it
possible to represent each articulatory position accurately using a minimum number
of parameters. Only six parameters are used in all, compared to ten for the
Mermelstein model.
Mermelstein’s model, also developed from real X-ray data, should incorporate
natural constraints, and therefore should be suitable for this type of analysis. In
order to reduce the number of parameters, average values were taken for four
parameters, i.e. the hyoid (X and Y) coordinates, tongue tip length, and nasal
parameter (velum). This was justified by the sensitivity analysis discussed in
Chapter 6. In addition, the four extreme positions established for vowels in
Chapter 6 were used as constraints, and the articulatory parameters were
normalized with respect to these within the range -1 ≤ x ≤ 1. This normalization
was done mainly because all the parameters of Shirai’s model were bounded by
unity. It also made it easier to see the deviation from the neutral condition (x=0),
a condition of the minimization procedure, and helped to ensure that the
articulatory parameters did not extend outside the admissible range. Finally, it aided
the computation of the derivative of h(x), the choice of a small change in
{x}, and the determination of the weighting matrix values.
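A minimal sketch of such a normalization is given below. It assumes a simple linear mapping between a lower and an upper extreme value for each parameter, so that the midpoint maps to 0; whether that midpoint coincides with the neutral position used in this work is not established here, and the function names are hypothetical.

```python
def normalise(value, lo, hi):
    """Map a raw articulatory value in [lo, hi] linearly onto [-1, 1]."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def denormalise(x, lo, hi):
    """Inverse mapping: normalized x in [-1, 1] back to the raw range [lo, hi]."""
    return lo + (x + 1.0) * (hi - lo) / 2.0

def clip_unit(x):
    """Keep a normalized parameter inside the admissible range [-1, 1]."""
    return max(-1.0, min(1.0, x))
```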
7.4.1.2 Choice of Initial Estimate
In Shirai’s method, the importance of an initial estimate is stressed, as the
stability and speed of convergence are very much dependent on this. Apart from
the first frame of speech, the initial estimate is taken as the estimate of the
previous frame, which as well as being the most likely position, also ensures
continuity in vocal tract shapes. For the first frame however, the starting value is
obtained using a piecewise-linear estimate, obtained from regression analysis, as
discussed in Section 7.2.1. For the ASY model here, it was decided to look up
the linked codebook, generated in Section 6.7, to first find the closest acoustic
match, according to the covariance measure, and then the corresponding articulatory
position.
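The look-up of an initial estimate can be sketched as a nearest-neighbour search over the linked codebook. The squared Euclidean distance used below is only a stand-in, since the covariance measure of Chapter 6 is not reproduced in this chapter; the array layout (one codebook entry per row, rows of the two codebooks linked by index) is likewise an assumption.

```python
import numpy as np

def initial_estimate(y, acoustic_codebook, articulatory_codebook):
    """Return the articulatory vector linked to the closest acoustic entry.

    acoustic_codebook: array of shape (K, M), one acoustic vector per row
    articulatory_codebook: array of shape (K, N), row i linked to acoustic row i
    Distance: squared Euclidean (placeholder for the covariance measure).
    """
    distances = np.sum((acoustic_codebook - y) ** 2, axis=1)
    idx = int(np.argmin(distances))
    return articulatory_codebook[idx], idx
```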
7.4.1.3 Choice of Weighting Matrices
Following Shirai, the matrix R was taken as the identity matrix, to give equal
weighting to each acoustic coefficient. The choice of the other two diagonal
weighting matrices, Q and T, was somewhat arbitrary, but from examination of
the values used by Shirai, it was decided to base them loosely on the relative
sensitivity of each articulatory parameter. From Shirai, it was observed that, in
general, the weight chosen for each parameter was inversely
proportional to its sensitivity. For example, the N x N matrix Q is taken to be
$$ Q = \mathrm{diag}(0.125,\ 0.125,\ 0.33,\ 0.1,\ 0.5,\ 0.25) \tag{7.9} $$
where the articulatory parameters are, in order, tongue centre, tongue angle, tongue
tip angle, jaw, lip protrusion and lip height. The values correspond directly to
the inverse of the number of increments used in Section 6.6.1 for computing the
articulatory codebook. The matrix T, which weights the change in articulatory
parameters between frames, is similarly derived, but is given more
emphasis than Q.
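The construction of the diagonal weights can be sketched as follows. The increment counts are not quoted in this section; the values below are inferred as the reciprocals of the diagonal entries of Q and should be read as an assumption. The emphasis factor applied to T (named `T_w` in the code) is purely hypothetical, since only "more emphasis than Q" is stated.

```python
import numpy as np

PARAMS = ["tongue centre", "tongue angle", "tongue tip angle",
          "jaw", "lip protrusion", "lip height"]

# Assumed number of increments per parameter (inferred from 1/weight).
INCREMENTS = [8, 8, 3, 10, 2, 4]

def weight_matrix(increments, emphasis=1.0):
    """Diagonal matrix whose entries are emphasis / (number of increments)."""
    return emphasis * np.diag(1.0 / np.asarray(increments, dtype=float))

Q = weight_matrix(INCREMENTS)
T_w = weight_matrix(INCREMENTS, emphasis=2.0)  # factor 2.0 is a hypothetical choice
```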
7.4.1.4 Computation of derivative of h(x)
As discussed earlier, the derivative of h(x) cannot be obtained analytically.
Therefore, a small value of $\Delta x$ was taken, i.e. 0.01 (normalized). In addition, the
derivative was taken so as to avoid extreme values, particularly as initial estimates
were at the extremities for some articulatory parameters. Therefore, in order not
to go outside the range, $\delta h(x)$ was defined as
$$ \delta h(x) = h(x + \Delta x) - h(x), \qquad -1 \le x \le 0 \tag{7.10a} $$
and
$$ \delta h(x) = h(x) - h(x - \Delta x), \qquad 0 < x \le 1 \tag{7.10b} $$
For each articulatory vector estimate, this derivative was computed for each
direction in the articulatory space, and the corresponding change in each acoustic
parameter was obtained, resulting in the M xN matrix of eqn. 7.8.
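The one-sided differences can be sketched as below. The callable `h` is a placeholder for the synthesis-plus-analysis chain, and dividing each difference by the step size to obtain a derivative is an assumption, as is assigning x = 0 to the forward difference.

```python
import numpy as np

def jacobian(h, x, dx=0.01):
    """M x N matrix of one-sided finite differences of h around x.

    A forward difference is used for x_j <= 0 and a backward difference for
    x_j > 0, so that h is never evaluated outside the range [-1, 1].
    """
    x = np.asarray(x, dtype=float)
    hx = np.asarray(h(x), dtype=float)
    J = np.zeros((hx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = dx
        if x[j] <= 0.0:
            J[:, j] = (np.asarray(h(x + step), dtype=float) - hx) / dx
        else:
            J[:, j] = (hx - np.asarray(h(x - step), dtype=float)) / dx
    return J
```

Each column costs one extra evaluation of h, i.e. one further synthesis and analysis per articulatory parameter at every iteration.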
7.4.1.5 Choice of Acoustic Parameters
The acoustic parameters recommended by Shirai were the cepstrum coefficients, as
a change in the cepstrum coefficients is equivalent to a change in the log spectra,
from their definition.
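For reference, cepstrum coefficients can be obtained from LPC coefficients by a standard recursion, sketched below. The sign convention is an assumption: the all-pole filter is written as H(z) = 1 / (1 - sum of a_k z^-k), and the gain term c_0 is omitted. The function name is illustrative.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum coefficients c_1..c_n of H(z) = 1 / (1 - sum_k a[k-1] z^-k).

    Recursion: c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k},
    with a_n taken as zero for n greater than the predictor order.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is unused (gain term omitted)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single pole at z = a, the recursion reproduces the known result c_n = a^n / n.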
7.4.1.6 Convergence Criteria
The value of the convergence parameter is critical, as discussed above. As there
were no strict guidelines, apart from a general upper limit of 1.0, it was decided to
make it very small initially, i.e. 0.05, to avoid divergence and instability. As
regards the definition of convergence, a threshold on the total mean
squared error of the cepstrum coefficients was proposed, i.e. 0.1. The iteration was
also limited to a maximum of 100 iterations.
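The stopping rules can be sketched as an iteration loop. This is an illustration only: `h` and `jac` are placeholder callables for the synthesis-plus-analysis chain and its derivative, the error is interpreted as the mean squared difference over the acoustic coefficients, the bracketed matrix is assumed to include both regularising matrices, and clipping the estimate to [-1, 1] is an assumption rather than a documented step.

```python
import numpy as np

def estimate(y, h, jac, x0, x_prev, R, Q, T_w,
             lam=0.05, tol=0.1, max_iter=100):
    """Iterate until the mean squared acoustic error drops below tol.

    Returns (x, iterations_used, converged).
    """
    x = np.array(x0, dtype=float)
    for i in range(max_iter):
        hx = h(x)
        mse = float(np.mean((y - hx) ** 2))
        if mse < tol:
            return x, i, True
        H = jac(x)
        A = H.T @ R @ (y - hx) - Q @ x + T_w @ (x_prev - x)
        B = H.T @ R @ H + Q + T_w
        x = x + lam * np.linalg.solve(B, A)
        x = np.clip(x, -1.0, 1.0)  # assumed safeguard for the admissible range
    return x, max_iter, False
```

With a stepsize as small as 0.05, each iteration removes only a small fraction of the remaining error, so the limit on the number of iterations can be reached before the threshold when the initial estimate is poor.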
7.5 Analysis Results and Possible Improvements
What is obvious from the preceding discussion is that there is a large number of
choices and variables in this algorithm. It proved no trivial matter to establish
optimum conditions for the convergence, and in fact no convergence could be
obtained. This was not very surprising, for several reasons.
Perhaps the most important factor is that the mean vocal tract length should be
matched initially for each speaker. Methods for removing the effects of individual
speakers have been proposed [56], which should be investigated. This
normalization would also have an effect on the determination of the weighting
matrices. A fairly large number and cross-section of speakers would be required
to do this.
Another problem could be the acoustic distance measure used in the algorithm.
In Chapter 6, it was found that the best possible distortion measure was the
covariance measure. This would essentially involve replacing a weighting matrix
with the variable covariance matrix, although a new algorithm would need to be
derived, as this measure would not exactly fit in with its present form. Another
idea, which would be easier to implement, would be the log area ratio measure,
also recommended in Chapter 6.
The ratio of the number of acoustic to articulatory parameters, M:N, should also be
considered. For uniqueness of solution, it is recommended that the number of
acoustic parameters should exceed that of the articulatory parameters by as much
as possible. The current ratio is 8:6. By increasing the sampling rate, or
alternatively by using linear combinations of the acoustic parameters, this ratio
could be increased. In fact, Shirai used twelve cepstrum coefficients in his
analysis.
Finally, it is obvious that an independent study of general convergence techniques,
beyond the scope of the current research, would be very beneficial. Other
algorithms exist, notably those of Charpentier [55], Levinson et al. [54] and Atal
et al. [52]. Atal uses a table search, followed by optimization, similar to that
proposed here. Levinson uses an unconstrained optimization, starting at the neutral
shape. Using the neutral shape as an initial estimate was tried here, both for a
general unconstrained optimization and the method described above. However, the
results were no better. Charpentier, on the other hand, also uses a table search,
but in the table construction, incorporates an analysis of curvature, and
concentrates on the highly non-linear regions. This would be a possibility here,
for obtaining a more accurate initial estimate, as it was found that the initial
articulatory estimate was very dependent on the distortion measure used to extract
it from the look-up table.
CHAPTER 8 DISCUSSION, IMPROVEMENTS AND CONCLUSIONS
8.1 Convergence Techniques
The complete implementation of an articulatory vowel vocoder is outside the scope
of this thesis. The main reason for this is the difficulty encountered in the
generation of an articulatory codebook using a training set of real speech. This
involved using convergence techniques to solve the inverse problem of the vocal
tract, as described in Chapter 7. Methods for improving the analysis in order to
achieve convergence have already been discussed extensively there, and will not be
repeated here. However, it must be said that a large amount of data from
various speakers would need to be analysed in order to establish optimum
conditions. Apart from the fact that such data was not readily available, each iteration
of the convergence requires a synthesis from an articulatory estimate, and a full
CGI analysis. This is very computationally intensive, and would not be possible
with available resources.
8.2 Bit Rates
The emphasis in this thesis has been on improving the quality of speech while
still retaining low bit rates. Once a linked articulatory-acoustic codebook is
generated, the number of bits required to represent each shape vector would be
comparable to conventional coding of LPC parameters, i.e. for a codebook of, say,
1024 shape vectors, 10 bits would be required. As the vocal tract changes
slowly, the parameters would only need to be updated, for example, every third
pitch frame, as already discussed in Chapter 5. As discussed in Section 6.9, the
glottal parameters could be quite crudely quantized, say a maximum of 4 bits
each, if scalar quantization were used. Assuming a parameter update approximately
every 20 ms, bit rates as low as 1200 b/s could theoretically be achieved. Of
course, the effect of vector quantization on speech quality would need to be
investigated.
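The arithmetic behind these figures can be sketched as a small bit budget. Only the codebook size, the 4-bit limit, the 20 ms update and the 1200 b/s target come from the text; how the remaining bits are divided among glottal or other excitation parameters is not specified, so the last line gives only an upper bound under that 4-bit limit.

```python
import math

def shape_bits(codebook_size):
    """Bits needed to index one entry of a codebook of the given size."""
    return math.ceil(math.log2(codebook_size))

def bits_per_update(bit_rate_bps, update_ms):
    """Whole bits available for each parameter update at a given bit rate."""
    return bit_rate_bps * update_ms // 1000

shape = shape_bits(1024)               # 10 bits for the shape vector index
budget = bits_per_update(1200, 20)     # 24 bits per 20 ms update at 1200 b/s
remaining = budget - shape             # 14 bits left for the other parameters
max_4bit_params = remaining // 4       # at most 3 parameters at 4 bits each
```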
8.3 Resources and Computation
An example of the resources required for clustering can be seen in the work
carried out by Schroeter et al. [], where it took 1000 hours of CPU time on a
super-minicomputer to cluster 10,000 shapes, using the modified K-means
algorithm. Sub-optimal solutions could also be found, for a lot less computation,
but it is obvious that a huge amount of overall computation is required for the
clustering and iteration processes. Fortunately, all this is only done on a once-off
basis. More of a problem is the amount of computation involved in the CGI
analysis. Methods for reducing this were discussed in Chapter 5, including the
need to update the covariance matrix only one row at a time for sequential
covariance analysis. Also, since the algorithm for extracting the closure region
only examines the normalized mean squared error from the first half of the pitch
frame, the sequential analysis does not need to be done for the whole frame, thus
reducing computation considerably. The effect on quality of only analysing every
third pitch frame for parameter extraction instead of picking the best of three
could also be examined. All these factors contribute to making the vocoder a
viable proposition. The possibility of converting ASY to run in real time should
also be investigated.
8.4 Recording Conditions
A feature of any coder is that it should be robust. However, great care must be
taken in recording speech for CGI analysis. A linear-phase system is required, as
otherwise the glottal waveform will be seriously degraded. The most serious
degradation of results is caused by analogue tape distortion. This can be
overcome by digitizing the speech directly, and with the advent of digital tapes, it
may not be a problem in future. In addition, a low noise microphone with good
low frequency resolution is required (e.g. an electret microphone). Ambient noise
should be kept to a minimum.
Methods for compensating for phase distortion have been tried, with some success.
Veeneman et al. [57] used a prerecorded calibration signal to characterize the
recording system, and hence design compensating filters. For analysing the true
effect of distortion on both the glottal waveform and the transfer function
obtained, equipment is required which would record the glottal opening and
closing, e.g. an electroglottograph (EGG).
8.5 Sampling Rate
The advantages of using a higher sampling rate, e.g. 10 kHz, for articulatory
coding were already discussed in Chapters 6 and 7. Increasing the sampling rate
should also help the CGI analysis, as there will be more speech samples in the
closed glottis interval. Thus, the analysis interval length could also be increased,
which would in turn increase stability, as in general, the longer the interval for
covariance analysis, the lower the likelihood of instability. This would be particularly
useful in areas of slight constriction in the vocal tract, where the CGI analysis
would not be expected to work as well, as the assumption of plane wave
propagation is not as justified.
8.6 Limitations of CGI / Alternatives
Due to the sensitivity of the CGI analysis to the recording conditions, its accepted
failure for high pitched voices (not enough samples in the closed glottis interval),
and also the larger amount of computation involved in comparison to conventional
LPC techniques, it is felt that more research should be done into adaptive inverse
filtering techniques. These techniques would also be more adaptable to other
types of speech sounds than the CGI analysis. However, it was seen that
currently there is a wide difference in results obtained from the two classes of
methods, the CGI analysis being far superior. While it would be expected that
CGI analysis would extend easily to semivowels (i.e. liquids, such as /r/ and /l/,
and glides, such as /w/ and /y/), which are essentially voiced sounds, it could
not be expected to work for consonants. A possibility would be to include some sort
of binary voicing decision at the speech input to decide if CGI analysis is
appropriate. Then unvoiced sounds could be dealt with using a different analysis.
8.7 Completion of the Articulatory Vocoder
As mentioned in Section 8.1, the main task still outstanding is to perfect the
convergence technique used to extract the articulatory parameters from the speech
wave. The next step is clustering of the resulting articulatory shapes, using a
reliable algorithm, followed by the generation of a glottal codebook in a similar
manner. Once the codebooks are completed, the overall quality can be assessed,
and the suitability of the proposed covariance distortion measure evaluated.
8.8 Conclusions
An investigation into articulatory vocoding was carried out in this thesis. In
Chapter 2, the articulatory mechanism of speech production was described in
detail, along with a description of the articulatory synthesiser to be used in the
research. This was followed in Chapter 3 by an acoustic tube model
representation of the vocal tract, and its corresponding transfer function was
derived, using Portnoff’s equations. In Chapter 4 an investigation into Linear
Predictive Coding, a time domain method, which is now the most popular of
speech analysis methods, was carried out. Various methods of LPC were
compared, and algorithms for the solution of the autocorrelation and covariance
methods presented. Using the lattice method, a correlation between the reflection
coefficients of LPC and those of the acoustic tube model was derived.
The above results were used in Chapter 5, where methods of extracting the vocal
tract shape from the speech wave were investigated. These methods were
primarily based on modifications of the LPC technique, and were concerned with
removing the glottal and radiation characteristics which are not removed by
standard LPC. Algorithms were first presented for the autocorrelation method,
using four different types of preemphasis. Three types were asynchronous
methods, i.e. first order preemphasis, adaptive first order preemphasis and adaptive
multi-order preemphasis. The last was a pitch synchronous method, based on
detecting glottal closure. The methods were compared by applying glottal inverse
filtering to the vocal tract transfer function obtained. The glottal waveform
extracted was examined to see how closely it resembled a smooth idealized pulse,
an indication of the analysis accuracy. Of all these autocorrelation based
preemphasis methods, it was shown that the best results were obtained for the
straight-difference first order method, as all the rest were found to impose too
much preemphasis.
The second type of method, known as CGI analysis, was based on the covariance
method of LPC and on the detection of the closed glottis interval
using the normalized mean squared error obtained from sequential covariance
analysis. Drawbacks with existing methods were illustrated, and a new robust
algorithm for detecting the appropriate location was derived, from close
examination of the error waveforms. A method for extracting the glottal
parameters from the glottal waveform was also presented. It was shown that, in
all cases, the CGI method yielded superior results to the first order preemphasis
autocorrelation method. This was apparent in both the glottal waveforms
extracted, and the values of the bandwidths obtained. The smaller CGI
bandwidths agreed more with general trends of measured bandwidths [28].
The CGI method was used in Chapter 6 for the acoustic analysis part of the
articulatory vocoder. In this Chapter, the concept of a linked codebook of
articulatory-acoustic parameters was introduced, and two methods for generating
such a codebook were proposed, one based on synthetic speech, and one based on
real speech. An evaluation of suitable distortion measures was carried out in the
acoustic domain, and it was decided that the best distortion measure to use was a
covariance measure, proposed here as a modification of the existing Itakura-Saito
measure for the autocorrelation method of LPC. The disadvantages of using an
articulatory codebook derived from synthetic speech were discussed, and a method
for constructing one based on real speech was proposed.
To construct an articulatory codebook from real speech, a method is required
which extracts articulatory parameters from the speech wave. Chapter 7
investigated an algorithm developed by Shirai for this purpose, and modified it for
the ASY model. As useful results were not obtained from the algorithm, the
difficulties of such an analysis were detailed. Various improvements were
suggested, which if implemented, should eventually result in a robust algorithm.
To conclude, this thesis proposed an accurate method of extracting useful acoustic
parameters from the speech wave, for use in an articulatory vocoder. A full
design for an articulatory vocoder, with the articulatory codebook based on
synthetic speech, was presented. The need to improve on the estimation of
articulatory parameters from the speech wave was stressed, in order to generate a
codebook from a training set of real data.
REFERENCES
1. J.N. Holmes, "Formant synthesisers - Cascade or Parallel", Speech Comm.,
Vol. 2, pp. 251-273.
2. B.S. Atal and M.R. Schroeder, "Adaptive Predictive Coding of Speech
Signals", Bell Syst. Tech. J., Vol. 49, pp. 1973-1987, Oct. 1970.
3. J.L. Flanagan, "Speech Analysis, Synthesis and Perception", New York:
Springer Verlag, 1972.
4. M. Sondhi and J. Schroeder, "A Non-Linear Articulatory Speech Synthesiser
using both Time and Frequency domain elements", Proc. IEEE ICASSP,
Tokyo, Japan, Vol. 3, pp. 1999-2002, 1986.
5. F. Fallside and W. Woods, "Computer Speech Processing", Englewood Cliffs,
New Jersey: Prentice Hall, 1985.
6. D.H. Klatt, "Synthesis by Rule of Consonant-Vowel Syllables", Speech Comm.
Group Working Papers, Vol. 3, MIT, Cambridge, MA, pp. 93-104.
7. K. Ishizaka and J.L. Flanagan, "Synthesis of Voiced Sounds from a Two
Mass Model of the Vocal Cords", Bell Syst. Tech. J., Vol. 51(6), pp.
1233-1268, 1972.
8. H. Wakita and G. Fant, "Towards a better Vocal Tract Model", Speech
Transmission Lab. Quarterly Progress and Status Report (STL-QPSR), Vol. 1,
RIT, Stockholm, Sweden, pp. 9-29, 1978.
9. J.L. Flanagan and L. Landgraf, "Self oscillating source for Vocal Tract
Synthesisers", IEEE Trans. Audio and Electroacoust., Vol. AU-16, pp. 57-64,
1968.
10. H. Fujisaka and M. Ljungvist, "Proposal and evaluation of models for the
glottal source waveform", Proc. IEEE ICASSP, pp. 1605-1608, Tokyo, 1986.
11. K.N. Stevens and A.S. House, "Development of a quantitative description of
vowel articulation", JASA, Vol. 27, pp. 484-493, 1955.
12. G. Fant, "Acoustic Theory of Speech Production", Mouton, The Hague, 1970.
13. J.L. Kelly and C. Lochbaum, "Speech Synthesis", Proc. Stockholm Speech
Comm. Seminar, RIT, Stockholm, Sweden, September 1962.
14. C.H. Coker, "A Model of Articulatory Dynamics and Control", Proc. IEEE,
Vol. 64, No. 4, pp. 451-460, April 1976.
15. P. Mermelstein, "Articulatory Model for the study of Speech Production",
JASA, Vol. 53, No. 4, pp. 1070-1082, 1973.
16. P. Rubin, T. Baer and P. Mermelstein, "An Articulatory Synthesiser for
Perceptual Research", Haskins Lab. Status Report on Speech Research,
SR-57, pp. 1-16, 1979.
17. J.S. Perkell, "Physiology of Speech Production: Results and Implications of a
Quantitative and Cineradiographic Study", MIT, Cambridge, MA, 1969.
18. A.E. Rosenberg, "Effect of glottal pulse shape on the quality of natural
vowels", JASA, Vol. 49, No. 2, Part 2, pp. 583-590, 1971.
19. M.R. Portnoff, "A Quasi-One-Dimensional Digital Simulation for the
time-varying Vocal Tract", M.S. Thesis, Dept. of Electrical Engineering, MIT,
Cambridge, MA, June 1973.
20. H. Wakita, "Direct Estimation of the Vocal Tract Shape by Inverse Filtering
of Acoustic Speech Waveforms", IEEE Trans. Audio and Electroacoust., Vol.
21, No. 5, pp. 417-427, Oct. 1973.
21. J.D. Markel and A.H. Gray Jr., "Linear Prediction of Speech", New York,
NY: Springer Verlag, 1976.
22. J. Makhoul, "Linear Prediction: A Tutorial Review", Proc. IEEE, Vol. 63, pp.
561-580, April 1975.
23. B.S. Atal and S.L. Hanauer, "Speech Analysis and Synthesis by Linear
Prediction of the speech wave", JASA, Vol. 50, pp. 637-655, 1971.
24. J. Makhoul, "Stable and Efficient methods for Linear Prediction", IEEE Trans.
ASSP, Vol. 25, No. 5, pp. 423-428, Oct. 1977.
25. J. Durbin, "The Fitting of Time Series Models", Rev. Int’l Statist. Inst., Vol.
28, pp. 233-244, 1960.
26. N. Levinson, "The Wiener RMS Error Criterion in Filter Design and
Prediction", J. Math. Phys., Vol. 25, pp. 261-278, 1947.
27. A.V. Oppenheim and R.W. Schafer, "Digital Signal Processing", Englewood
Cliffs, New Jersey: Prentice Hall, 1975.
28. L.R. Rabiner and R.W. Schafer, "Digital Processing of Speech Signals",
Englewood Cliffs, New Jersey: Prentice Hall, 1971.
29. B.S. Atal, "Determination of the Vocal Tract Shape directly from the Speech
Wave", JASA, Vol. 47 (A), p. 64, Jan. 1970.
30. T. Nakajima, H. Omura and S. Ishizaki, "Estimation of Vocal Tract Area
Functions by Adaptive Inverse Filtering Methods and Identification of
Articulatory Model", Proc. Speech Comm. Seminar, Stockholm, John Wiley
and Sons, 1974.
31. S. Murphy, University of Ulster, Jordanstown, Belfast (Direct communication).
32. D.Y. Wong, J.D. Markel and A.H. Gray, "Least squares glottal inverse filtering
from the acoustic speech waveform", IEEE Trans. ICASSP, Vol. 27, No. 4,
pp. 350-355, Aug. 1979.
33. H.W. Strube, "Determination of the instant of glottal closure from the speech
wave", JASA, Vol. 56, No. 5, pp. 1625-1629, 1974.
34. B.S. Atal and M.R. Schroeder, "Predictive Coding of Speech Signals and
Subjective Error Criteria", IEEE Trans. ASSP, Vol. 27, No. 3, June 1977.
35. J.N. Larar, Y.A. Alsaka and D.G. Childers, "Variability in Closed Phase
Analysis of Speech", Proc. IEEE ICASSP, pp. 1089-1092, 1985.
36. I. Mallawany, "Area Function extraction over the closed glottal interval",
Articulatory Modelling Symposium, Grenoble, July 1977.
37. H.W. Strube, "Can the Area Function of the Human Vocal Tract be
determined from the Speech Wave?", U.S. - Japan Joint Seminar on
Dynamic Aspects of Speech Production, Tokyo University Press, pp. 279-302,
1977.
38. P. Hedelin, "A glottal LPC vocoder", Proc. IEEE ICASSP, pp. 6-10, 1984.
39. J.N. Holmes, "Formant Excitation before and after Closure", Proc. IEEE
ICASSP, pp. 39-42, April 1976.
40. A.S. Anath, D.G. Childers and B. Yegnanarayana, "Measuring source - tract
interaction from speech", Proc. IEEE ICASSP, pp. 1093-1096, 1985.
41. B. Cramen and L. Boves, "Aerodynamic Aspects of Voicing: Glottal Pulse
Skewing Revisited", Proc. IEEE ICASSP, pp. 1093-1096, 1985.
42. R.M. Gray, "Vector Quantization", IEEE ASSP Magazine, April 1984.
43. J. Makhoul, "Vector Quantization in Speech Coding", Proc. IEEE, Vol. 73,
No. 11, Nov. 1985.
44. Y. Linde, A. Buzo and R.M. Gray, "An Algorithm for Vector Quantizer
Design", IEEE Trans. Commun., Vol. 28, No. 1, pp. 84-95, Jan. 1980.
45. A. Buzo, A.H. Gray, R.M. Gray and J.D. Markel, "Speech Coding based on
Vector Quantization", IEEE Trans. ASSP, Vol. 30, No. 2, pp. 294-303, April
1982.
46. S. Saito and K. Nakata, "Fundamentals of Speech Signal Processing", Florida:
Academic Press, 1985.
47. K. Shirai and T. Kobayashi, "Estimating Articulatory Motion from Speech
Wave", Speech Comm. 5, pp. 159-170, 1986.
48. F. Itakura, "Minimum Prediction Residual Principle Applied to Speech
Recognition", IEEE Trans. ASSP, Vol. 23, No. 1, pp. 67-72, Feb. 1975.
49. J. Schroeter, J.N. Larar and M.M. Sondhi, "Speech Parameter Estimation
using a Vocal Tract / Cord Model", Proc. IEEE ICASSP, pp. 308-311, 1987.
50. R. Kuc, F. Tuteur and J.R. Vaisnys, "Determining Vocal Tract Shape by
Applying Dynamic Constraints", Proc. IEEE ICASSP, pp. 1101-1104, 1985.
51. B.S. Atal, "Towards Determining Articulator Positions From the Speech
Signal", Speech Comm. Seminar, Stockholm, Sweden, pp. 1-9, Aug. 1974.
52. B.S. Atal, J.J. Chang, M.V. Matthews and J.W. Tukey, "Inversion of
articulatory-to-acoustic transformation in the Vocal Tract by a Computer
Sorting Technique", JASA, Vol. 63, No. 5, pp. 1535-1550, May 1978.
53. M. Sondhi and J.R. Resnick, "The inverse problem of the Vocal Tract:
Numerical Methods, Acoustical Experiments, and Speech synthesis", JASA,
Vol. 73, No. 3, pp. 985-1002, March 1983.
54. S.E. Levinson and C.E. Schmidt, "Adaptive Computation of Articulatory
Parameters from the Speech Signal", JASA, Vol. 74, No. 4, pp. 1145-1154,
Oct. 1983.
55. F. Charpentier, "Determination of the Vocal Tract Shape from the formants
by analysis of the Articulatory-to-Acoustic Non-Linearities", Speech Comm.,
Vol. 3, pp. 291-308, 1984.
56. K. Shirai and M. Honda, "Estimation of Articulatory Motion from Speech
Waves, and its application to Automatic Recognition", Spoken Language
Generation and Understanding, pp. 87-99, D. Reidel Publishing Co., 1980.
57. D.E. Veeneman and S.L. BeMent, "Automatic glottal inverse filtering from
speech and electographic signals", IEEE Trans. ASSP, Vol. 33, No. 2, April
1985.