Speech Processing
Production and Classification of Speech Sounds
April 22, 2023, Veton Këpuska
Introduction
Simplified view of speech production (see Figure 3.1 on the next slide):
- Lungs: act as a power supply and provide airflow to the larynx stage.
- Larynx: modulates the airflow and provides either a periodic puff-like airflow or a noisy airflow to the vocal tract.
- Vocal tract: gives the modulated airflow its "color" (spectrally shaping the source) with the oral, nasal, and pharynx cavities.
Figure 3.1
Introduction
Sound sources can also be generated by constrictions and boundaries that are made within the vocal tract itself:
- Periodic source
- Noisy source
- Impulsive airflow source
Note that the speech production mechanism does not generate a perfectly periodic, impulsive, or noisy source.
Three general categories of the source for speech sounds:
1. Periodic
2. Noisy
3. Impulsive
Illustration of each in the word "shop": "sh" is noisy, "o" is periodic, "p" is impulsive.
Example of "shop": noise-like signal, periodic source, impulse source.
Introduction
Distinguishable speech sounds are determined not only by the source but also by different vocal tract configurations, and by combinations of both.
Speech sound classes are referred to as phonemes.
- Phonemics is the discipline that studies phoneme realizations (e.g., in a language).
- Each phoneme class provides a certain meaning in a word.
- Within a phoneme class there exist many sound variations that provide the same meaning (allophones). The study of these sound variations is called phonetics.
Phonemes are the basic building blocks of a language. They are concatenated (more or less) as discrete elements into words, according to certain phonemic and grammatical rules.
Introduction
This chapter will cover:
- Description of the speech production mechanism
- The resulting variety of phonetic sound patterns
- How these sounds differ among different speakers
Anatomy and Physiology of Speech Production
Introduction
Anatomy and Physiology of Speech Production
The anatomy of speech production is shown in Figure 3.2.
Lungs
- Inhalation and exhalation of air.
- Connected to the vocal tract through the trachea ("windpipe"), a pipe about 12 cm long and 1.5 to 2 cm in diameter, and the epiglottis.
- During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production:
  - The duration of exhalation becomes roughly equal to the length of the sentence or phrase.
  - Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.
Anatomy and Physiology of Speech Production
Larynx
- A complicated system of cartilages, flesh, muscles, and ligaments.
- Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
- The vocal folds are about 15 mm long in men and about 13 mm long in women.
Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
- Breathing: the arytenoid cartilages are held outward.
- Voiced: the arytenoid cartilages are held close together.
- Unvoiced: the arytenoid cartilages are held outward or partially closed.
The complex motion of the vocal folds is illustrated in Figure 3.4.
Nonlinear two-mass model of Flanagan et al. (Figure 3.5).
Dictionary: arytenoid (ar·y·te·noid), adjective. Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, "ladle". Date: circa 1751. 1: relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. 2: relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Also used as a noun.
Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:
- Closed phase: the folds are closed and no flow occurs.
- Open phase: the folds are open and the flow increases up to a maximum.
- Return phase: the time interval from the maximum airflow until the glottal closure.
The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.
The airflow velocity at the glottis is referred to as the glottal flow.
The time duration of one glottal cycle is referred to as the pitch period.
The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
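The three phases of a glottal cycle and the relation between pitch period and pitch can be sketched numerically. This is only an illustration: the sampling rate `fs`, the phase lengths, and the piecewise-linear pulse shape are assumptions chosen for the example, not values from the slides or from Figure 3.6.

```python
def pitch_hz(period_samples, fs):
    """Pitch (fundamental frequency) in Hz: the reciprocal of the pitch period."""
    return fs / period_samples


def glottal_cycle(closed_len, open_len, return_len):
    """One cycle of a toy piecewise-linear glottal flow (hypothetical shape)."""
    closed = [0.0] * closed_len                                         # closed phase: no flow
    opening = [(i + 1) / open_len for i in range(open_len)]             # open phase: rises to 1.0
    closing = [1.0 - (i + 1) / return_len for i in range(return_len)]   # return phase: falls to 0.0
    return closed + opening + closing


fs = 8000                                   # assumed sampling rate in Hz
cycle = glottal_cycle(closed_len=30, open_len=40, return_len=10)
period = len(cycle)                         # pitch period in samples
print(period, pitch_hz(period, fs))         # 80 samples and 100.0 Hz
```

Shortening any of the three phases shortens the pitch period and therefore raises the pitch.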
Example 3.1
Consider a glottal flow waveform model of the form
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:
$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$
Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as
$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$
$$U(\omega, \tau) = \frac{1}{P} \left[ \sum_{k=-\infty}^{\infty} G(\omega_k)\, \delta(\omega - \omega_k) \right] \circledast W(\omega, \tau)$$
$$U(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.
Figure 3.7 also shows the effect of the harmonics of the glottal waveform on the spectrum.
Figure 3.7
Example 3.1
- A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
- The first harmonic is also the fundamental frequency.
- At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
- The magnitude of the spectral shaping function, i.e., of the glottal flow transform $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
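The harmonic structure of Example 3.1 can be checked numerically. The following is a minimal sketch under stated assumptions: the pitch period `P`, the segment length `N`, and the five-sample pulse `g` are hypothetical values, and a Hamming window with a direct DFT sum stands in for $w[n, \tau]$ and the Fourier transform.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w (a common tapered analysis window)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def dft_magnitude(x):
    """Magnitude of the DFT of x for bins 0 .. len(x)//2 - 1 (direct O(N^2) sum)."""
    n_pts = len(x)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * n / n_pts) for n, v in enumerate(x)))
        for k in range(n_pts // 2)
    ]


P = 40                                # pitch period in samples (assumed)
N = 400                               # segment length: 10 pitch periods
g = [0.25, 0.5, 1.0, 0.5, 0.25]       # toy glottal pulse over one cycle (assumed)

# u[n] = g[n] * p[n]: one copy of g at every multiple of P
u = [0.0] * N
for start in range(0, N, P):
    for i, v in enumerate(g):
        u[start + i] += v

segment = [w * v for w, v in zip(hamming(N), u)]   # windowed segment
mag = dft_magnitude(segment)

# Local maxima that stand clearly above the floor are the harmonic lines
peaks = [k for k in range(1, len(mag) - 1)
         if mag[k] > mag[k - 1] and mag[k] > mag[k + 1] and mag[k] > 0.05 * max(mag)]
print(peaks[:5])   # bins are multiples of N / P = 10, i.e. omega_k = (2*pi/P) * k
```

Halving `P` doubles the spacing between the detected peaks, and the peak heights follow the envelope of the pulse transform.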
Anatomy and Physiology of Speech Production
The Fourier transform of a periodic glottal waveform is characterized by harmonics.
- Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
- The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
- The pitch period can vary "randomly" over successive periods: pitch "jitter".
- The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
These variations are (perhaps) due to:
- Time-varying characteristics of the vocal tract and vocal folds.
- Nonlinear behavior in the speech anatomy.
- An underlying deterministic (chaotic) system whose output only appears random.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
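Jitter and shimmer can be quantified in several ways. The sketch below uses one common cycle-to-cycle measure (mean absolute difference between consecutive values divided by their mean); this choice, the nominal period, and the perturbation ranges are assumptions for illustration and are not taken from the slides.

```python
import random


def relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by their mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


rng = random.Random(0)                      # fixed seed so the sketch is repeatable
nominal_period = 80                         # samples (assumed)
periods = [nominal_period + rng.choice([-1, 0, 1]) for _ in range(50)]
amplitudes = [1.0 + rng.uniform(-0.05, 0.05) for _ in range(50)]

jitter = relative_perturbation(periods)      # cycle-to-cycle pitch period variation
shimmer = relative_perturbation(amplitudes)  # cycle-to-cycle amplitude variation
print(f"jitter = {100 * jitter:.2f} %, shimmer = {100 * shimmer:.2f} %")

# A perfectly periodic ("machine-like") source has neither
print(relative_perturbation([nominal_period] * 50))   # 0.0
```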
Anatomy and Physiology of Speech Production
States of the vocal folds: breathing, voicing, and unvoicing.
- Unvoicing: turbulence at the vocal folds produces aspiration. Examples: "he", whispered sounds.
- Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates and part is nearly fixed.
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Examples of atypical voice types
Vocal Tract
- Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
- The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
- The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
- The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.
Spectral Shaping
- Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
- The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
- Formants (resonance frequencies) change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
- The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
- Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  - The frequency of the formant is $\omega_0$.
  - The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
- Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^* z^{-1})}$$
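The link between a pole $z_0 = r_0 e^{j\omega_0}$ and a formant can be made concrete with a single conjugate pole pair. The sketch below is an illustration only: the sampling rate, the formant frequency, and the bandwidth are hypothetical, and the bandwidth-to-radius relation is the common approximation $B \approx -\ln(r_0)\, f_s / \pi$ for poles near the unit circle, which the slides do not state.

```python
import cmath
import math


def pole_from_formant(freq_hz, bw_hz, fs):
    """Pole z0 = r0 * exp(j*w0) for a formant at freq_hz with bandwidth bw_hz.

    Uses the common approximation bw_hz ~ -ln(r0) * fs / pi, valid for r0 near 1.
    """
    r0 = math.exp(-math.pi * bw_hz / fs)
    w0 = 2 * math.pi * freq_hz / fs
    return cmath.rect(r0, w0)


def resonator_gain(pole, freq_hz, fs):
    """|H| at freq_hz for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)), c = pole."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))


fs = 8000                                        # assumed sampling rate in Hz
c = pole_from_formant(500.0, 60.0, fs)           # hypothetical first formant
gains = [resonator_gain(c, f, fs) for f in range(0, fs // 2)]   # 1-Hz grid
peak_hz = max(range(len(gains)), key=gains.__getitem__)
print(abs(c) < 1.0, peak_hz)    # stable pole; spectral peak close to 500 Hz
```

Moving the pole closer to the unit circle (smaller bandwidth) makes the spectral peak sharper without moving it appreciably.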
Spectral Shaping
- The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
- In general, the formant frequencies decrease as the vocal tract length increases:
  - Male speakers tend to have lower formants than female speakers.
  - Female speakers tend to have lower formants than children.
- Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained:
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} H(\omega) G(\omega)\, \delta(\omega - \omega_k) \right]$$
$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).
Example 3.2
- The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
  - The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing).
  - The manner in which formant tails add.
- Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
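The sparse sampling of the envelope by the harmonics can be illustrated with a toy all-pole envelope. In this sketch the two formants, their bandwidths, the sampling rate, and the two pitch values are hypothetical, and the glottal contribution $G(\omega)$ is left out (treated as flat) to keep the example short.

```python
import cmath
import math


def envelope(freq_hz, formants, fs):
    """All-pole spectral envelope: product of two-pole resonator gains."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    gain = 1.0
    for f_hz, bw_hz in formants:
        pole = cmath.rect(math.exp(-math.pi * bw_hz / fs), 2 * math.pi * f_hz / fs)
        gain /= abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))
    return gain


def strongest_harmonic(f0_hz, formants, fs):
    """Sample the envelope at the harmonics k*f0 and return the largest sample."""
    harmonics = range(f0_hz, fs // 2, f0_hz)
    return max(((f, envelope(f, formants, fs)) for f in harmonics), key=lambda t: t[1])


fs = 8000
formants = [(500, 60), (1500, 90)]             # hypothetical (frequency, bandwidth) pairs in Hz
low = strongest_harmonic(100, formants, fs)    # low pitch: a harmonic falls on F1
high = strongest_harmonic(400, formants, fs)   # high pitch: nearest harmonics are 400 and 800 Hz
print(low[0], high[0], high[1] / low[1])
```

With the low pitch the strongest harmonic sits on the first formant, whereas with the high pitch the strongest harmonic is 100 Hz away from it and much weaker, so reading the formant from the harmonic peaks becomes unreliable.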
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
- A formant corresponds to a vocal tract pole (resonant frequency).
- Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can perhaps be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
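The gain obtained by moving the first formant onto the first harmonic can be estimated with a single two-pole resonance. The numbers below (a 700 Hz sung fundamental, an untuned F1 of 400 Hz, an 80 Hz bandwidth, 8 kHz sampling) are hypothetical and serve only to show the direction and rough size of the effect.

```python
import cmath
import math


def resonator_gain_db(formant_hz, bw_hz, freq_hz, fs):
    """Gain in dB at freq_hz of a single two-pole formant resonance."""
    pole = cmath.rect(math.exp(-math.pi * bw_hz / fs), 2 * math.pi * formant_hz / fs)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    gain = 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))
    return 20 * math.log10(gain)


fs = 8000
f0 = 700.0                                         # sung fundamental (hypothetical value)
untuned = resonator_gain_db(400.0, 80.0, f0, fs)   # F1 well below the first harmonic
tuned = resonator_gain_db(f0, 80.0, f0, fs)        # F1 raised to match the first harmonic
print(f"gain at first harmonic: {untuned:.1f} dB untuned, {tuned:.1f} dB tuned")
```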
Figure 3.12
Nasal Sounds
Spectral Shaping
- The nasal and oral components of the vocal tract are coupled by the velum.
- When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
- The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: the "n" in "nose" and the "m" in "mouse".
Spectral Shaping: example "nose" (figure)
Spectral Shaping: example "mouse" (figure)
Spectral Shaping
- Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
- The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
- Two dominant effects characterize nasalization:
  - Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
  - Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
- The previous section discussed the effect of the vocal tract shape on sound production.
- Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  - Build-up of pressure behind the closure at the palate, followed by
  - An abrupt release of the pressure.
Source Generation: plosives, example "drop" (figure annotations: VOT, aspiration)
Fricatives
Source Generation
- Another sound source is created when the tongue is very close to the palate without completely blocking the airflow; this generates turbulence and thus a noise source (e.g., fricatives).
- As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  - There is no harmonic structure with these types of inputs.
  - The source spectrum is shaped at all frequencies by $|H(\omega)|$.
- Note that the spectrum of the noise was idealized by assuming it to be flat. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: fricatives, example "NASA" (figure)
Source Generation
- There is another class of source that is generated within the vocal tract; it is less understood than the noisy and impulsive sources at oral tract constrictions.
- This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound By Source
- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  - Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  - Plosives: created with an impulsive source within the oral tract. Example: "top".
  - Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
- However, these subcategories do not relate exclusively to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  - Fricatives: "zebra" (voiced) vs. "sheba" (unvoiced).
  - Plosives: "bin" (voiced) vs. "pin" (unvoiced).
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
- The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
- A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
- Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
- The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
- Example: the Hamming window,
$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$
- The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau]\, x[n]$$
represents the windowed speech segments as a function of the window center at time $\tau$.
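The sliding-window analysis can be sketched as a discrete STFT. This is a minimal illustration, not the implementation behind the figures: the window length, the frame interval, and the test tone are assumed values, the window position `tau` is taken as the start of the window, and the transform is evaluated only at the DFT frequencies $2\pi k / N_w$.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n_w - 1)) for m in range(n_w)]


def stft(x, n_w, hop):
    """Discrete STFT: one n_w-point DFT per window position tau = 0, hop, 2*hop, ...

    Returns a list of (tau, [X[k, tau] for k in 0 .. n_w//2]).
    """
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        spectrum = [
            sum(w[m] * x[tau + m] * cmath.exp(-2j * math.pi * k * (tau + m) / n_w)
                for m in range(n_w))
            for k in range(n_w // 2 + 1)
        ]
        frames.append((tau, spectrum))
    return frames


n_w, hop = 64, 32                      # window length and frame interval (assumed values)
x = [math.cos(2 * math.pi * 8 * n / n_w) for n in range(256)]   # tone centered on bin 8
frames = stft(x, n_w, hop)
peak_bins = [max(range(len(s)), key=lambda k: abs(s[k])) for _, s in frames]
print(len(frames), peak_bins)
```

A smaller `hop` gives a higher frame rate at the cost of more transforms; the window length `n_w` sets the trade-off between time and frequency resolution.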
Spectrographic Analysis of Speech
- The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
- $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
- For each window position $\tau$, one could plot $S(\omega, \tau)$ as a function of $\omega$.
- A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Wideband spectrogram (figure)
Narrowband spectrogram (figure)
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$$
$$x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
- Uses a "long" window with a duration of typically at least two pitch periods.
- Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech
Wideband Spectrogram
- The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
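The energy term $E[\tau]$ can be computed directly for a toy periodic waveform. In the sketch below the pitch period, the decaying-pulse waveform, and the 32-sample window are assumptions, and `tau` marks the start of the window rather than its center.

```python
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n_w - 1)) for m in range(n_w)]


def short_time_energy(x, n_w, tau):
    """E[tau]: energy of the windowed segment w[n, tau] * x[n] starting at tau."""
    w = hamming(n_w)
    return sum((w[m] * x[tau + m]) ** 2 for m in range(n_w))


P = 80                                            # pitch period in samples (assumed)
x = [0.9 ** (n % P) for n in range(4 * P)]        # toy voiced waveform: decaying pulse every P
n_w = 32                                          # short window: less than one pitch period

energy = [short_time_energy(x, n_w, tau) for tau in range(2 * P)]
print(max(energy) / min(energy))                  # large ratio: energy fluctuates within a period
```

Because the window is shorter than a pitch period, the energy rises and falls once per period; this fluctuation repeats every `P` samples.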
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
- Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
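The effect of the 20-ms and 4-ms Hamming windows can be compared on an idealized periodic source. This sketch makes several assumptions: 8 kHz sampling, a 5-ms pitch period (200 Hz), an impulse train as the source, window start positions chosen by hand, and the usual approximation of $8\pi/N_w$ radians for the Hamming main-lobe width.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n_w - 1)) for m in range(n_w)]


def main_lobe_hz(n_w, fs):
    """Approximate null-to-null main-lobe width of a Hamming window (8*pi/n_w rad)."""
    return 4.0 * fs / n_w


def dtft_mag(frame, freq_hz, fs):
    """Magnitude of the Fourier transform of a finite frame at one frequency."""
    return abs(sum(v * cmath.exp(-2j * math.pi * freq_hz * n / fs) for n, v in enumerate(frame)))


def between_to_on_harmonic(x, start, n_w, f0, fs, k=5):
    """Ratio of spectral magnitude midway between harmonics k and k+1 to that on harmonic k."""
    w = hamming(n_w)
    frame = [w[m] * x[start + m] for m in range(n_w)]
    return dtft_mag(frame, (k + 0.5) * f0, fs) / dtft_mag(frame, k * f0, fs)


fs, P = 8000, 40                          # assumed: 8 kHz sampling, 5-ms pitch period (200 Hz)
x = [1.0 if n % P == 0 else 0.0 for n in range(400)]      # idealized periodic source

narrow = between_to_on_harmonic(x, 0, 160, fs / P, fs)    # 20-ms window: four pitch periods
wide = between_to_on_harmonic(x, 24, 32, fs / P, fs)      # 4-ms window: one pulse in view
print(main_lobe_hz(160, fs), main_lobe_hz(32, fs))        # 200.0 Hz and 1000.0 Hz
print(narrow, wide)    # near 0: harmonics resolved; near 1: harmonic structure smeared
```

With the long window the spectrum almost vanishes between harmonics, which produces the horizontal striations; with the short window the spectrum is flat across harmonics, so only the envelope and the pulse timing remain visible.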
Figure 3.15
Figure 3.16
MATLAB: spectrogram of a speech utterance (frequency axis 0 to 8000 Hz, time axis 0 to 3 s).
MATLAB: waveform (normalized magnitude) and spectrogram (frequency [Hz] versus time [s]) of the utterance "She had your dark suit in greasy wash water all year", with word and phone labels: she (h sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr), h.
Categorization of Speech Sounds
- A sound source can be created with either the vocal folds or a constriction in the vocal tract.
- Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of the constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
- Phoneme: a fundamental distinctive unit of a language.
  - To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
- Different languages contain different phoneme sets.
  - Syllables contain one or more phonemes.
  - Words are formed from one or more syllables.
  - Phrases are concatenations of words.
- If the first two descriptors from the previous slide (the source and the vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
- If the last two descriptors (the time-domain waveform and the time-varying spectrum) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
- One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
- This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
- Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open.
  - Tongue position and height: front, central, or back along the palate.
  - Constriction: partial or complete.
  - Velum state: nasal sound or not a nasal sound.
Elements of a Language
- In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
- The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
- The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
- Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
- Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
- Articulatory differences among speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract.
- Therefore, the vocal tract formants will vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram:
  - Is dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  - These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  - The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
April 22 2023 Veton Keumlpuska 82
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source: for unvoiced fricatives the vocal folds are relaxed and not vibrating. For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction by the tongue (back, center, or front of the oral tract), the teeth, or the lips determines which sound is produced.
Spectrogram: noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
Voiced fricative: in a simplified model, the periodic component of the output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component is the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
Example 3.4 (cont.)
We assume in the simplified model that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored: u[n] is modified by the oral cavity; x_q[n] can be influenced by the back cavity; sources of nonlinear effects (distributed sources due to traveling vortices).
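The additive voiced-fricative model of Example 3.4 can be sketched numerically. The following is a minimal illustration only, not the author's implementation: the pulse shape `g`, the impulse responses `h` and `hf`, the pitch period, and the noise seed are all made-up toy values chosen so the structure of the model is easy to see.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def impulse_train(length, period):
    """p[n]: unit samples every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

# All values below are illustrative assumptions, not taken from the slides.
P = 80                      # pitch period in samples
N = 400                     # number of output samples
g = [0.0, 0.5, 1.0, 0.5]    # toy glottal flow over one cycle
h = [1.0, 0.6, 0.36]        # toy vocal tract impulse response
hf = [1.0, -0.8]            # toy front-cavity impulse response

p = impulse_train(N, P)
u = convolve(g, p)[:N]                        # u[n] = g[n] * p[n]
rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(N)]   # white noise q[n]

x_g = convolve(h, u)[:N]                                      # periodic component
x_q = convolve(hf, [qn * un for qn, un in zip(q, u)])[:N]     # noise modulated by u[n]
x = [a + b for a, b in zip(x_g, x_q)]                         # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by u[n] before filtering, the noise component vanishes wherever the glottal flow is zero, which is the modulation effect the model is meant to capture.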
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a “noisy” spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 – Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all 4 steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”. After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives – Waveform
Elements of a Language: Plosives – Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n, m].
h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n − m.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$$
We have assumed that the two outputs can be linearly combined.
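The generalized convolution with a time-varying impulse response can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `time_varying_filter`, the truncation to `max_lag` taps, and the two example responses (`drifting_decay`, `fixed_decay`) are hypothetical and are not part of the slides.

```python
def time_varying_filter(u, h, max_lag):
    """Compute x[n] = sum over m of h(n, m) * u[n - m] for 0 <= m < max_lag.

    h(n, m) is the response at time n to a unit sample applied m samples earlier.
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n + 1, max_lag)):
            acc += h(n, m) * u[n - m]
        x.append(acc)
    return x

def drifting_decay(n, m):
    # Hypothetical h[n, m]: a one-pole decay a(n)**m whose rate a(n)
    # drifts from 0.5 to 0.9 over the first 100 samples.
    a = 0.5 + 0.4 * min(n, 100) / 100.0
    return a ** m

def fixed_decay(n, m):
    # Time-invariant special case: h[n, m] does not depend on n.
    return 0.5 ** m

burst = [1.0] + [0.0] * 9   # idealized burst source, an impulse at n = 0
x_burst = time_varying_filter(burst, drifting_decay, max_lag=10)
x_fixed = time_varying_filter([1.0, 2.0, 3.0], fixed_decay, max_lag=3)
```

When h(n, m) does not depend on n, as in `fixed_decay`, the sum reduces to ordinary convolution, which is a convenient sanity check on any implementation of the time-varying form.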
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like nature with vibrating vocal folds.
Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: “hide” (Y), “out” (W), “boy” (O), “new” (JU)
Elements of a Language: Transitional Speech Sounds
Semi-Vowels: two categories of vowel-like sounds:
Glides (w as in “we” and y as in “you”) and
Liquids (r as in “read” and l as in “let”)
Glides: greater constriction of the oral tract during the transition, and greater speed of the oral tract movement, compared to diphthongs.
Figure 3.28 – Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in “chew” (unvoiced); J as in “just” (voiced)
Coarticulation
Vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, and so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
Local – articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: “horse” vs “horseshoe”, “sweep” vs “seep”
Global – articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 – Prosody
Figure 3.30 – Global Coarticulation
Introduction This chapter will cover
Description of speech production mechanism
Resulting variety of phonetic sound patterns
How these sounds differ among different speakers
Anatomy and Physiology of Speech Production
Introduction
Anatomy and Physiology of Speech Production
Anatomy of speech production is shown in Figure 3.2.
Lungs
Inhalation and exhalation of air.
Connected through the trachea (“windpipe”), a ~12-cm-long and ~1.5-2-cm-diameter pipe, and the epiglottis to the vocal tract.
During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production. The duration of exhalation becomes roughly equal to the length of the sentence/phrase. Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.
Anatomy and Physiology of Speech Production
Larynx
Complicated system of cartilages, flesh, muscles, and ligaments.
Primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3. Vocal folds are ~15 mm long in men and ~13 mm long in women.
Larynx
Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
Breathing – arytenoid cartilages are held outward
Voiced – arytenoid cartilages are held close together
Unvoiced – arytenoid cartilages are held outward or partially closed
Complex motion of the vocal folds is illustrated in Figure 3.4.
Nonlinear two-mass model of Flanagan et al. (Figure 3.5)
Dictionary: arytenoid (ar·y·te·noid). Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally ladle-shaped, from arytaina ladle. Date: circa 1751. 1: relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. 2: relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Also: arytenoid, noun.
Anatomy and Physiology of Speech Production: Flanagan et al. model
Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the obtained waveform would be approximately similar to that of Figure 3.6:
Closed phase: folds are closed and no flow occurs
Open phase: folds are open and the flow increases up to a maximum
Return phase: time interval from the maximum airflow until the glottal closure
The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.
Glottal airflow is referred to as glottal flow.
The time duration of one glottal cycle is referred to as the pitch period.
The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
Anatomy and Physiology of Speech Production
Example 3.1
Consider a glottal flow waveform model of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P:
$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$
Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by w[n, τ], is centered at time τ as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as
$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega - \omega_k) \right]$$
where ⊛ denotes convolution in frequency.
Example 3.1 (cont.)
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where W(ω, τ) is the Fourier transform of w[n, τ], G(ω) is the Fourier transform of g[n], and ω_k = (2π/P)k, where 2π/P is the fundamental frequency or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0 with lower surrounding side lobes.
Effect of the harmonics of the glottal waveform on the spectrum:
Figure 3.7
Example 3.1 (cont.)
A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform W(ω − ω_k) weighted by G(ω_k).
The magnitude of the spectral shaping function, i.e. of the glottal flow, |G(ω)|, is referred to as the spectral envelope of the harmonics.
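The relation between pitch period and harmonic spacing in Example 3.1 can be checked with a small numerical sketch. The code below is an illustration under simplifying assumptions: the glottal pulse is taken to be a unit sample (so G(ω) = 1), the window is a Hamming window spanning the whole segment, and the segment length and pitch periods are arbitrary example values.

```python
import cmath
import math

def hamming(nw):
    """Hamming window of length nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]

def dft_magnitude(x):
    """Magnitude of the DFT of x for bins 0 .. len(x)//2 (direct O(N^2) sum)."""
    n_pts = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                    for n in range(n_pts)))
            for k in range(n_pts // 2 + 1)]

def windowed_pulse_train_spectrum(period, n_pts):
    """|U| for a windowed impulse train with the given pitch period (g[n] = unit sample)."""
    w = hamming(n_pts)
    segment = [w[n] if n % period == 0 else 0.0 for n in range(n_pts)]
    return dft_magnitude(segment)

N_PTS = 320
# Harmonics fall at multiples of N_PTS / period DFT bins.
spec_long = windowed_pulse_train_spectrum(period=40, n_pts=N_PTS)    # every 8 bins
spec_short = windowed_pulse_train_spectrum(period=20, n_pts=N_PTS)   # every 16 bins
```

Halving the pitch period from 40 to 20 samples doubles the harmonic spacing from 8 to 16 bins; in between the harmonics only the window side lobes remain.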
Anatomy and Physiology of Speech Production
The Fourier transform of the periodic glottal waveform is characterized by harmonics.
Typically the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time. It can “randomly” vary over successive periods (pitch “jitter”). The amplitude of the airflow velocity within a glottal cycle may also differ across consecutive pitch periods (amplitude “shimmer”).
Those variations are (perhaps) due to time-varying characteristics of the vocal tract and vocal folds, or to nonlinear behavior in the speech anatomy, or they may appear random while being the result of an underlying deterministic (chaotic) system.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g. hoarse voice).
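Jitter and shimmer can be illustrated by perturbing the ideal impulse train p[n] of Example 3.1. The sketch below is one simple way to do this and is an assumption of this note, not a model given in the slides: each period is scaled by a uniform random factor (jitter) and each pulse amplitude by another uniform random factor (shimmer). The function name and parameter values are hypothetical.

```python
import random

def glottal_impulse_train(n_samples, period, jitter=0.0, shimmer=0.0, seed=0):
    """Impulse train with relative period perturbation (jitter) and
    relative amplitude perturbation (shimmer), both uniform in [-1, 1] * amount."""
    rng = random.Random(seed)
    train = [0.0] * n_samples
    position = 0.0
    while int(round(position)) < n_samples:
        amplitude = 1.0 + shimmer * rng.uniform(-1.0, 1.0)
        train[int(round(position))] = amplitude
        position += period * (1.0 + jitter * rng.uniform(-1.0, 1.0))
    return train

# Ideal "machine-like" source: fixed period and fixed amplitude.
ideal = glottal_impulse_train(400, 80)
# More natural source: 2% jitter and 10% shimmer (illustrative values).
natural = glottal_impulse_train(4000, 80, jitter=0.02, shimmer=0.1, seed=1)
```

Convolving either train with a glottal pulse g[n] gives the corresponding glottal flow; with jitter and shimmer set to zero the function reproduces the ideal periodic model.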
Anatomy and Physiology of Speech Production: States of Vocal Folds
Breathing, Voicing, Unvoicing
Unvoicing – turbulence at the vocal folds (aspiration). Example: “he”; whispered sounds.
Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates and part is nearly fixed.
Anatomy and Physiology of Speech Production: Other forms of atypical vocal fold movement
Creaky voice – very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
Vocal fry – vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse; a result of coupling of the false vocal folds with the true vocal folds.
Diplophonic voice – secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Anatomy and Physiology of Speech Production
Examples of atypical voice types
Vocal Tract
Comprised of the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
The purpose of the vocal tract is to spectrally “color” the source and to generate new sources for sound production.
Spectral Shaping
Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract, a pole at z_0 = r_0 e^{jω_0} corresponds approximately to a vocal tract formant:
The frequency of the formant is ω_0.
The bandwidth depends on the distance of the pole from the unit circle (r_0).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
where c_k and c_k^* are complex-conjugate pole pairs.
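The link between a pole location and a formant can be made concrete with a short sketch. The slides state only that the bandwidth depends on r_0; the code below additionally assumes the commonly used approximation B ≈ −(f_s/π) ln r_0 for the 3-dB bandwidth in Hz, together with F = ω_0 f_s / (2π). The function names and the example values (500 Hz formant, 80 Hz bandwidth, 8 kHz sampling rate) are illustrative.

```python
import cmath
import math

def formant_from_pole(pole, fs):
    """Return (frequency_hz, bandwidth_hz) for a pole r0 * exp(j * w0).

    Uses the approximation bandwidth = -(fs / pi) * ln(r0), which is an
    assumption of this sketch rather than a formula stated in the slides.
    """
    freq = abs(cmath.phase(pole)) * fs / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * fs / math.pi
    return freq, bandwidth

def pole_from_formant(freq_hz, bandwidth_hz, fs):
    """Inverse mapping: place a pole for a desired formant frequency and bandwidth."""
    r0 = math.exp(-math.pi * bandwidth_hz / fs)
    w0 = 2.0 * math.pi * freq_hz / fs
    return cmath.rect(r0, w0)

pole = pole_from_formant(500.0, 80.0, 8000.0)
freq, bw = formant_from_pole(pole, 8000.0)
```

A pole closer to the unit circle (r_0 closer to 1) gives a narrower bandwidth, i.e. a sharper resonance peak.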
Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases. Male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.
Under the vocal tract’s linearity and time-invariance assumption, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.
Example 3.2 (cont.)
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, …, ω_N is determined by the spectral envelope |H(ω)G(ω)|, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consisted only of the glottal contribution).
Example 3.2 (Figure 3.11)
Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g. a gradual or abrupt closing) and by the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)| because of sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
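The effect described in Example 3.3 can be illustrated with a single two-pole resonator standing in for the first formant. This is a rough sketch, not a vocal tract model from the slides: the sampling rate, the 700 Hz sung pitch, the 400 Hz and 700 Hz formant positions, and the 80 Hz bandwidth are all assumed values, and the pole radius uses the approximation r_0 = exp(−πB/f_s).

```python
import cmath
import math

def resonator_gain(freq_hz, formant_hz, bandwidth_hz, fs):
    """Magnitude response at freq_hz of a two-pole resonator
    1 / ((1 - c z^-1)(1 - c* z^-1)) with its pole pair at the given formant."""
    r0 = math.exp(-math.pi * bandwidth_hz / fs)
    pole = cmath.rect(r0, 2.0 * math.pi * formant_hz / fs)
    z_inv = cmath.exp(-1j * 2.0 * math.pi * freq_hz / fs)
    return 1.0 / abs((1.0 - pole * z_inv) * (1.0 - pole.conjugate() * z_inv))

FS = 16000.0        # sampling rate in Hz (assumed)
PITCH = 700.0       # first harmonic of the sung tone in Hz (assumed)

gain_low_f1 = resonator_gain(PITCH, formant_hz=400.0, bandwidth_hz=80.0, fs=FS)
gain_matched_f1 = resonator_gain(PITCH, formant_hz=700.0, bandwidth_hz=80.0, fs=FS)
gain_increase_db = 20.0 * math.log10(gain_matched_f1 / gain_low_f1)
```

With the formant moved onto the first harmonic, the harmonic samples the resonance peak instead of its skirt, so `gain_increase_db` is positive.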
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: “nose” and “mouse”.
Spectral Shaping: “Nose”
Spectral Shaping: “Mouse”
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
Two dominant effects characterize nasalization:
Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage
Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage
Plosives
Source Generation
In the previous section the effect of vocal tract shape on sound production was discussed.
In Figure 3.10(b) a complete closure of the tract (the tongue pressing against the palate) is depicted. This closure is required when making an impulsive sound (plosive): build-up of pressure behind the palate, followed by abrupt release of pressure.
Source Generation: Plosives – “Drop” (figure annotations: VOT, aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), which generates turbulence and thus a noise source (e.g. fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e. impulse or noise source). There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by |H(ω)|.
Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives – “NASA”
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.
Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives – sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”
Plosives – created with an impulsive source within the oral tract. Example: “top”
Whispers – a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”
However, these subcategories do not relate exclusively to an unvoiced source: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples: “zebra” vs “sheba” (fricatives), “bin” vs “pin” (plosives)
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word “to”. A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the sliding (analysis) window concept was introduced. This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: Hamming window
$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n \le N_w - 1$$
The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau]\, x[n]$$
represents the windowed speech segments as a function of the window center at time τ.
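A discrete version of the sliding-window analysis can be sketched as follows. This is a minimal illustration, not the slides' implementation: frequency is sampled on a DFT grid of `nw` points, the window position is indexed by its start sample rather than its center, and the test tone, window length, and hop size are arbitrary example values. The direct DFT sum is used for clarity rather than speed.

```python
import cmath
import math

def hamming(nw):
    """Hamming window of length nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]

def stft(x, nw, hop):
    """Short-time Fourier transform of x.

    Returns one spectrum (DFT bins 0 .. nw//2) per window position; the
    window advances by `hop` samples (the frame interval) between frames.
    """
    w = hamming(nw)
    frames = []
    for start in range(0, len(x) - nw + 1, hop):
        segment = [w[n] * x[start + n] for n in range(nw)]
        spectrum = [sum(segment[n] * cmath.exp(-2j * math.pi * k * n / nw)
                        for n in range(nw))
                    for k in range(nw // 2 + 1)]
        frames.append(spectrum)
    return frames

# Test signal: a cosine with exactly 8 cycles per 64-sample window.
tone = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft(tone, nw=64, hop=32)
peak_bins = [max(range(len(f)), key=lambda k: abs(f[k])) for f in frames]
```

Squaring the magnitude of each entry of `frames` gives the spectrogram values for that window position.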
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
S(ω, τ) is a two-dimensional (2-D) representation of the “energy density” of the signal.
For each window position τ one could plot S(ω, τ). A better and more compact time-frequency display of the spectrogram places spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. This figure also illustrates two kinds of spectrograms:
Narrowband – gives good spectral resolution, e.g. a good view of the frequency content of sine waves with closely spaced frequencies
Wideband – gives good temporal resolution, e.g. a good view of the temporal content of impulses closely spaced in time
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n − kP]:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where
$$\hat{H}(\omega) = H(\omega)\,G(\omega), \qquad \omega_k = \frac{2\pi}{P} k,$$
and 2π/P is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrogram is in the length of the (analysis) window w[n, τ].
Narrowband Spectrogram
Uses a “long” window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation follows from the equation in the previous slide (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are “resolved”: horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. plosives that are closely followed by a voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
Widening of the Fourier transform causes neighboring window transforms to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence ĥ[n].
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. Vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
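The window-length rules of thumb from these slides (narrowband: at least two pitch periods; wideband: less than one pitch period) can be written as a small helper. This is only a sketch of those two rules; the function name, the "intermediate" label for lengths between one and two pitch periods, and the 125 Hz example pitch are assumptions added here.

```python
def spectrogram_type(window_ms, pitch_hz):
    """Classify an analysis window by its duration relative to the pitch period."""
    pitch_period_ms = 1000.0 / pitch_hz
    if window_ms >= 2.0 * pitch_period_ms:
        return "narrowband"      # harmonics can be resolved
    if window_ms < pitch_period_ms:
        return "wideband"        # individual pitch periods can be resolved
    return "intermediate"        # neither rule of thumb applies

def window_length_samples(window_ms, fs):
    """Window duration converted to a whole number of samples."""
    return int(round(window_ms * fs / 1000.0))

# For a 125 Hz voice (8 ms pitch period), the 20-ms and 4-ms windows
# mentioned above fall on opposite sides of the two rules.
long_kind = spectrogram_type(20.0, 125.0)
short_kind = spectrogram_type(4.0, 125.0)
long_samples = window_length_samples(20.0, 10000.0)
```

For a much higher-pitched voice the same 20-ms window covers many more pitch periods, while a 4-ms window may no longer be shorter than one period, so the classification depends on the speaker as well as on the window.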
Figure 3.15
Figure 3.16
MATLAB: spectrogram (axes: Time [s], Frequency [Hz])
MATLAB: waveform (Normalized Magnitude) and spectrogram (Frequency [Hz]) versus Time [s] for the utterance “She had your dark suit in greasy wash water all year”, with word and phone labels
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Classification of speech sounds can also be done from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of the constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output
4. The time-varying spectral characteristics revealed through the spectrogram
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

Elements of a Language
- One broad classification for the English language is in terms of:
  - Vowels
  - Consonants
  - Diphthongs
  - Affricates
  - Semi-vowels
- This classification is illustrated in Figure 3.17.

[Figure 3.17]

Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
- Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open
  - Tongue position and height: front, central, or back along the palate
  - Constriction: partial or complete
  - Velum state: nasal sound or not a nasal sound

Elements of a Language
- In English, the combinations of features yield 40 phonemes. Other languages can yield a smaller or larger number:
  - 11 in Polynesian
  - 141 in the "click" language of Khoisan
- The rules of a language define which phones can be strung together, and how, to form words.
  - In Italian, consonants are not normally allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
- The articulatory properties are influenced by:
  - adjacent phonemes,
  - speaking rate,
  - emphasis in speaking, and
  - the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  - Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
- Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
- Source: quasi-periodic.
  - Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
- In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  - Articulatory differences among speakers are one cause of allophonic variations:
    - the place and degree of constriction of the tongue hump, and
    - the cross-section and length of the vocal tract.
  - The vocal tract formants therefore vary with the speaker.

[Figure 3.18]

[Figure 3.19]

Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram:
  - It is dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  - These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  - Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

[Figure 3.20]

[Figure 3.21]

Elements of a Language: Fricatives
- There are two broad classes of fricatives: voiced and unvoiced.
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  - The constriction is narrower than with vowels.
- System: the location of the constriction by the tongue or lips determines which sound is produced:
  - the back, center, or front of the oral tract, as well as
  - the teeth or lips.
- Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4
- A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

  $$u[n] = g[n] * p[n]$$

  where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
- Simplified model of the periodic component of the voiced fricative output at the lips:

  $$x_g[n] = h[n] * (g[n] * p[n])$$

  where $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$.
- The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

  $$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4 (cont.)
- In the simplified model we assume that the results of the two airflow sources add:

  $$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

- See Exercise 3.10 for special characteristics of $x[n]$.
- Issues that have been ignored:
  - $u[n]$ is modified by the oral cavity.
  - $x_q[n]$ can be influenced by the back cavity.
  - Sources of nonlinear effects (distributed sources due to traveling vortices).
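
The two-source model of Example 3.4 can be sketched numerically as follows. This is a minimal illustration, not a reference implementation: the pulse shape, the two impulse responses, the random seed, and the function names are all hypothetical choices made here, and the white noise q[n] is drawn from a Gaussian generator.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, P, n_periods, h, hf, seed=0):
    """Return (u, xg, xq, x) with x[n] = h[n]*u[n] + hf[n]*(q[n]u[n]), u[n] = g[n]*p[n]."""
    rng = random.Random(seed)
    N = P * n_periods
    p = [1.0 if n % P == 0 else 0.0 for n in range(N)]    # impulse train, period P
    u = convolve(g, p)[:N]                                # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(N)]           # white noise source (assumed Gaussian)
    xg = convolve(h, u)[:N]                               # periodic component
    xq = convolve(hf, [qn * un for qn, un in zip(q, u)])[:N]  # noise modulated by u[n]
    x = [a + b for a, b in zip(xg, xq)]
    return u, xg, xq, x


# Hypothetical pulse shape and impulse responses, chosen only for illustration.
u, xg, xq, x = voiced_fricative([0.0, 0.5, 1.0, 0.5], 8, 4, [1.0, 0.5], [1.0, -0.9])
print([round(v, 3) for v in x[:8]])
```

Because q[n] is multiplied by u[n], the noise component vanishes wherever the glottal flow is zero, so the noise appears in bursts synchronized with the pitch period.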

Elements of a Language: Fricatives
- Spectrogram:
  - Unvoiced fricatives are characterized by a "noisy" spectrum.
  - Voiced fricatives often show both noise and harmonics.
- Waveform:
  - An unvoiced fricative contains only noise.
  - A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
- Whisper forms a class of its own under the general category of consonants.
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

[Figure 3.24: Fricatives]

[Figure 3.23]

Elements of a Language: Plosives
- Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
- As with fricatives, plosives can be voiced or unvoiced.
- System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
- Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short time duration
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst
- With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  - While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  - After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  - There is a much shorter delay between the burst and the voicing of the vowel onset.
- Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives
[Figure: waveform]

Elements of a Language: Plosives
[Figure: spectrogram]

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
- A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
- Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$.
- The glottal flow velocity model for the periodic source component is

  $$u[n] = g[n] * p[n]$$

  where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
- Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel.
  - This implies that the vocal tract output cannot be obtained with the convolution operator.
  - The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
- In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response $h_f[n, m]$.
- $h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
- Assuming that the two outputs can be combined linearly, the output can be written using a generalization of the convolution operator:

  $$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
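
The generalized convolution of Example 3.5 can be sketched directly as a double loop. This is only an illustration under stated assumptions: the responses are passed as hypothetical callables `h(n, m)` and `hf(n, m)`, the system is taken to be causal, and all inputs are assumed to be zero before n = 0.

```python
def time_varying_output(h, hf, u, burst, N):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) burst[n - m], for n = 0 .. N-1.

    h(n, m) and hf(n, m) give the response at time n to a unit sample
    applied m samples earlier (at time n - m).
    """
    x = []
    for n in range(N):
        acc = 0.0
        for m in range(n + 1):          # causal system, inputs zero before n = 0
            acc += h(n, m) * u[n - m] + hf(n, m) * burst[n - m]
        x.append(acc)
    return x


# Hypothetical example: a burst at n = 0 exciting a front cavity whose
# decay rate changes with time n; the periodic source is switched off here.
N = 6
burst = [1.0] + [0.0] * (N - 1)
silent = [0.0] * N
y = time_varying_output(lambda n, m: 0.0, lambda n, m: (0.9 - 0.01 * n) ** m, silent, burst, N)
print([round(v, 4) for v in y])
```

When `h(n, m)` does not depend on `n`, the sum reduces to ordinary convolution, which is the time-invariant case used in Example 3.4.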

Elements of a Language: Transitional Speech Sounds
- Diphthongs:
  - Vowel-like in nature, with vibrating vocal folds.
  - They do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  - Characterized by movement from one vowel target to another.
  - Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds
- Semi-vowels: two categories of vowel-like sounds
  - Glides (/w/ as in "we" and /y/ as in "you")
  - Liquids (/r/ as in "read" and /l/ as in "let")
- Glides, compared to diphthongs, have:
  - greater constriction of the oral tract during the transition, and
  - greater speed of the oral tract movement.

[Figure 3.28: Liquids and Glides]

Elements of a Language: Transitional Speech Sounds
- Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
- The difference from fricatives is that affricates have:
  - a fricative portion preceded by a complete constriction of the oral cavity,
  - formed at the same place as for the plosive.
- Examples:
  - /tS/ as in "chew" (unvoiced)
  - /J/ as in "just" (voiced)

Coarticulation
- The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
- Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
- Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
- Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  - Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe" and "sweep" vs. "seep".
  - Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
- The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  - Intonation (change in pitch)
  - Amplitude/energy (loudness)
  - Timing (articulation rate or rhythm)
- These rules are followed to convey different:
  - Meaning
  - Stress
  - Emotion

[Figure 3.29: Prosody]

[Figure 3.30: Global Coarticulation]
Anatomy and Physiology of Speech Production
Introduction

Anatomy and Physiology of Speech Production
- The anatomy of speech production is shown in Figure 3.2.
- Lungs:
  - Inhalation and exhalation of air.
  - Connected to the vocal tract through the trachea ("windpipe") and the epiglottis.
    - The trachea is a ~12-cm-long, ~1.5-2-cm-diameter pipe.
  - During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production.
    - The duration of exhalation becomes roughly equal to the length of the sentence or phrase.
    - Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.

Anatomy and Physiology of Speech Production
- Larynx:
  - A complicated system of cartilages, flesh, muscles, and ligaments.
  - Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
  - The vocal folds are ~15 mm long in men and ~13 mm long in women.

[Figure: Larynx]

Anatomy and Physiology of Speech Production
- Three primary states of the vocal folds:
  - Breathing: the arytenoid cartilages are held outward.
  - Voiced: the arytenoid cartilages are held close together.
  - Unvoiced: the arytenoid cartilages are held outward or partially closed.
- The complex motion of the vocal folds is illustrated in Figure 3.4.
- Nonlinear two-mass model of Flanagan et al. (Figure 3.5).
- Dictionary: arytenoid (ar·y·te·noid), pronounced ˌa-rə-ˈtē-ˌnoid or ə-ˈri-tən-ˌoid; adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, ladle. Date: circa 1751.
  1. Relating to or being either of two small laryngeal cartilages to which the vocal cords are attached.
  2. Relating to or being either of a pair of small muscles or an unpaired muscle of the larynx.
  Also used as a noun.

Anatomy and Physiology of Speech Production
[Figure: Flanagan et al. model]

Anatomy and Physiology of Speech Production
- If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:
  - Closed phase: the folds are closed and no flow occurs.
  - Open phase: the folds are open and the flow increases up to a maximum.
  - Return phase: the time interval from the maximum airflow until the glottal closure.
- The specific flow shape can change with:
  - the speaker,
  - the speaking style, and
  - the specific speech sound.
- The glottal airflow is referred to as the glottal flow.
- The time duration of one glottal cycle is referred to as the pitch period.
- The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
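
As a small worked illustration of the last two definitions, the sketch below converts between pitch period and pitch. The sampling rate `fs_hz`, the numeric values, and the function names are assumptions introduced here for illustration; they do not come from the slides.

```python
def pitch_hz(period_samples, fs_hz):
    """Pitch (fundamental frequency) as the reciprocal of the pitch period."""
    period_s = period_samples / fs_hz   # pitch period in seconds
    return 1.0 / period_s


def pitch_period_samples(f0_hz, fs_hz):
    """Pitch period, in samples, for a given pitch."""
    return fs_hz / f0_hz


# Hypothetical values: an 80-sample glottal cycle at an assumed 8 kHz sampling rate.
print(pitch_hz(80, 8000))
```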

Example 3.1
- Consider a glottal flow waveform model of the form

  $$u[n] = g[n] * p[n]$$

  where $g[n]$ is the glottal flow waveform over a single cycle and $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$ is an impulse train with spacing $P$.
- Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as

  $$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

- Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain, where $\circledast$ denotes convolution in frequency:

  $$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$

Example 3.1 (cont.)

  $$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

- Here $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency, or pitch.
- As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$, with lower surrounding side lobes.
- Figure 3.7 also shows the effect of the harmonics of the glottal waveform on the spectrum.

[Figure 3.7]

Example 3.1 (cont.)
- A decrease in the pitch period $P$ increases the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
- The first harmonic is also the fundamental frequency.
- At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
- The magnitude of the spectral shaping function, i.e. the glottal flow spectrum $|G(\omega)|$ sampled at the harmonics $\omega_k$, is referred to as the spectral envelope of the harmonics.
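
The harmonic structure described in Example 3.1 can be checked numerically. The sketch below builds a windowed periodic glottal flow and evaluates its DFT magnitude with a direct sum, so it needs only the standard library. The pulse shape, pitch period, and window length are hypothetical values chosen here, and the window is anchored at the start of the segment rather than centered.

```python
import cmath
import math


def dft_magnitude(x):
    """Magnitude of the DFT of x for bins 0 .. len(x)//2 (direct O(N^2) sum)."""
    N = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
        for k in range(N // 2 + 1)
    ]


def windowed_glottal_spectrum(g, P, n_periods):
    """|U| for u[n] = g[n] * p[n], windowed by a Hamming window of n_periods * P samples."""
    N = P * n_periods
    u = [0.0] * N
    for start in range(0, N, P):            # one copy of g[n] per impulse of p[n]
        for i, gi in enumerate(g):
            if start + i < N:
                u[start + i] += gi
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
    return dft_magnitude([wn * un for wn, un in zip(w, u)])


# Hypothetical glottal pulse and pitch period (in samples).
mag = windowed_glottal_spectrum([0.0, 0.5, 1.0, 0.5], P=16, n_periods=8)
# With N = 8 * P, harmonic k of the pitch falls on DFT bin 8 * k.
for k in range(1, 5):
    print(k, round(mag[8 * k], 3), round(mag[8 * k + 4], 3))
```

Each printed row compares the magnitude at a harmonic with the magnitude midway to the next harmonic; the first is a window main lobe weighted by the glottal spectrum, the second is only side-lobe leakage.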

Anatomy and Physiology of Speech Production
- The Fourier transform of the periodic glottal waveform is characterized by harmonics.
  - Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  - The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
- The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  - The pitch period can vary "randomly" over successive periods: pitch "jitter".
  - The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
- These variations are (perhaps) due to:
  - time-varying characteristics of the vocal tract and vocal folds,
  - nonlinear behavior in the speech anatomy, or
  - an underlying deterministic (chaotic) system whose output appears random.
- Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  - Voice character is determined by the extent of jitter and shimmer in the voice (e.g. a hoarse voice).
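
One simple way to quantify jitter and shimmer is the mean absolute change between consecutive cycles, relative to the mean value. The slides do not define a specific measure, so this definition, the function name, and the numbers below are assumptions made only for illustration.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical per-cycle measurements: pitch periods in samples, peak amplitudes.
periods = [100, 102, 100, 102]
amplitudes = [1.00, 0.95, 1.02, 0.97]
print("jitter :", relative_perturbation(periods))
print("shimmer:", relative_perturbation(amplitudes))
```

Applied to the sequence of pitch periods this gives a jitter measure; applied to the per-cycle amplitudes it gives a shimmer measure. A perfectly periodic source gives zero for both.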

Anatomy and Physiology of Speech Production
- States of the vocal folds: breathing, voicing, and unvoicing.
- Unvoicing:
  - Turbulence at the vocal folds: aspiration. Example: "he"; whispered sounds.
  - Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part remains nearly fixed.

Anatomy and Physiology of Speech Production
- Other forms of atypical vocal fold movement:
  - Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
  - Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch.
    - Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse.
    - A result of coupling of the false vocal folds with the true vocal folds.
  - Diplophonic voice: secondary glottal pulses occur between the primary pulses within the closed phase (see Figure 3.9b and Figure 3.16).

[Figure: examples of atypical voice types]

Vocal Tract
- Comprises the oral cavity from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
- The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators:
  - Tongue
  - Teeth
  - Lips
  - Jaw
- The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
- The purpose of the vocal tract is to:
  - spectrally "color" the source, and
  - generate new sources for sound production.

Spectral Shaping
- Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
- The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
- Formants change with different vocal tract configurations, as depicted in Figure 3.10.

[Figure 3.10]

Spectral Shaping
- The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
- Consider a time-invariant, all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  - The frequency of the formant is $\omega_0$.
  - The bandwidth depends on the distance of the pole from the unit circle (i.e. on $r_0$).
- Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial-fraction expansion form:

  $$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
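
The relation between a pole and its formant can be explored with a short sketch of the product form. The mapping from bandwidth to pole radius used below, r0 = exp(-pi * B / fs), is a standard approximation for poles close to the unit circle; it, the sampling rate, the formant values, and the function names are assumptions introduced here and are not taken from the slides.

```python
import cmath
import math


def pole_from_formant(freq_hz, bw_hz, fs_hz):
    """Pole z0 = r0 * exp(j*w0) for a formant frequency and an approximate bandwidth."""
    r0 = math.exp(-math.pi * bw_hz / fs_hz)   # closer to 1 means narrower bandwidth
    w0 = 2 * math.pi * freq_hz / fs_hz        # formant frequency in radians/sample
    return r0 * cmath.exp(1j * w0)


def allpole_magnitude(poles, w, gain=1.0):
    """|H(e^{jw})| for H(z) = gain / prod_k (1 - c_k z^-1)(1 - c_k* z^-1)."""
    z_inv = cmath.exp(-1j * w)
    denom = 1.0
    for c in poles:
        denom *= (1 - c * z_inv) * (1 - c.conjugate() * z_inv)
    return abs(gain / denom)


# Hypothetical single formant at 500 Hz with 80 Hz bandwidth, assumed fs = 8 kHz.
fs = 8000.0
c = pole_from_formant(500.0, 80.0, fs)
mags = [allpole_magnitude([c], 2 * math.pi * f / fs) for f in range(0, 4001, 5)]
print("pole radius:", round(abs(c), 4), "peak near (Hz):", 5 * mags.index(max(mags)))
```

The spectral peak falls close to, but not exactly at, the pole angle, which is why the slides say the peaks correspond only approximately to the formants.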

Spectral Shaping
- The formants of the vocal tract are numbered from low to high according to their location: $F_1$, $F_2$, etc.
- In general, the formant frequencies decrease as the vocal tract length increases:
  - Male speakers tend to have lower formants than female speakers.
  - Female speakers tend to have lower formants than children.
- Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input with the vocal tract impulse response.
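
The dependence of formants on vocal tract length can be illustrated with the textbook idealization of a uniform lossless tube, closed at the glottis and open at the lips, whose resonances are odd multiples of c / (4L). This tube model, the speed of sound value, and the tract lengths below are assumptions brought in for illustration; the slides state only the qualitative trend.

```python
def uniform_tube_formants(length_m, n_formants=3, c_m_per_s=340.0):
    """Resonances (Hz) of a uniform lossless tube closed at one end, open at the other."""
    return [(2 * k - 1) * c_m_per_s / (4.0 * length_m) for k in range(1, n_formants + 1)]


# 17 cm is the average adult male length quoted in these slides;
# 14 cm is a hypothetical shorter tract used only for comparison.
for length in (0.17, 0.14):
    print(length, [round(f) for f in uniform_tube_formants(length)])
```

Under this idealization every formant scales inversely with the tube length, so a shorter tract raises all formants together.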
Vowels

Example 3.2
- Consider a periodic glottal flow source of the form

  $$u[n] = g[n] * p[n]$$

  where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$.
- When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

  $$x[n] = h[n] * (g[n] * p[n])$$

- A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment

  $$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

- Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e. its frequency-domain representation, is obtained.

Example 3.2 (cont.)

  $$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

- Here $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency, or pitch.
- Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, where it consisted only of the glottal contribution).

[Figure 3.11]

Example 3.2 (cont.)
- The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
  - the nature of the glottal flow waveform over a cycle (e.g. a gradual or abrupt closing), and
  - the manner in which the formant tails add.
- Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ sparsely. This is especially the case for high-pitched speech.
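
The sparse sampling of the envelope by the harmonics is easy to see with a little arithmetic. The sketch below lists the harmonics of a given pitch and measures how far a formant lies from the nearest one; the pitch values, the formant frequency, the 4 kHz band, and the function names are hypothetical choices made here.

```python
def harmonic_frequencies(f0_hz, fmax_hz):
    """Harmonics k * f0 up to fmax (k = 1, 2, ...)."""
    return [k * f0_hz for k in range(1, int(fmax_hz // f0_hz) + 1)]


def nearest_harmonic_offset(formant_hz, f0_hz):
    """Distance in Hz from a formant to the closest source harmonic."""
    k = max(1, round(formant_hz / f0_hz))
    return abs(formant_hz - k * f0_hz)


# Hypothetical low-pitched and high-pitched voices, same 500 Hz formant.
for f0 in (100, 400):
    print(f0, len(harmonic_frequencies(f0, 4000)), nearest_harmonic_offset(500, f0))
```

A lower pitch places many harmonics under the envelope and one lands on the formant; a higher pitch leaves far fewer samples and can miss the formant peak altogether.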

Spectral Shaping
- The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  - A formant corresponds to a vocal tract pole (resonant frequency).
  - Harmonics arise from the periodicity of the glottal source (pitch).
- In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
- On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3
- A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency ($F_1$) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
- To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

[Figure 3.12]
Nasal Sounds

Spectral Shaping
- The nasal and oral components of the vocal tract are coupled by the velum.
- When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
- The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping
[Figure: "nose"]

Spectral Shaping
[Figure: "mouse"]

Spectral Shaping
- Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
- The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
- Two dominant effects characterize nasalization:
  - broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage, and
  - introduction of anti-resonances (i.e. zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives

Source Generation
- The previous section discussed the effect of vocal tract shape on sound production.
- Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  - build-up of pressure behind the palate, and
  - abrupt release of pressure.

Source Generation: Plosives
[Figure: "drop", with VOT and aspiration marked]
Fricatives

Source Generation
- Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow); the constriction generates turbulence and thus a noise source (e.g. fricatives).
- As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e. an impulse or a noise source):
  - There is no harmonic structure with these types of inputs.
  - The source spectrum is shaped at all frequencies by $|H(\omega)|$.
- Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives
[Figure: "NASA"]

Source Generation
- There is another class of source generated within the vocal tract; it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  - This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source
- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
  - Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  - Plosives: created with an impulsive source within the oral tract. Example: "top".
  - Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
- However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  - Fricatives: "zebra" vs. "sheba"
  - Plosives: "bin" vs. "pin"
Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
- The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  - Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
- Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.
  - The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  - Example: the Hamming window,

    $$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$

  - The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
- The short-time Fourier transform is

  $$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

  where

  $$x[n, \tau] = w[n, \tau]\, x[n]$$

  represents the windowed speech segments as a function of the window center at time $\tau$.
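
A minimal sketch of the Hamming window and the short-time Fourier transform follows, using only the standard library and a direct DFT sum. Several details are assumptions made here for simplicity: the frame position is the window start rather than its center, the DFT phase is referenced to the start of each frame, only bins up to half the window length are kept, and the test signal and hop size are hypothetical.

```python
import cmath
import math


def hamming(Nw):
    """Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n / (Nw - 1)), n = 0 .. Nw-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]


def stft(x, Nw, hop):
    """One DFT (bins 0 .. Nw//2) per frame; frames start every `hop` samples."""
    w = hamming(Nw)
    frames = []
    for tau in range(0, len(x) - Nw + 1, hop):      # tau is the frame start here
        seg = [w[n] * x[tau + n] for n in range(Nw)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / Nw) for n in range(Nw))
            for k in range(Nw // 2 + 1)
        ])
    return frames


# Hypothetical test signal: a cosine that sits exactly on bin 8 of a 64-point frame.
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft(x, Nw=64, hop=32)
spectrogram = [[abs(v) ** 2 for v in frame] for frame in frames]   # S = |X|^2
print(len(frames), spectrogram[0].index(max(spectrogram[0])))
```

The hop size plays the role of the frame interval: a smaller hop gives more frames per second at the cost of more computation.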

Spectrographic Analysis of Speech
- The spectrogram is graphically displayed as

  $$S(\omega, \tau) = |X(\omega, \tau)|^2$$

- $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
- For each window position $\tau$, one could plot $S(\omega, \tau)$ against $\omega$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.

[Figure: Spectrographic Analysis of Speech]

[Figure: Wide-band Spectrogram]

[Figure: Narrow-band Spectrogram]

Spectrographic Analysis of Speech
- For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$, driven by a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

  $$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

  where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
- From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

  $$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

  where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech
- The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.
- Narrowband spectrogram:
  - Uses a "long" window with a duration of typically at least two pitch periods.
  - Under the conditions that
    - the main lobes of the shifted window Fourier transforms are non-overlapping, and
    - the corresponding transform side lobes are negligible,
    the spectrogram expression for voiced speech leads to the following approximation (Exercise 3.8):

    $$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
- The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  - Shortening the window widens its Fourier transform (recall the uncertainty principle).
  - The wider window transforms cause neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
  $S(\omega, \tau) \approx \beta \, |\hat{H}(\omega)|^2 \, E[\tau]$
  where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:
  $E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
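The contrast between the two window lengths can be sketched on an idealized periodic source (a bare impulse train, with no vocal tract shaping). The pitch period, the window lengths, and the helper names `hamming`, `stft_mag`, and `energy` are assumptions chosen for illustration only.

```python
import cmath
import math

def hamming(nw):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]

def stft_mag(x, start, nw, omega):
    """|X(omega, tau)| for a Hamming window covering x[start : start + nw]."""
    w = hamming(nw)
    acc = sum(w[n] * x[start + n] * cmath.exp(-1j * omega * (start + n))
              for n in range(nw))
    return abs(acc)

def energy(x, start, nw):
    """Energy of the windowed segment, E[tau]."""
    w = hamming(nw)
    return sum((w[n] * x[start + n]) ** 2 for n in range(nw))

P = 80                                                    # pitch period in samples (assumed)
x = [1.0 if n % P == 0 else 0.0 for n in range(10 * P)]   # idealized periodic source

omega_harmonic = 2 * math.pi * 5 / P      # on the 5th harmonic
omega_between = 2 * math.pi * 5.5 / P     # halfway to the 6th harmonic

# Narrowband: the window spans three pitch periods, so harmonics stand out.
long_nw = 3 * P
nb_on = stft_mag(x, 0, long_nw, omega_harmonic)
nb_off = stft_mag(x, 0, long_nw, omega_between)

# Wideband: the window is half a pitch period, so it sees at most one pulse.
short_nw = P // 2
centered_on_pulse = P - short_nw // 2
wb_on = stft_mag(x, centered_on_pulse, short_nw, omega_harmonic)
wb_off = stft_mag(x, centered_on_pulse, short_nw, omega_between)

# Energy under the short window as it slides: large on a pulse, zero between pulses.
energy_on_pulse = energy(x, centered_on_pulse, short_nw)
energy_between = energy(x, P + short_nw, short_nw)

print(nb_on / nb_off, wb_on / wb_off, energy_on_pulse, energy_between)
```

With the long window the magnitude on a harmonic is many times larger than between harmonics (horizontal striations); with the short window the two magnitudes are equal, while the energy under the window rises and falls once per pitch period (vertical striations).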
Figure 3.15
Figure 3.16
MATLAB
  [Figure: spectrogram; axes Time [s] (0 to 3) and Frequency [Hz] (0 to 8000)]
  [Figure: waveform (normalized magnitude) and spectrogram of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels; axes Time [s] and Frequency [Hz]]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Classification of speech sounds can also be done from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two factors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is done in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open.
  Tongue position and height: front, central, or back along the palate.
  Constriction: partial or complete.
  Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
Rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Vowels
  Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
  System: each vowel phoneme corresponds to a different vocal tract configuration.
  Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences between speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ between speakers; therefore the vocal tract formants will vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Nasals
  Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  System:
    The velum is lowered and the air flows mainly through the nasal cavity.
    Because the oral tract is constricted, the sound is radiated at the nostrils.
    Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  Spectrogram:
    Is dominated by the low resonance of the large volume of the nasal cavity.
    The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
    Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
  Source:
    For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
    For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
    Noise is generated by turbulent airflow at some point of constriction along the oral tract.
    The constriction is narrower than with vowels.
  System: the location of the constriction made by the tongue or lips determines which sound is produced: back, center, or front of the oral tract, as well as the teeth or lips.
  Spectrogram: noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  $u[n] = g[n] * p[n]$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Voiced fricative, simplified model of the output at the lips due to the periodic source:
  $x_g[n] = h[n] * (g[n] * p[n])$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component models the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity that has impulse response h_f[n]:
  $x_q[n] = h_f[n] * (q[n] \, u[n])$
In the simplified model, we assume that the results of the two airflow sources add:
  $x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] \, u[n])$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  x_q[n] can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
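The simplified voiced-fricative model of Example 3.4 can be sketched in a few lines. Everything numeric here is a placeholder: the glottal pulse `g`, the responses `h` and `h_f`, the pitch period, and the seeded Gaussian noise standing in for the white noise source q[n].

```python
import random

def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

random.seed(0)                                    # reproducible noise
P = 10                                            # pitch period in samples (assumed)
g = [0.0, 0.5, 1.0, 0.5]                          # toy glottal pulse; flow is zero elsewhere
p = [1.0 if n % P == 0 else 0.0 for n in range(5 * P)]
u = convolve(g, p)[:len(p)]                       # periodic glottal flow u[n] = g[n] * p[n]
q = [random.gauss(0.0, 1.0) for _ in u]           # white noise source q[n]
modulated = [qn * un for qn, un in zip(q, u)]     # q[n] u[n]: noise gated by the glottal flow

h = [1.0, 0.6, 0.3]                               # toy vocal tract response (assumed)
h_f = [1.0, -0.5]                                 # toy front-cavity response (assumed)

x_g = convolve(h, u)                              # periodic component
x_q = convolve(h_f, modulated)                    # noise component
n_out = min(len(x_g), len(x_q))
x = [x_g[n] + x_q[n] for n in range(n_out)]       # the two components add
```

Note that the noise is multiplied by u[n] rather than convolved with it, so in this sketch the turbulence is present only while the glottis lets air through.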
Elements of a Language: Fricatives
  Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
  System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  Sequence of events:
    1. Complete closure of the oral tract and buildup of air pressure.
    2. Release of air pressure and generation of turbulence over a very short time duration.
    3. Generation of aspiration due to turbulence at the open vocal folds.
    4. Onset of the following vowel about 40-50 ms after the burst.
  With voiced plosives, the vocal folds vibrate for the duration of all four steps.
    During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
    After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
    There is a much shorter delay between the burst and the onset of voicing of the vowel.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (Waveform)
Elements of a Language: Plosives (Spectrogram)
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by
  $u[n] = g[n] * p[n]$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n, m].
h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator:
  $x[n] = \sum_{m=-\infty}^{\infty} h[n, m] \, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \, \delta[n - m]$
We have assumed that the two outputs can be linearly combined.
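The generalized convolution of Example 3.5 can be sketched as follows. The responses `h_fixed`, `h_moving`, and `h_front` are hypothetical one-pole decays invented for illustration (a real vocal tract trajectory would be far richer), and the sums are truncated to causal lags up to `max_lag`.

```python
def time_varying_output(h, u, max_lag):
    """x[n] = sum_m h(n, m) * u[n - m], where h(n, m) is the response at time n
    to a unit sample applied m samples earlier (causal, truncated at max_lag)."""
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n, max_lag) + 1):
            acc += h(n, m) * u[n - m]
        x.append(acc)
    return x

def h_fixed(n, m):
    return 0.5 ** m                 # time-invariant reference: does not depend on n

def h_moving(n, m):
    decay = 0.5 + 0.004 * n         # decay drifts with time n (assumed trajectory)
    return decay ** m

def h_front(n, m):
    return (-0.3) ** m              # toy front-cavity response (assumed)

P = 10                                                  # pitch period in samples (assumed)
u = [1.0 if n % P == 0 else 0.0 for n in range(50)]     # idealized periodic source
burst = [1.0] + [0.0] * 49                              # impulse delta[n] at n = 0

x_periodic = time_varying_output(h_moving, u, max_lag=20)
x_burst = time_varying_output(h_front, burst, max_lag=20)
x = [a + b for a, b in zip(x_periodic, x_burst)]        # linear combination of the outputs
```

When the response does not depend on n (as with `h_fixed`), the same routine reduces to ordinary convolution, which is a convenient sanity check on the indexing.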
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: "hide" /Y/, "out" /W/, "boy" /O/, "new" /JU/.
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
  Two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").
  Glides, compared to diphthongs, have greater constriction of the oral tract during the transition and greater speed of oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates
  Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, and so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Anatomy and Physiology of Speech Production
The anatomy of speech production is shown in Figure 3.2.
Lungs
  Inhalation and exhalation of air.
  Connected to the vocal tract through the trachea ("windpipe"), a pipe ~12 cm long and ~1.5-2 cm in diameter, and the epiglottis.
  During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production:
    The duration of exhalation becomes roughly equal to the length of the sentence or phrase.
    Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.
Anatomy and Physiology of Speech Production
Larynx
  A complicated system of cartilages, flesh, muscles, and ligaments.
  Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
  The vocal folds are ~15 mm long in men and ~13 mm long in women.
Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
  Breathing: the arytenoid cartilages are held outward.
  Voiced: the arytenoid cartilages are held close together.
  Unvoiced: the arytenoid cartilages are held outward or partially closed.
The complex motion of the vocal folds is illustrated in Figure 3.4.
Nonlinear two-mass model of Flanagan et al. (Figure 3.5).
Dictionary: arytenoid (ar·y·te·noid). Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, ladle. Date: circa 1751. 1: relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. 2: relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Noun: arytenoid.
Anatomy and Physiology of Speech Production
Flanagan et al. model
Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6:
  Closed phase: the folds are closed and no flow occurs.
  Open phase: the folds are open and the flow increases up to a maximum.
  Return phase: the time interval from the maximum airflow until glottal closure.
The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.
Glottal airflow is referred to as glottal flow.
The time duration of one glottal cycle is referred to as the pitch period.
The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
Example 3.1
Consider a glottal flow waveform model of the form
  $u[n] = g[n] * p[n]$
where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P:
  $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$
Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by w[n, τ], is centered at time τ, as illustrated in Figure 3.7, and the resulting waveform segment is written as
  $u[n, \tau] = w[n, \tau] \, (g[n] * p[n])$
Using the Multiplication and Convolution Theorems of Chapter 2, the following expressions are obtained in the frequency domain:
  $U(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega) \, \delta(\omega - \omega_k) \right]$
  $U(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k) \, \delta(\omega - \omega_k) \right]$
  $U(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k) \, W(\omega - \omega_k, \tau)$
where W(ω, τ) is the Fourier transform of w[n, τ], G(ω) is the Fourier transform of g[n], and ω_k = (2π/P)k, where 2π/P is the fundamental frequency, or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0 with lower surrounding side lobes.
Figure 3.7
Effect of the harmonics of the glottal waveform on the spectrum:
  A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k.
  The first harmonic is also the fundamental frequency.
  At each harmonic frequency there is a translated window Fourier transform W(ω - ω_k, τ) weighted by G(ω_k).
  The magnitude of the spectral shaping function, i.e., of the glottal flow transform, |G(ω)|, is referred to as the spectral envelope of the harmonics.
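A short sketch of the harmonic spacing ω_k = (2π/P)k, converted to Hz. The sampling rate and the two pitch periods are assumed values picked only to make the numbers round; `harmonic_frequencies_hz` is a hypothetical helper name.

```python
import math

def harmonic_frequencies_hz(pitch_period_samples, sample_rate_hz, count):
    """First `count` harmonics of omega_k = (2*pi/P) * k, converted to Hz."""
    omega_1 = 2 * math.pi / pitch_period_samples      # fundamental, radians per sample
    return [k * omega_1 * sample_rate_hz / (2 * math.pi) for k in range(1, count + 1)]

fs = 8000                                             # assumed sampling rate in Hz
low_pitch = harmonic_frequencies_hz(80, fs, 3)        # P = 80 samples (10 ms)
high_pitch = harmonic_frequencies_hz(40, fs, 3)       # P = 40 samples (5 ms)
print(low_pitch, high_pitch)
```

Halving the pitch period doubles the spacing between harmonics, so the same spectral envelope is sampled at half as many points.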
Anatomy and Physiology of Speech Production
The Fourier transform of the periodic glottal waveform is characterized by harmonics.
  Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:
  The pitch period can "randomly" vary over successive periods: pitch "jitter".
  The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
These variations are (perhaps) due to time-varying characteristics of the vocal tract and vocal folds or to nonlinear behavior in the speech anatomy, or they may appear random while being the result of an underlying deterministic (chaotic) system.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
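The slides do not define a numerical measure of jitter or shimmer. One common convention, used here purely as an assumption, is the mean absolute difference between consecutive cycles divided by the mean value. The period and amplitude lists below are hypothetical measurements.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive values, as a fraction of the mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

pitch_periods_ms = [10.0, 10.2, 9.8, 10.0]    # hypothetical pitch periods of successive cycles
peak_amplitudes = [1.0, 0.9, 1.1, 1.0]        # hypothetical per-cycle peak amplitudes

jitter = relative_perturbation(pitch_periods_ms)     # cycle-to-cycle period variation
shimmer = relative_perturbation(peak_amplitudes)     # cycle-to-cycle amplitude variation
monotone = relative_perturbation([10.0, 10.0, 10.0]) # perfectly regular "machine-like" source
print(jitter, shimmer, monotone)
```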
Anatomy and Physiology of Speech Production
States of the vocal folds: breathing, voicing, and unvoicing.
  Unvoicing: turbulence at the vocal folds produces aspiration. Example: "he"; whispered sounds.
  Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates and part is nearly fixed.
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
  Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
  Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Examples of atypical voice types
Vocal Tract
Comprised of the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.
Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  The frequency of the formant is ω_0.
  The bandwidth depends on the distance of the pole from the unit circle (r_0).
  Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
  $H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$
  $H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$
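To make the pole picture concrete, the sketch below evaluates the magnitude response of a single complex-conjugate pole pair and compares a pole close to the unit circle with one farther inside. The sampling rate, the 500 Hz formant, the two radii, and the helper names are illustrative assumptions.

```python
import cmath
import math

def resonator_gain(freq_hz, formant_hz, r0, sample_rate_hz):
    """|H| at freq_hz for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)), with c = r0 e^{j w0}."""
    w0 = 2 * math.pi * formant_hz / sample_rate_hz
    c = r0 * cmath.exp(1j * w0)
    z_inv = cmath.exp(-1j * 2 * math.pi * freq_hz / sample_rate_hz)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

fs = 8000.0                      # assumed sampling rate in Hz
grid = range(0, 4001, 10)        # evaluation frequencies in Hz

def peak(r0):
    """Return (frequency of the largest gain on the grid, that gain)."""
    gains = [resonator_gain(f, 500.0, r0, fs) for f in grid]
    best = max(range(len(gains)), key=gains.__getitem__)
    return grid[best], gains[best]

peak_near, gain_near = peak(0.99)   # pole close to the unit circle: sharp, tall resonance
peak_far, gain_far = peak(0.90)     # pole farther inside: broader, lower resonance
print(peak_near, gain_near, peak_far, gain_far)
```

The spectral peak sits approximately, not exactly, at the pole angle; the offset grows as the pole moves away from the unit circle, which is why the slides say the peaks correspond "approximately" to the formants.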
Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
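The dependence of formants on vocal tract length can be illustrated with a standard acoustic idealization that is not part of these slides: a uniform lossless tube, closed at the glottis and open at the lips, resonates at odd multiples of c/(4L). The speed of sound (340 m/s) and the shorter tract length are assumed values; 17 cm is the adult male average quoted earlier.

```python
def uniform_tube_formants_hz(length_m, count, speed_of_sound_m_s=340.0):
    """Resonances of an idealized uniform tube closed at one end and open at the other:
    F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * speed_of_sound_m_s / (4.0 * length_m)
            for k in range(1, count + 1)]

adult_male = uniform_tube_formants_hz(0.17, 3)     # 17 cm tract
shorter_tract = uniform_tube_formants_hz(0.14, 3)  # hypothetical shorter tract
print(adult_male, shorter_tract)
```

Under this idealization the 17 cm tube gives resonances near 500, 1500, and 2500 Hz, and every resonance of the shorter tube is higher, consistent with the trend stated above. Real vowels deviate from these values because the tract is not uniform.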
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
  $u[n] = g[n] * p[n]$
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by
  $x[n] = h[n] * (g[n] * p[n])$
A window w[n, τ], centered at time τ, is applied to the vocal tract output to obtain the speech segment
  $x[n, \tau] = w[n, \tau] \, \{ h[n] * (g[n] * p[n]) \}$
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained:
  $X(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$
  $X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k) \, W(\omega - \omega_k, \tau)$
where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency, or pitch.
Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).
Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by the manner in which the formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because of the sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
There are two dominant effects that characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
In the previous section, the effect of the vocal tract shape on sound production was discussed.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive): pressure builds up behind the palate and is then abruptly released.
Source Generation: Plosives, "Drop"
  [Figure labels: VOT, Aspiration]
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by |H(ω)|.
Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these categories are not tied exclusively to an unvoiced source: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding
(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to
avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1
Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal
wherex[n]= w[n]x[n]
represents the windowed speech segments as function of the window center at time
n
njenxX ][)(
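As a small check on the Hamming window formula, the sketch below evaluates it directly and applies it to a segment of a signal. The helper names `hamming_window` and `windowed_segment`, the 8 kHz sampling rate, and the placeholder signal are assumptions made for illustration; the window position `tau` is taken here as the first sample under the window.

```python
import numpy as np

def hamming_window(n_w):
    """w[n] = 0.54 - 0.46 cos(2 pi n / (N_w - 1)) for 0 <= n <= N_w - 1."""
    n = np.arange(n_w)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (n_w - 1))

def windowed_segment(x, tau, n_w):
    """Segment of x under a window of length n_w whose first sample is at tau."""
    return hamming_window(n_w) * x[tau:tau + n_w]

w = hamming_window(160)        # e.g. 20 ms at an assumed 8 kHz sampling rate
x = np.ones(1000)              # placeholder signal (an assumption)
seg = windowed_segment(x, tau=200, n_w=160)
```

The taper is visible at the edges: the first and last window samples equal 0.08, so the segment fades in and out instead of being cut off abruptly.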
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
\[ S(\omega, \tau) = |X(\omega, \tau)|^2 \]
$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$, one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
\[ x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n]) \]
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
\[ S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2 \]
where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
  the main lobes of the shifted window Fourier transforms are non-overlapping, and
  the corresponding transform side lobes are negligible,
the spectrogram expression on the previous slide yields the following approximation (Exercise 3.8):
\[ S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2 \]
Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).
Spectrographic Analysis of Speech
Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
\[ S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau] \]
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
\[ E[\tau] = \sum_{n} |x[n, \tau]|^2 \]
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
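The effect of window length on the frame energy can be sketched with a synthetic voiced signal. Everything specific here is an assumption made for illustration: the 8 kHz sampling rate, the 10-ms pitch period, the toy resonance `h`, and the `spectrogram` helper. The sketch compares how strongly the energy per frame fluctuates for a 20-ms window and a 4-ms window.

```python
import numpy as np

def spectrogram(x, win_len, hop, n_fft):
    """S[frame, bin] = |STFT|^2 using a Hamming window of win_len samples."""
    w = np.hamming(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.array([w * x[s:s + win_len] for s in starts])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

fs = 8000                        # assumed sampling rate [Hz]
P = 80                           # pitch period in samples (10 ms)
p = np.zeros(fs)                 # one second of an impulse train
p[::P] = 1.0
n = np.arange(40)
h = 0.9 ** n * np.cos(2 * np.pi * 700 * n / fs)   # toy decaying resonance (assumption)
x = np.convolve(p, h)[:fs]

S_nb = spectrogram(x, win_len=160, hop=8, n_fft=1024)   # 20-ms window: narrowband
S_wb = spectrogram(x, win_len=32, hop=8, n_fft=1024)    # 4-ms window: wideband

# Energy per frame: the long window always spans about two pitch periods,
# while the short window alternately sees a pitch pulse and its decayed tail.
E_nb = S_nb.sum(axis=1)
E_wb = S_wb.sum(axis=1)
ripple_nb = E_nb.std() / E_nb.mean()
ripple_wb = E_wb.std() / E_wb.mean()
```

The larger relative fluctuation for the short window (`ripple_wb`) corresponds to the vertical striations of the wideband spectrogram.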
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram; axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000)]
MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) and spectrogram (Frequency [Hz] vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   the place of the tongue hump along the oral tract, and
   the degree of constriction of the hump.
   The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors (the nature of the source and the shape of the vocal tract) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (the time-domain waveform and the time-varying spectral characteristics) are used, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features yield 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Vowels
  Source: quasi-periodic.
    Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.
  System: each vowel phoneme corresponds to a different vocal tract configuration.
  Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
In spite of the specific properties of different vowels, vowel characteristics vary widely among speakers.
  Articulatory differences among speakers are one cause of allophonic variation:
    the place and degree of constriction of the tongue hump, and
    the cross-section and length of the vocal tract.
  The vocal tract formants therefore vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Nasals
  Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  System:
    The velum is lowered and the air flows mainly through the nasal cavity.
    Because the oral tract is constricted, the sound is radiated at the nostrils.
    Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  Spectrogram:
    Dominated by the low resonance of the large volume of the nasal cavity.
    The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
    Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
  Source:
    For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
    For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
    Noise is generated by turbulent airflow at some point of constriction along the oral tract.
    The constriction is narrower than with vowels.
  System:
    The location of the constriction formed by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
  Spectrogram: noise-like, with energy concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
\[ u[n] = g[n] * p[n] \]
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
\[ x_g[n] = h[n] * (g[n] * p[n]) \]
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:
\[ x_q[n] = h_f[n] * (q[n]\, u[n]) \]
Example 3.4
In the simplified model, we assume that the results of the two airflow sources add:
\[ x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n]) \]
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
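The simplified voiced-fricative model of Example 3.4 can be sketched numerically as follows. The specific shapes used for `g`, `h`, and `h_f`, the 8 kHz sampling rate, and the random seed are assumptions made purely for illustration; only the structure of the model follows the example.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000                      # assumed sampling rate [Hz]
P = 80                         # pitch period in samples
N = 800                        # number of output samples

p = np.zeros(N)
p[::P] = 1.0                                  # impulse train p[n]
g = np.hanning(40)                            # toy glottal pulse g[n] (assumption)
u = np.convolve(p, g)[:N]                     # u[n] = g[n] * p[n]

n = np.arange(60)
h = 0.95 ** n * np.cos(2 * np.pi * 500 * n / fs)      # toy vocal tract response h[n]
h_f = 0.80 ** n * np.cos(2 * np.pi * 3000 * n / fs)   # toy front-cavity response h_f[n]

q = rng.standard_normal(N)                    # white noise source q[n]

x_g = np.convolve(h, u)[:N]                   # x_g[n] = h[n] * u[n]
x_q = np.convolve(h_f, q * u)[:N]             # x_q[n] = h_f[n] * (q[n] u[n])
x = x_g + x_q                                 # the two source contributions add
```

Because the noise is multiplied by `u`, it is present only while the glottis is open in this toy setup, which is the modulation effect the example describes.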
Elements of a Language: Fricatives
  Spectrogram:
    Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  Waveform:
    An unvoiced fricative contains only noise.
    A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
  System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  Sequence of events:
    1. Complete closure of the oral tract and buildup of air pressure.
    2. Release of air pressure and generation of turbulence over a very short time duration.
    3. Generation of aspiration due to turbulence at the open vocal folds.
    4. Onset of the following vowel about 40-50 ms after the burst.
  With voiced plosives, the vocal folds vibrate for the duration of all four steps.
    While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
    After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
    There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language Plosives Waveform
Elements of a Language Plosives Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
\[ u[n] = g[n] * p[n] \]
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
\[ x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m] \]
We have assumed that the two outputs can be linearly combined.
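The generalized convolution of Example 3.5 can be sketched directly from its definition. The helper `time_varying_filter`, the gliding toy resonance used for `h[n, m]`, the fixed toy response used for `h_f[n, m]`, and all sizes are assumptions made for illustration; inputs are taken to be zero before time 0 and responses are truncated to `M` samples.

```python
import numpy as np

def time_varying_filter(h, u):
    """y[n] = sum_m h[n, m] u[n - m], with h[n, m] stored as a 2-D array."""
    N, M = h.shape
    y = np.zeros(N)
    for n in range(N):
        for m in range(min(M, n + 1)):      # only terms with n - m >= 0 contribute
            y[n] += h[n, m] * u[n - m]
    return y

N, M, P = 400, 30, 80
p = np.zeros(N)
p[::P] = 1.0
u = np.convolve(p, np.hanning(20))[:N]      # u[n] = g[n] * p[n], toy glottal pulse

# Toy time-varying vocal tract: a resonance whose frequency glides over time
n = np.arange(N)[:, None]
m = np.arange(M)[None, :]
omega = 2 * np.pi * (0.05 + 0.05 * n / N)
h = 0.9 ** m * np.cos(omega * m)            # h[n, m]
h_f = np.tile(0.7 ** np.arange(M), (N, 1))  # h_f[n, m], kept fixed here for simplicity

burst = np.zeros(N)
burst[0] = 1.0                              # burst idealized as delta[n]

x = time_varying_filter(h, u) + time_varying_filter(h_f, burst)
```

When the response does not depend on `n`, as with `h_f` above, the double loop reduces to ordinary convolution, which gives a simple sanity check for the implementation.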
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU)
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (w as in "we" and y as in "you"), and
  Liquids (r as in "read" and l as in "let").
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.
Figure 3.28: Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  Examples: tS as in "chew" (unvoiced), J as in "just" (voiced)
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep"
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29: Prosody
Figure 3.30: Global Coarticulation
Lungs
  Inhalation and exhalation of air.
  Connected through the trachea ("windpipe") and epiglottis to the vocal tract. The trachea is a ~12-cm-long, ~1.5-2-cm-diameter pipe.
  During speaking, the rhythmical, synchronized cycle of inhalation and exhalation changes to accommodate speech production:
    The duration of exhalation becomes roughly equal to the length of the sentence or phrase.
    Lung air pressure during this time is maintained at a constant level slightly above atmospheric pressure.
Anatomy and Physiology of Speech Production: Larynx
  A complicated system of cartilages, flesh, muscles, and ligaments.
  Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
  The vocal folds are ~15 mm long in men and ~13 mm long in women.
Larynx
Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
  Breathing: the arytenoid cartilages are held outward.
  Voiced: the arytenoid cartilages are held close together.
  Unvoiced: the arytenoid cartilages are held outward or partially closed.
The complex motion of the vocal folds is illustrated in Figure 3.4.
Nonlinear two-mass model of Flanagan et al. (Figure 3.5).
Dictionary: arytenoid (ar·y·te·noid)
  Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid
  Function: adjective
  Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, ladle
  Date: circa 1751
  1. relating to or being either of two small laryngeal cartilages to which the vocal cords are attached
  2. relating to or being either of a pair of small muscles or an unpaired muscle of the larynx
  Also: arytenoid, noun
Anatomy and Physiology of Speech Production: Flanagan et al. model
Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:
  Closed phase: the folds are closed and no flow occurs.
  Open phase: the folds are open and the flow increases up to a maximum.
  Return phase: the time interval from the maximum airflow until glottal closure.
The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.
Glottal airflow is referred to as glottal flow.
The time duration of one glottal cycle is referred to as the pitch period.
The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
Anatomy and Physiology of Speech Production
Example 3.1
Consider a glottal flow waveform model of the form
\[ u[n] = g[n] * p[n] \]
where $g[n]$ is the glottal flow waveform over a single cycle and $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$ is an impulse train with spacing $P$.
Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as
\[ u[n, \tau] = w[n, \tau]\,(g[n] * p[n]) \]
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:
\[ U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] \]
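Example 3.1
\[ U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau) \]
where $W(\omega)$ is the Fourier transform of $w[n]$, $G(\omega)$ is the Fourier transform of $g[n]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency, or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.
Effect of the harmonics of the glottal waveform on the spectrum: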
Figure 3.7
Example 3.1
A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = \frac{2\pi}{P} k$.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
The magnitude of the spectral shaping function, i.e., of the glottal flow, $|G(\omega_k)|$, is referred to as the spectral envelope of the harmonics.
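The relation between pitch period and harmonic spacing can be checked numerically with a windowed impulse train. The function `windowed_spectrum`, the 8 kHz sampling rate, the 50-ms Hamming window, and the toy glottal pulse are assumptions made for illustration; the FFT length is chosen so that each bin corresponds to 1 Hz.

```python
import numpy as np

def windowed_spectrum(P, fs=8000, win_len=400, n_fft=8000):
    """|U| for a windowed impulse train with spacing P, shaped by a toy glottal pulse."""
    p = np.zeros(4 * win_len)
    p[::P] = 1.0                                # impulse train with spacing P
    g = np.hanning(20)                          # toy glottal pulse g[n] (assumption)
    u = np.convolve(p, g)[:len(p)]              # u[n] = g[n] * p[n]
    seg = np.hamming(win_len) * u[win_len:2 * win_len]   # windowed segment
    return np.abs(np.fft.rfft(seg, n=n_fft))    # bin k corresponds to k * fs / n_fft Hz

U_long = windowed_spectrum(P=80)    # 10-ms pitch period: harmonics every 100 Hz
U_short = windowed_spectrum(P=40)   # 5-ms pitch period: harmonics every 200 Hz
```

With the shorter pitch period, the 100 Hz bin falls between two harmonics and carries little energy, while the 200 Hz bin sits on a harmonic; halving the pitch period has doubled the harmonic spacing.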
Anatomy and Physiology of Speech Production
The Fourier transform of a periodic glottal waveform is characterized by harmonics.
  Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  The pitch period can vary "randomly" over successive periods: pitch "jitter".
  The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
These variations are (perhaps) due to:
  time-varying characteristics of the vocal tract and vocal folds,
  nonlinear behavior in the speech anatomy, or
  an underlying deterministic (chaotic) system whose output only appears random.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
Anatomy and Physiology of Speech Production
States of the vocal folds: breathing, voicing, unvoicing.
  Unvoicing: turbulence at the vocal folds (aspiration). Example: "he" (whispered sounds).
  Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
  Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, the result of coupling of the false vocal folds with the true vocal folds.
  Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Anatomy and Physiology of Speech Production
Examples of atypical voice types
Vocal Tract
Comprises the oral cavity from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
  The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
The purpose of the vocal tract is to
  spectrally "color" the source, and
  generate new sources for sound production.
Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  The frequency of the formant is $\omega_0$.
  The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
\[ H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} \]
\[ H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} \]
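To make the link between a pole and a formant concrete, the sketch below evaluates a single conjugate pole pair. The function names, the 8 kHz sampling rate, and the chosen pole are assumptions made for illustration. The bandwidth formula $B \approx -\ln(r_0)\, f_s / \pi$ is a commonly used approximation for poles near the unit circle; it does not come from the slides.

```python
import numpy as np

def formant_from_pole(r0, omega0, fs):
    """Formant frequency and approximate 3-dB bandwidth (Hz) of a pole r0 * exp(j * omega0)."""
    freq_hz = omega0 * fs / (2 * np.pi)
    bandwidth_hz = -np.log(r0) * fs / np.pi     # approximation, valid for r0 close to 1
    return freq_hz, bandwidth_hz

def resonator_response(r0, omega0, omega):
    """|H| for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)) with c = r0 * exp(j * omega0)."""
    c = r0 * np.exp(1j * omega0)
    z_inv = np.exp(-1j * omega)
    return np.abs(1.0 / ((1 - c * z_inv) * (1 - np.conj(c) * z_inv)))

fs = 8000                                       # assumed sampling rate [Hz]
r0, omega0 = 0.97, 2 * np.pi * 500 / fs         # pole chosen for illustration
freq_hz, bandwidth_hz = formant_from_pole(r0, omega0, fs)

omega = np.linspace(0, np.pi, 4096)
H = resonator_response(r0, omega0, omega)
peak_hz = omega[np.argmax(H)] * fs / (2 * np.pi)   # location of the spectral peak
```

Moving the pole closer to the unit circle (larger `r0`) narrows the bandwidth and sharpens the peak, while the peak stays close to the pole angle.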
Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
\[ u[n] = g[n] * p[n] \]
where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by
\[ x[n] = h[n] * (g[n] * p[n]) \]
A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment
\[ x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\} \]
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained.
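Example 3.2
\[ X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau) \]
where $W(\omega)$ is the Fourier transform of $w[n]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency, or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only a glottal contribution).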
Example 3.2: Figure 3.11
Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  the manner in which the formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse"
Spectral Shaping Nose
Spectral Shaping Mouse
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage, and
  introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
April 22 2023 Veton Keumlpuska 45
Source Generation
The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  Build-up of pressure behind the palate, and
  Abrupt release of pressure
April 22 2023 Veton Keumlpuska 46
Source Generation: Plosives, "Drop"
[Figure annotations: VOT (voice onset time), aspiration]
Fricatives
April 22 2023 Veton Keumlpuska 48
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by |H(ω)|.
Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.
April 22 2023 Veton Keumlpuska 49
Source Generation: Fricatives, "NASA"
April 22 2023 Veton Keumlpuska 50
Source Generation
There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
April 22 2023 Veton Keumlpuska 51
Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source
Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
  Fricatives - sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin"
  Plosives - created with an impulsive source within the oral tract. Example: "top"
  Whispers - a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"
However, these subcategories do not relate exclusively to the unvoiced case. The vocal folds can be vibrating simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  "zebra" vs. "sheba" - fricatives
  "bin" vs. "pin" - plosives
April 22 2023 Veton Keumlpuska 52
Categorization of Sound By Source
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 54
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
April 22 2023 Veton Keumlpuska 55
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window
$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n \le N_w - 1$
The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] e^{-j\omega n}$
where
$x[n, \tau] = w[n, \tau] x[n]$
represents the windowed speech segments as a function of the window center at time $\tau$.
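A minimal sketch of the windowing and short-time transform described above, using only the Python standard library. The hop size, transform length, and test signal are assumptions made for illustration; each frame is also computed with the time origin at the start of the window, which changes only the phase of the result, not its magnitude.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46*cos(2*pi*n/(n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft(x, n_w, hop, n_freq):
    """Short-time Fourier transform: one DFT per window position tau.
    Each frame holds n_freq complex values at w_k = 2*pi*k/n_freq."""
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        seg = [w[n] * x[tau + n] for n in range(n_w)]   # windowed segment
        frame = [sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_freq)
                     for n in range(n_w))
                 for k in range(n_freq)]
        frames.append(frame)
    return frames

# Toy input: a sinusoid that falls exactly on bin 8 of a 64-point transform.
x = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
X = stft(x, n_w=64, hop=32, n_freq=64)
spectrogram = [[abs(v) ** 2 for v in frame] for frame in X]   # squared magnitude
peak_bin = max(range(33), key=lambda k: spectrogram[0][k])
print(len(X), "frames, peak at bin", peak_bin)
```

The direct DFT here is quadratic in the window length; it is written this way to stay close to the defining sum rather than for speed.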
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband - gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies
  Wideband - gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \}$
$x[n, \tau] = w[n, \tau] (p[n] * \hat{h}[n])$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k) W(\omega - \omega_k, \tau) \right|^2$
where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
  the main lobes of the shifted window Fourier transforms are non-overlapping, and
  the corresponding transform side lobes are negligible,
the following approximation follows from the equation in the previous slide (Exercise 3.8):
$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2 |W(\omega - \omega_k, \tau)|^2$
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
$S(\omega, \tau) \approx \beta |\hat{H}(\omega)|^2 E[\tau]$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$
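The role of the window energy can be illustrated with a toy periodic waveform. This is a sketch under assumptions: a rectangular window stands in for the tapered one, and the decaying pulse and its period are hypothetical stand-ins for one period of a voiced sound.

```python
import math

def sliding_energy(x, n_w, hop=1):
    """Energy under a rectangular window of n_w samples at each position tau."""
    return [sum(v * v for v in x[tau:tau + n_w])
            for tau in range(0, len(x) - n_w + 1, hop)]

# Toy "voiced" waveform: a decaying pulse repeated every P samples.
P = 80                                            # pitch period in samples (assumed)
pulse = [math.exp(-n / 10.0) for n in range(P)]   # stand-in for one period
x = pulse * 5

short = sliding_energy(x, n_w=20)    # window shorter than one pitch period
long_ = sliding_energy(x, n_w=160)   # window spanning two pitch periods

short_ratio = max(short) / min(short)
long_ratio = max(long_) / min(long_)
print(f"short window max/min energy: {short_ratio:.1f}")
print(f"long window max/min energy:  {long_ratio:.6f}")
```

The short-window energy rises and falls once per pitch period, while the energy under the two-period window stays essentially constant as it slides.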
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
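The rule of thumb relating window duration to pitch period can be written as a small helper. This is a sketch: the function names are hypothetical, and the 100 Hz pitch and 16 kHz sampling rate are assumed values, not taken from the figure.

```python
def spectrogram_type(window_ms, pitch_hz):
    """Classify an analysis window relative to the pitch period:
    at least two pitch periods -> narrowband,
    less than one pitch period -> wideband."""
    pitch_period_ms = 1000.0 / pitch_hz
    if window_ms >= 2.0 * pitch_period_ms:
        return "narrowband"
    if window_ms < pitch_period_ms:
        return "wideband"
    return "intermediate"

def window_length_samples(window_ms, fs_hz):
    """Window length in samples for a given duration and sampling rate."""
    return int(round(window_ms * fs_hz / 1000.0))

# Assumed values: 100 Hz pitch (10 ms period) and a 16 kHz sampling rate.
for ms in (20.0, 4.0):
    print(ms, "ms:", spectrogram_type(ms, 100.0),
          window_length_samples(ms, 16000.0), "samples")
```

With a 10 ms pitch period, the 20-ms window covers two periods and the 4-ms window less than one, matching the two cases compared in the figure.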
April 22 2023 Veton Keumlpuska 66
Figure 3.15
April 22 2023 Veton Keumlpuska 67
Figure 3.16
MATLAB
April 22 2023 Veton Keumlpuska 68
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000]
MATLAB
April 22 2023 Veton Keumlpuska 69
[Figure: waveform (normalized magnitude) above its spectrogram (Frequency [Hz], 0 to 8000; Time [s], 0 to 3) for the utterance "She had your dark suit in greasy wash water all year", with word and phone labels: h | sh iy (she) | hv ae dcl jh (had) | ax-h (your) | dcl d aa r kcl k (dark) | s ux tcl (suit) | ax-h n (in) | gcl g r iy s ix (greasy) | w aa sh (wash) | w ao dx axr (water) | ao l (all) | y ih axr (year) | h]
MATLAB
April 22 2023 Veton Keumlpuska 70
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000]
MATLAB
April 22 2023 Veton Keumlpuska 71
[Figure: waveform (normalized magnitude) above its spectrogram (Frequency [Hz], 0 to 8000; Time [s], 0 to 3) for the utterance "She had your dark suit in greasy wash water all year", with word and phone labels: h | sh iy (she) | hv ae dcl jh (had) | ax-h (your) | dcl d aa r kcl k (dark) | s ux tcl (suit) | ax-h n (in) | gcl g r iy s ix (greasy) | w aa sh (wash) | w ao dx axr (water) | ao l (all) | y ih axr (year) | h]
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Classification of speech sounds can also be done from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
2. The shape of the vocal tract, i.e., place and manner of articulation:
  The place of the tongue hump along the oral tract, and
  The degree of the constriction of the hump
  The shape is also determined by possible connection to the nasal passage by way of the velum
3. The time-domain waveform, which gives the pressure change with time at the lips output
4. The time-varying spectral characteristics revealed through the spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language
Phoneme - a fundamental distinctive unit of a language
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two factors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
April 22 2023 Veton Keumlpuska 74
Elements of a Language
One broad classification for the English language is in terms of
  Vowels
  Consonants
  Diphthongs
  Affricates, and
  Semi-vowels
This classification is illustrated in Figure 3.17 in the next slide.
April 22 2023 Veton Keumlpuska 75
Figure 3.17
April 22 2023 Veton Keumlpuska 76
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the t in each word is somewhat different
Motor theory of perception - uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
April 22 2023 Veton Keumlpuska 78
Elements of a Language: Vowels
Source: quasi-periodic
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
System
  Each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram
  The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform
  Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ; therefore the vocal tract formants vary with the speaker.
April 22 2023 Veton Keumlpuska 79
Figure 3.18
April 22 2023 Veton Keumlpuska 80
Figure 3.19
April 22 2023 Veton Keumlpuska 81
Elements of a Language: Nasals
Source
  Quasi-periodic airflow puffs from the vibrating vocal folds
System
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
April 22 2023 Veton Keumlpuska 82
Figure 3.20
April 22 2023 Veton Keumlpuska 83
Figure 3.21
April 22 2023 Veton Keumlpuska 84
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System
  The location of the constriction by the tongue or lips determines which sound is produced: at the back, center, or front of the oral tract, as well as at the teeth or lips.
Spectrogram
  Noise-like
  Energy is concentrated in the higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
A simplified model of the periodic component of the voiced fricative output at the lips is
$x_g[n] = h[n] * (g[n] * p[n])$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, which has impulse response $h_f[n]$:
$x_q[n] = h_f[n] * (q[n] u[n])$
April 22 2023 Veton Keumlpuska 86
Example 3.4
In the simplified model we assume that the results of the two airflow sources add:
$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity
  $x_q[n]$ can be influenced by the back cavity
  Sources of nonlinear effects (distributed sources due to traveling vortices)
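The additive model of this example can be sketched directly. Everything numeric below is a hypothetical toy choice (pulse shape, impulse responses, pitch period, noise seed) made only so that the sketch runs; it is not a calibrated speech model.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def voiced_fricative(g, P, n_periods, h, hf, seed=0):
    """x[n] = h[n]*u[n] + hf[n]*(q[n]u[n]) with u[n] = g[n]*p[n]."""
    p = [1.0 if n % P == 0 else 0.0 for n in range(P * n_periods)]   # impulse train
    u = convolve(g, p)[:len(p)]                      # periodic glottal flow
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in u]             # white-noise source
    modulated = [qn * un for qn, un in zip(q, u)]    # noise modulated by glottal flow
    xg = convolve(h, u)                              # periodic component
    xq = convolve(hf, modulated)                     # noise component
    n = min(len(xg), len(xq))
    return [xg[i] + xq[i] for i in range(n)], u

g = [0.0, 0.5, 1.0, 0.5]     # one glottal pulse (toy values)
h = [1.0, 0.6, 0.2]          # vocal tract impulse response (toy values)
hf = [1.0, -0.5]             # front-cavity impulse response (toy values)
x, u = voiced_fricative(g, P=10, n_periods=4, h=h, hf=hf)
print(len(x), "output samples")
```

Because the noise is multiplied by the glottal flow before filtering, the noise component vanishes wherever the glottal flow is zero, so the noise is pitch-synchronous in this model.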
April 22 2023 Veton Keumlpuska 87
Elements of a Language: Fricatives
Spectrogram
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Whispers form a class of their own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 3.24 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 3.23
April 22 2023 Veton Keumlpuska 91
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System
  The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all 4 steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator. The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
$x[n] = \sum_{m=-\infty}^{\infty} h[n, m] u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \delta[n - m]$
We have assumed that the two outputs can be linearly combined.
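The generalized (time-varying) convolution of this example can be sketched as follows. The one-pole responses, the drift of the pole, and the idealized pulse input are hypothetical choices for illustration; the sketch also assumes causal responses, so the sum over m runs only from 0 to n.

```python
def time_varying_output(u, h, hf, burst_time=0):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) d[n - m],
    where h(n, m) and hf(n, m) are callables giving the response at time n
    to a unit sample applied m samples earlier, and d[n] is a unit sample
    located at burst_time."""
    x = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m] for m in range(n + 1))
        m_burst = n - burst_time          # the only m for which d[n - m] = 1
        burst = hf(n, m_burst) if m_burst >= 0 else 0.0
        x.append(periodic + burst)
    return x

def h(n, m):
    """Toy vocal tract: one-pole response whose pole drifts from 0.5 toward 0.9."""
    a = 0.5 + 0.004 * min(n, 100)
    return a ** m

def hf(n, m):
    """Toy front cavity: fixed one-pole response."""
    return (-0.6) ** m

P = 20
u = [1.0 if n % P == 0 else 0.0 for n in range(60)]   # idealized glottal pulses
x = time_varying_output(u, h, hf)
print(x[:3])
```

If `h(n, m)` did not depend on `n`, the first sum would reduce to an ordinary convolution, which is the time-invariant case used in the earlier examples.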
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds
  Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another
  Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new"
April 22 2023 Veton Keumlpuska 96
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (w as in "we" and y as in "you"), and
  Liquids (r as in "read" and l as in "let")
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.
April 22 2023 Veton Keumlpuska 97
Figure 3.28 - Liquids & Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  Examples: tS as in "chew" (unvoiced), J as in "just" (voiced)
April 22 2023 Veton Keumlpuska 99
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  Local - the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep"
  Global - the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes
April 22 2023 Veton Keumlpuska 100
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different
  meaning,
  stress, and
  emotion.
April 22 2023 Veton Keumlpuska 101
Figure 3.29 - Prosody
April 22 2023 Veton Keumlpuska 102
Figure 3.30 - Global Coarticulation
April 22 2023 Veton Keumlpuska 11
Anatomy and Physiology of Speech Production: Larynx
A complicated system of cartilages, flesh, muscles, and ligaments
Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
The vocal folds are
  about 15 mm long in men
  about 13 mm long in women
Larynx
April 22 2023 Veton Keumlpuska 12
April 22 2023 Veton Keumlpuska 13
Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
  Breathing - the arytenoid cartilages are held outward
  Voiced - the arytenoid cartilages are held close together
  Unvoiced - the arytenoid cartilages are held outward or partially closed
The complex motion of the vocal folds is illustrated in Figure 3.4.
Nonlinear two-mass model of Flanagan et al. (Figure 3.5)
Dictionary: arytenoid (ar·y·te·noid). Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, ladle. Date: circa 1751. 1: relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. 2: relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Also a noun.
Anatomy and Physiology of Speech Production: Flanagan et al. model
April 22 2023 Veton Keumlpuska 14
April 22 2023 Veton Keumlpuska 15
Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6.
  Closed phase: the folds are closed and no flow occurs
  Open phase: the folds are open and the flow increases up to a maximum
  Return phase: the time interval from the maximum airflow until the glottal closure
The specific flow shape can change with
  the speaker,
  the speaking style, and
  the specific speech sound.
The airflow at the glottis is referred to as the glottal flow.
The time duration of one glottal cycle is referred to as the pitch period.
The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
Anatomy and Physiology of Speech Production
April 22 2023 Veton Keumlpuska 16
April 22 2023 Veton Keumlpuska 17
Example 3.1
Consider a glottal flow waveform model of the form
$u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow waveform over a single cycle and $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$ is an impulse train with spacing $P$.
Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as
$u[n, \tau] = w[n, \tau] (g[n] * p[n])$
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:
$U(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega) \delta(\omega - \omega_k) \right]$
April 22 2023 Veton Keumlpuska 18
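Example 3.1
$U(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k) \delta(\omega - \omega_k) \right]$
$U(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k) W(\omega - \omega_k, \tau)$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = \frac{2\pi}{P} k$, where $\frac{2\pi}{P}$ is the fundamental frequency, or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.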
Effect of the harmonics of the glottal waveform on the spectrum
April 22 2023 Veton Keumlpuska 19
Figure 3.7
April 22 2023 Veton Keumlpuska 20
Example 3.1
A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = \frac{2\pi}{P} k$.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
The magnitude of the spectral shaping function, i.e., of the glottal flow, $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
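The relation between pitch period and harmonic spacing is simple arithmetic, sketched below. The 8 kHz sampling rate and the two pitch periods are assumed values chosen for illustration.

```python
import math

def harmonic_frequencies(P, n_harmonics):
    """Harmonic frequencies (2*pi/P)*k in radians/sample for k = 1..n_harmonics."""
    return [2.0 * math.pi * k / P for k in range(1, n_harmonics + 1)]

def to_hz(omega, fs):
    """Convert radians/sample to Hz for a sampling rate fs."""
    return omega * fs / (2.0 * math.pi)

fs = 8000.0                       # assumed sampling rate (Hz)
for P in (80, 40):                # pitch period in samples
    hz = [round(to_hz(w, fs)) for w in harmonic_frequencies(P, 3)]
    print("P =", P, "->", hz, "Hz")
```

Halving the pitch period from 80 to 40 samples doubles the harmonic spacing, from 100 Hz to 200 Hz at this sampling rate.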
April 22 2023 Veton Keumlpuska 21
Anatomy and Physiology of Speech Production
The Fourier transform of the periodic glottal waveform is characterized by harmonics.
  Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  The pitch period can vary "randomly" over successive periods: pitch "jitter".
  The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
Those variations are (perhaps) due to
  time-varying characteristics of the vocal tract and vocal folds, or
  nonlinear behavior in the speech anatomy, or
  they appear random while being the result of an underlying deterministic (chaotic) system.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
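Jitter and shimmer can be quantified in several ways; the slides do not name a specific measure. The sketch below uses one common choice, assumed here for illustration: the mean absolute difference between consecutive cycles divided by the mean value. The measurement lists are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the
    mean value. Applied to pitch periods this is one jitter measure; applied
    to per-cycle peak amplitudes, one shimmer measure."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical measurements over five consecutive glottal cycles.
periods_ms = [10.0, 10.2, 9.9, 10.1, 9.8]
amplitudes = [1.00, 0.95, 1.02, 0.97, 1.01]

jitter = relative_perturbation(periods_ms)
shimmer = relative_perturbation(amplitudes)
monotone = relative_perturbation([10.0] * 5)   # fixed pitch period: no perturbation
print(f"jitter {jitter:.3f}, shimmer {shimmer:.3f}, monotone {monotone:.3f}")
```

A perfectly fixed pitch period gives zero under this measure, corresponding to the machine-like case mentioned above.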
April 22 2023 Veton Keumlpuska 22
Anatomy and Physiology of Speech Production: States of the Vocal Folds
Breathing
Voicing
Unvoicing
  Turbulence at the vocal folds: aspiration. Example: "he" (whispered sounds)
  Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates and part is nearly fixed.
April 22 2023 Veton Keumlpuska 23
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  Creaky voice - very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch
  Vocal fry - the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, the result of coupling of the false vocal folds with the true vocal folds.
  Diplophonic voice - secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16)
April 22 2023 Veton Keumlpuska 24
Anatomy and Physiology of Speech Production
April 22 2023 Veton Keumlpuska 25
Examples of atypical voice types
April 22 2023 Veton Keumlpuska 26
Vocal Tract
Comprises the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
The purpose of the vocal tract is to
  spectrally "color" the source, and
  generate new sources for sound production.
April 22 2023 Veton Keumlpuska 27
Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
April 22 2023 Veton Keumlpuska 28
Figure 3.10
April 22 2023 Veton Keumlpuska 29
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  The frequency of the formant is $\omega_0$.
  The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})}$
$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^* z^{-1})}$
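The link between a pole location and a formant can be checked numerically. This is a sketch under assumptions: the sampling rate and the 500 Hz pole are hypothetical, and the bandwidth formula is a standard approximation for poles close to the unit circle rather than something stated in the slides.

```python
import cmath
import math

def pole_to_formant(r0, omega0, fs):
    """Formant frequency and approximate 3-dB bandwidth (both in Hz) for a
    pole at r0 * exp(j*omega0); bandwidth is approximated by -ln(r0)*fs/pi."""
    return omega0 * fs / (2.0 * math.pi), -math.log(r0) * fs / math.pi

def all_pole_magnitude(poles, omega, gain=1.0):
    """Magnitude on the unit circle of gain / prod_k (1 - c_k z^-1)(1 - c_k* z^-1)."""
    z_inv = cmath.exp(-1j * omega)
    denom = 1.0 + 0j
    for c in poles:
        denom *= (1 - c * z_inv) * (1 - c.conjugate() * z_inv)
    return gain / abs(denom)

fs = 8000.0                                               # assumed sampling rate
pole = 0.97 * cmath.exp(1j * 2 * math.pi * 500.0 / fs)    # hypothetical 500 Hz formant
freq_hz, bw_hz = pole_to_formant(abs(pole), cmath.phase(pole), fs)

# Locate the peak of the magnitude response on a 10 Hz grid.
grid = [math.pi * i / 400 for i in range(401)]
peak = max(grid, key=lambda w: all_pole_magnitude([pole], w))
peak_hz = peak * fs / (2.0 * math.pi)
print(f"formant {freq_hz:.0f} Hz, bandwidth about {bw_hz:.0f} Hz, peak near {peak_hz:.0f} Hz")
```

Moving the pole radius closer to 1 narrows the bandwidth and sharpens the spectral peak, which is the dependence on distance from the unit circle noted above.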
April 22 2023 Veton Keumlpuska 30
Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases.
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Vowels
April 22 2023 Veton Keumlpuska 31
April 22 2023 Veton Keumlpuska 32
Example 3.2
Consider a periodic glottal flow source of the form
$u[n] = g[n] * p[n]$
where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by
$x[n] = h[n] * (g[n] * p[n])$
A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment
$x[n, \tau] = w[n, \tau] \{ h[n] * (g[n] * p[n]) \}$
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained.
April 22 2023 Veton Keumlpuska 33
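Example 3.2
$X(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k) \delta(\omega - \omega_k) \right]$
$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k) W(\omega - \omega_k, \tau)$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency, or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega) G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).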
April 22 2023 Veton Keumlpuska 34
Example 32
April 22 2023 Veton Keumlpuska 35
Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  the manner in which the formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega) G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency.
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3
A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12
Nasal Sounds

Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage
Plosives

Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  Build-up of pressure behind the palate, and
  Abrupt release of pressure

Source Generation: Plosives, "Drop"
Figure labels: VOT (voice onset time), Aspiration
Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation
There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source
Unvoiced: speech sounds not generated with a periodic glottal source
There are a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin"
  Plosives: created with an impulsive source within the oral tract. Example: "top"
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"
However, these subcategories do not relate exclusively to an unvoiced source. That is, the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  "zebra" vs. "sheba": fricatives
  "bin" vs. "pin": plosives

Categorization of Sound by Source
Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to"
  A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, the concept of a sliding (analysis) window was introduced.
  This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: Hamming window
  $w[n, \tau] = 0.54 - 0.46 \cos[2\pi (n - \tau)/(N_w - 1)]$ for $0 \le n - \tau \le N_w - 1$
The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau] x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
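The windowing and transform steps above can be sketched as follows. This is a direct, unoptimized evaluation of the STFT sum, offered as an illustration only: the function names, the frequency grid, and the test sinusoid are assumptions, and here the window is taken to start at sample tau.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w (tapered toward both ends)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft_frame(x, tau, n_w, omega):
    """X(omega, tau): Fourier transform of the segment of x under a window placed at tau."""
    w = hamming(n_w)
    total = 0j
    for m in range(n_w):
        n = tau + m
        if 0 <= n < len(x):
            total += w[m] * x[n] * cmath.exp(-1j * omega * n)
    return total

def spectrogram(x, n_w, hop, n_freqs):
    """|X(omega_k, tau)|^2 on the grid omega_k = pi * k / n_freqs, one row per frame."""
    rows = []
    for tau in range(0, len(x) - n_w + 1, hop):
        rows.append([abs(stft_frame(x, tau, n_w, math.pi * k / n_freqs)) ** 2
                     for k in range(n_freqs)])
    return rows

# Usage: a sinusoid at omega = pi/4 should peak at bin k = 8 (of 32) in every frame.
x = [math.cos(math.pi / 4 * n) for n in range(256)]
S = spectrogram(x, n_w=64, hop=32, n_freqs=32)
peak_bins = [row.index(max(row)) for row in S]
```

The `hop` argument plays the role of the frame interval: the window advances by `hop` samples between frames rather than one sample at a time.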

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$, one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \}$
$x[n, \tau] = w[n, \tau] ( p[n] * \hat{h}[n] )$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k) W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side lobes are negligible,
  the following approximation follows from the equation on the previous slide (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2 |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech
Wideband Spectrogram
  The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta |\hat{H}(\omega)|^2 E[\tau]$$
  where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
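The energy under the sliding window can be sketched as below. The decaying pulse train used as the "voiced" signal, the window length, and the function names are assumptions for illustration; the point is only that a window shorter than the pitch period sees an energy that rises and falls once per period.

```python
import math

def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def short_time_energy(x, tau, n_w):
    """Sum over n of |w[n, tau] * x[n]|^2 for a window of length n_w starting at tau."""
    w = hamming(n_w)
    return sum((w[m] * x[tau + m]) ** 2
               for m in range(n_w) if 0 <= tau + m < len(x))

P = 80                                       # assumed pitch period in samples
x = [0.9 ** (n % P) for n in range(400)]     # toy signal: a decaying pulse every P samples
n_w = 32                                     # short window: less than one pitch period
energy = [short_time_energy(x, tau, n_w) for tau in range(240)]
```

Plotting `energy` against `tau` would show peaks spaced `P` samples apart, which is the mechanism behind the vertical striations of a wideband spectrogram.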

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
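The trade-off between the 20-ms and 4-ms windows can be checked with a rough calculation. The sketch below assumes the common textbook approximation that a Hamming window of N samples has a null-to-null main-lobe width of about 8*pi/N radians per sample, i.e. about 4/T Hz for a window of duration T seconds; the 200 Hz pitch and the function names are hypothetical.

```python
def hamming_mainlobe_hz(window_ms):
    """Approximate null-to-null main-lobe width in Hz: 4 / T with T in seconds."""
    return 4000.0 / window_ms

def harmonics_resolved(window_ms, f0_hz):
    """Rough criterion: main lobes at neighboring harmonics do not overlap
    when the main-lobe width is no larger than the harmonic spacing f0."""
    return hamming_mainlobe_hz(window_ms) <= f0_hz

def pitch_periods_in_window(window_ms, f0_hz):
    """Number of pitch periods covered by the window."""
    return window_ms * f0_hz / 1000.0

f0_hz = 200.0  # assumed pitch for the example
for window_ms in (20.0, 4.0):
    print(window_ms,
          hamming_mainlobe_hz(window_ms),
          harmonics_resolved(window_ms, f0_hz),
          pitch_periods_in_window(window_ms, f0_hz))
```

Under these assumptions the 20-ms window spans several pitch periods and keeps the harmonics apart (narrowband behavior), while the 4-ms window spans less than one period and its main lobe covers several harmonics (wideband behavior).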

Figure 3.15

Figure 3.16

MATLAB
[Figure: SPECTROGRAM; axes: Frequency [Hz] (0 to 8000) vs. Time [s] (0 to 3)]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above a SPECTROGRAM (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
2. The shape of the vocal tract (place and manner of articulation):
   The place of the tongue hump along the oral tract, and
   The degree of constriction of the hump
   The shape is also determined by a possible connection to the nasal passage by way of the velum
3. The time-domain waveform, which gives the pressure change with time at the lips output
4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language
Phoneme: a fundamental distinctive unit of a language
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two factors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of
  Vowels
  Consonants
  Diphthongs
  Affricates
  Semi-vowels
This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking
  Time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the /t/ in each word is somewhat different
Motor theory of perception: uses articulatory features derived from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels
Source
  Quasi-periodic
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System
  Each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram
  The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform
  Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source
  Quasi-periodic airflow puffs from the vibrating vocal folds
System
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram
  Is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System
  The location of the constriction made by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram
  Noise-like
  Energy is concentrated at higher frequencies

Example 3.4
A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as
$u[n] = g[n] * p[n]$
  $g[n]$ is the glottal flow over one cycle
  $p[n]$ is an impulse train with pitch period $P$
Simplified model of the voiced-fricative output at the lips due to the periodic source:
$x_g[n] = h[n] * (g[n] * p[n])$
  $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, which has impulse response $h_f[n]$:
$x_q[n] = h_f[n] * (q[n] u[n])$

Example 3.4
We assume in the simplified model that the results of the two airflow sources add:
$x[n] = x_g[n] + x_q[n]$
$x[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity
  $x_q[n]$ can be influenced by the back cavity
  Sources of nonlinear effects (distributed sources due to traveling vortices)
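The simplified voiced-fricative model of Example 3.4 can be sketched as follows. All numerical values (the short glottal pulse, the two impulse responses, the noise level, and the pitch period) are toy assumptions, and the function names are hypothetical; the sketch only mirrors the structure of the model, with the glottal flow u[n] multiplying the noise q[n] before it excites the front cavity.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def voiced_fricative(g, period, h, h_f, q):
    """Return (x, x_g, x_q) with x = h*u + h_f*(q.u), truncated to len(q) samples."""
    n_samples = len(q)
    p = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    u = convolve(g, p)[:n_samples]                   # periodic glottal flow u = g * p
    x_g = convolve(h, u)[:n_samples]                 # periodic component through the full tract
    modulated = [qn * un for qn, un in zip(q, u)]    # noise modulated by the glottal flow
    x_q = convolve(h_f, modulated)[:n_samples]       # noise component through the front cavity
    x = [a + b for a, b in zip(x_g, x_q)]
    return x, x_g, x_q

rng = random.Random(0)                   # fixed seed so the sketch is repeatable
g = [0.0, 0.5, 1.0, 0.5]                 # toy glottal pulse over one cycle
h = [1.0, 0.6, 0.36]                     # toy vocal tract impulse response
h_f = [1.0, -0.5]                        # toy front-cavity impulse response
q = [rng.gauss(0.0, 0.1) for _ in range(64)]   # white-noise source at the constriction
x, x_g, x_q = voiced_fricative(g, 16, h, h_f, q)
```

Because the noise is multiplied by u[n], the noise component in this sketch is nonzero only around the open portion of each glottal cycle, so it is pulsed at the pitch rate.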

Elements of a Language: Fricatives
Spectrogram
  Unvoiced fricatives are characterized by a "noisy" spectrum, while
  Voiced fricatives often show both noise and harmonics.
Waveform
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System
  The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$u[n] = g[n] * p[n]$
  $g[n]$ is the glottal flow over one cycle
  $p[n]$ is an impulse train with pitch period $P$
Assume that the vocal tract is linear but time-varying because the vocal tract shape changes during the transition from the burst to the following steady vowel.
  This implies that the vocal tract output cannot be obtained with the convolution operator.
  The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response denoted by $h_f[n, m]$.
  $h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m] u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \delta[n - m]$$
We have assumed that the two outputs can be linearly combined.
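The generalized (time-varying) superposition sum of Example 3.5 can be sketched as follows. The sketch assumes causal responses and an input that starts at n = 0, so the sums run over m = 0..n; the particular drifting one-pole responses and all names are hypothetical choices for illustration.

```python
def time_varying_output(u, h, h_f, n_samples):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m h_f(n, m) delta[n-m].

    h and h_f are callables returning the response at time n to a unit
    sample applied m samples earlier (assumed zero for m < 0).
    """
    x = []
    for n in range(n_samples):
        periodic = sum(h(n, m) * u[n - m]
                       for m in range(n + 1) if n - m < len(u))
        burst = h_f(n, n)      # delta[n - m] is nonzero only when m = n
        x.append(periodic + burst)
    return x

# Toy example: a one-pole "vocal tract" whose pole drifts from 0.5 toward 0.8
# during the transition, and a front-cavity response that decays quickly.
def h(n, m):
    pole = 0.5 + 0.3 * min(n, 50) / 50.0
    return pole ** m

def h_f(n, m):
    return 0.6 ** m

P = 20
u = [1.0 if n % P == 0 else 0.0 for n in range(100)]   # impulse-like periodic source
x = time_varying_output(u, h, h_f, 100)
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which gives a simple sanity check for the sketch.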

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds
  Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations
  Characterized by movement from one vowel target to another
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new"

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (/w/ as in "we" and /y/ as in "you")
  Liquids (/r/ as in "read" and /l/ as in "let")
Glides, compared to diphthongs, are characterized by
  Greater constriction of the oral tract during the transition, and
  Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have
  A fricative portion preceded by a complete constriction of the oral cavity
  The constriction formed at the same place as for the plosive
Examples:
  /tS/ as in "chew" (unvoiced)
  /J/ as in "just" (voiced)

Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time
    "horse" vs. "horseshoe"
    "sweep" vs. "seep"
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different
  Meaning
  Stress
  Emotion

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation
Larynx

Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
  Breathing: arytenoid cartilages are held outward
  Voiced: arytenoid cartilages are held close together
  Unvoiced: arytenoid cartilages are held outward or partially closed
The complex motion of the vocal folds is illustrated in Figure 3.4.
  Nonlinear two-mass model of Flanagan et al. (Figure 3.5)
Dictionary: arytenoid (ar·y·te·noid)
  Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid
  Function: adjective
  Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, "ladle"
  Date: circa 1751
  1. relating to or being either of two small laryngeal cartilages to which the vocal cords are attached
  2. relating to or being either of a pair of small muscles or an unpaired muscle of the larynx
  arytenoid (noun)

Anatomy and Physiology of Speech Production: Flanagan et al. model

Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately like that of Figure 3.6.
  Closed phase: the folds are closed and no flow occurs
  Open phase: the folds are open and the flow increases up to a maximum
  Return phase: the time interval from the maximum airflow until the glottal closure
The specific flow shape can change with
  Speaker
  Speaking style
  Specific speech sound
The glottal airflow is referred to as the glottal flow.
The time duration of one glottal cycle is referred to as the pitch period.
The reciprocal of the pitch period is referred to as the pitch, or the fundamental frequency.

Anatomy and Physiology of Speech Production

Example 3.1
Consider a glottal flow waveform model of the form
$u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:
$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$
Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as
$u[n, \tau] = w[n, \tau] (g[n] * p[n])$
Using the Multiplication and Convolution Theorems of Chapter 2, the following frequency-domain expression is obtained:
$$U(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$

Example 3.1
$$U(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k) \delta(\omega - \omega_k) \right]$$
$$U(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k) W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, $\circledast$ denotes convolution in frequency, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.
Effect of the harmonics of the glottal waveform on the spectrum

Figure 3.7

Example 3.1
A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
  The first harmonic is also the fundamental frequency.
  At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
The magnitude of the spectral shaping function, i.e., of the glottal flow transform, $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.

Anatomy and Physiology of Speech Production
The Fourier transform of a periodic glottal waveform is characterized by harmonics.
  Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  The rolloff depends on the nature of the airflow and on speaker characteristics.
  See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  The pitch period can vary "randomly" over successive periods: pitch "jitter"
  The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer"
These variations are (perhaps) due to
  Time-varying characteristics of the vocal tract and vocal folds
  Nonlinear behavior in the speech anatomy, or
  An underlying deterministic (chaotic) system whose behavior appears random
Jitter and shimmer are one component that gives vowels their naturalness.
  In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
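Jitter and shimmer can be quantified in several ways; the slides do not give a formula. The sketch below assumes one common convention, the mean absolute difference between consecutive cycles divided by the mean value, applied to pitch periods for jitter and to per-cycle peak amplitudes for shimmer. The measurements listed are made-up illustrative values and the function name is hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical per-cycle measurements from a sustained vowel.
periods_ms = [10.0, 10.2, 9.9, 10.1, 10.0]     # pitch period of each glottal cycle
amplitudes = [1.00, 0.95, 1.02, 0.98, 1.00]    # peak amplitude of each glottal cycle

jitter = relative_perturbation(periods_ms)      # period-to-period variation
shimmer = relative_perturbation(amplitudes)     # amplitude variation across cycles
```

A perfectly periodic, fixed-amplitude source gives zero for both measures, which corresponds to the machine-like sound mentioned above.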

Anatomy and Physiology of Speech Production
States of the vocal folds:
  Breathing
  Voicing
  Unvoicing
    Turbulence at the vocal folds: aspiration
    Example: "he" (whispered sounds)
Aspiration also occurs with voiced sounds (breathy voice).
  Part of the vocal folds vibrates and part is nearly fixed.

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has
    High pitch
    Irregular pitch
  Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with
    Abnormally low pitch
    Irregular pitch
    Secondary glottal pulses close to and overlapping the primary glottal pulse
    (a result of coupling of the false vocal folds with the true vocal folds)
  Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16)

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract
Comprises the oral cavity
  From the larynx
  To the lips
  Including the nasal passage, which is coupled to the oral tract by way of the velum
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators:
  Tongue
  Teeth
  Lips
  Jaw
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
The purpose of the vocal tract is to
  Spectrally "color" the source, and
  Generate new sources for sound production

Spectral Shaping
Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants (resonance frequencies) change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10
April 22 2023 Veton Keumlpuska 29
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant. The frequency of the formant is $\omega_0$, and its bandwidth depends on the distance of the pole from the unit circle (that is, on $r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form, with $N_i$ complex-conjugate pole pairs $c_k, c_k^{*}$:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
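The relation between a pole and its formant can be checked numerically. The following is a minimal sketch rather than anything taken from the slides: the sampling rate `fs`, the helper names `pole_to_formant` and `all_pole_response`, the example pole, and the bandwidth approximation B = -ln(r0) fs / pi are all assumptions introduced for illustration.

```python
import numpy as np

def pole_to_formant(pole, fs):
    """Return (frequency_hz, bandwidth_hz) for one complex pole inside the unit circle."""
    r0 = abs(pole)
    w0 = np.angle(pole)                   # pole angle in radians per sample
    freq_hz = w0 * fs / (2 * np.pi)       # formant frequency
    bw_hz = -np.log(r0) * fs / np.pi      # assumed 3 dB bandwidth approximation
    return freq_hz, bw_hz

def all_pole_response(poles, fs, n_freqs=512, gain=1.0):
    """Evaluate |H| for H(z) = A / prod_k (1 - c_k z^-1)(1 - c_k* z^-1) on [0, fs/2)."""
    w = np.linspace(0, np.pi, n_freqs, endpoint=False)
    z_inv = np.exp(-1j * w)
    denom = np.ones_like(z_inv)
    for c in poles:
        denom = denom * (1 - c * z_inv) * (1 - np.conj(c) * z_inv)
    return w * fs / (2 * np.pi), np.abs(gain / denom)

fs = 8000.0                                         # assumed sampling rate in Hz
pole = 0.97 * np.exp(1j * 2 * np.pi * 500 / fs)     # hypothetical pole near 500 Hz
freq_hz, bw_hz = pole_to_formant(pole, fs)
freqs, mag = all_pole_response([pole], fs)
peak_hz = freqs[np.argmax(mag)]                     # spectral peak lies close to the formant
```

Moving the pole closer to the unit circle (larger r0) narrows the bandwidth and sharpens the spectral peak, which matches the statement that bandwidth depends on the distance from the unit circle.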
Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, and so on.
In general, the formant frequencies decrease as the vocal tract length increases: male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
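The dependence of formants on vocal tract length can be illustrated with a rough calculation. The sketch below assumes a model that the slides do not state: a lossless uniform tube, closed at the glottis and open at the lips, whose resonances are the quarter-wavelength frequencies F_k = (2k - 1) c / (4 L). The function name and the speed of sound `c = 343` m/s are likewise assumptions.

```python
def uniform_tube_formants(length_m, n_formants=3, c=343.0):
    """Quarter-wavelength resonances of an idealized uniform tube (assumed model)."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n_formants + 1)]

male = uniform_tube_formants(0.17)      # 17 cm tract, as quoted for an adult male
shorter = uniform_tube_formants(0.14)   # hypothetical shorter tract
```

Under these assumptions a 17 cm tube resonates near 500, 1500, and 2500 Hz, and every resonance of the shorter tube is higher, consistent with the trend described above.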
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau] \{ h[n] * (g[n] * p[n]) \}$$
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.
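Example 3.2
$$X(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$
$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k) W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\circledast$ denotes (circular) convolution in frequency, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).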
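The result of Example 3.2 can be reproduced numerically. The following is a sketch under stated assumptions, not the author's own code: the glottal pulse shape `g` (n a^n), the single-formant vocal tract `h`, the sampling rate, the pitch period, and the window position are all hypothetical choices made so that the example runs.

```python
import numpy as np

fs = 8000            # assumed sampling rate in Hz
P = 80               # pitch period in samples, so the pitch is fs / P = 100 Hz
n_periods = 30

# p[n]: unit sample train with spacing P
p = np.zeros(n_periods * P)
p[::P] = 1.0

# g[n]: illustrative one-cycle glottal pulse (assumed shape n * a^n)
n = np.arange(200)
g = n * 0.9 ** n

# h[n]: impulse response of a hypothetical single-formant all-pole vocal tract
r0, f1 = 0.97, 500.0
w0 = 2 * np.pi * f1 / fs
m = np.arange(400)
h = r0 ** m * np.sin((m + 1) * w0) / np.sin(w0)

# x[n] = h[n] * (g[n] * p[n])
x = np.convolve(h, np.convolve(g, p))

# Windowed segment x[n, tau], taken well inside the steady-state region
Nw = 400
tau = 15 * P
segment = x[tau:tau + Nw] * np.hamming(Nw)

nfft = 4000                                      # 2 Hz per bin
X = np.abs(np.fft.rfft(segment, nfft))
freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)

# Bins of the first five harmonics omega_k = (2 pi / P) k
harmonic_bins = [int(round(k * (fs / P) / (fs / nfft))) for k in range(1, 6)]
```

Plotting `X` against `freqs` should show narrow window-transform lobes centered at multiples of 100 Hz, with heights that follow the envelope |H G|; halfway between harmonics the magnitude is much smaller.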
Example 3.2
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency.
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (the fundamental frequency $\omega_1$) is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "nose"
Spectral Shaping: "mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive): pressure builds up behind the closure at the palate and is then abruptly released.
Source Generation: plosives, example "drop" (figure annotations: VOT, aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but does not completely impede the airflow), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source). There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: fricatives, example "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these subcategories do not relate exclusively to unvoiced sources: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the same subcategories also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example, the Hamming window:
$$w[n, \tau] = 0.54 - 0.46 \cos\left[ \frac{2\pi (n - \tau)}{N_w - 1} \right] \quad \text{for } 0 \le n \le N_w - 1$$
The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau] x[n]$$
represents the windowed speech segments as a function of the window center at time $\tau$.
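The windowing and transform steps above can be sketched in a few lines. This is an illustrative sketch, not the slides' own implementation: the function names, the test tone, the sampling rate, and the choice to index each frame from its own start (which changes only the phase of the transform, not its magnitude) are assumptions.

```python
import numpy as np

def hamming(Nw):
    """Hamming window 0.54 - 0.46 cos(2 pi n / (Nw - 1)) for 0 <= n <= Nw - 1."""
    n = np.arange(Nw)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (Nw - 1))

def stft(x, Nw, hop, nfft=None):
    """Rows are window positions 0, hop, 2*hop, ...; columns are frequency bins."""
    nfft = nfft or Nw
    w = hamming(Nw)
    starts = range(0, len(x) - Nw + 1, hop)          # hop is the frame interval
    return np.array([np.fft.rfft(w * x[s:s + Nw], nfft) for s in starts])

fs = 8000                                   # assumed sampling rate in Hz
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)            # hypothetical 1 kHz test tone
X = stft(x, Nw=160, hop=80)                 # 20 ms window, 10 ms frame interval
S = np.abs(X) ** 2                          # squared magnitude per frame
```

With a 160-sample window at 8 kHz each bin spans 50 Hz, so the tone shows up in bin 20 of every frame.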
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position $\tau$, one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \} = w[n, \tau] ( p[n] * \tilde{h}[n] )$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k) W(\omega - \omega_k, \tau) \right|^2$$
where $\tilde{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2 \, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved", appearing as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive that is closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
Widening the Fourier transform causes the neighboring window transforms, one centered at each harmonic, to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta \, |\tilde{H}(\omega)|^2 \, E[\tau]$$
where $\beta$ is a constant scale factor and where $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
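The energy term E[tau] can be computed directly from its definition. The sketch below is illustrative only: the decaying periodic sequence `x` is an assumed stand-in for a voiced waveform, and the function name, window lengths, and hop size are hypothetical.

```python
import numpy as np

def short_time_energy(x, Nw, hop):
    """E[tau] = sum_n |w[n, tau] x[n]|^2 for window positions tau = 0, hop, 2*hop, ..."""
    w = np.hamming(Nw)
    return np.array([np.sum((w * x[s:s + Nw]) ** 2)
                     for s in range(0, len(x) - Nw + 1, hop)])

fs, P = 8000, 80                      # assumed sampling rate and pitch period
n = np.arange(fs)
x = 0.9 ** (n % P)                    # assumed stand-in: a pulse that decays within each period

E_short = short_time_energy(x, Nw=32, hop=4)     # 4 ms window, shorter than one pitch period
E_long = short_time_energy(x, Nw=160, hop=4)     # 20 ms window, two pitch periods

ripple_short = E_short.max() / E_short.min()
ripple_long = E_long.max() / E_long.min()
```

With the short window the energy swings strongly once per pitch period, while the long window averages over two periods and stays nearly constant.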
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
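The narrowband and wideband behavior can be reproduced with the two window lengths quoted above. This is a sketch under simplifying assumptions: the input is a bare impulse train rather than real speech (so there is no vocal tract envelope to trace out), and the function name, sampling rate, pitch period, and hop size are hypothetical.

```python
import numpy as np

def spectrogram(x, Nw, hop):
    """Squared-magnitude STFT with a Hamming window; rows are frames, columns are bins."""
    w = np.hamming(Nw)
    frames = [w * x[s:s + Nw] for s in range(0, len(x) - Nw + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

fs, P = 8000, 80                 # assumed: 8 kHz sampling, 100 Hz pitch
x = np.zeros(fs)
x[::P] = 1.0                     # periodic impulse train standing in for voiced speech

S_narrow = spectrogram(x, Nw=160, hop=8)    # 20 ms window: two pitch periods
S_wide = spectrogram(x, Nw=32, hop=8)       # 4 ms window: less than one pitch period
```

In `S_narrow` (50 Hz per bin) every even bin, which falls on a harmonic of 100 Hz, exceeds its odd neighbor in every frame, giving horizontal striations. In `S_wide` each frame is either empty or flat across frequency, so the structure is purely temporal, giving vertical striations.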
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram; axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000)]
MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with word labels and phone labels in the order: h, sh iy (she), hv ae dcl jh (had), ax-h (your), dcl d aa r kcl k (dark), s ux tcl (suit), ax-h n (in), gcl g r iy s ix (greasy), w aa sh (wash), w ao dx axr (water), ao l (all), y ih axr (year), h]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets. Syllables contain one or more phonemes, words are formed from one or more syllables, and phrases are concatenations of words.
If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two perspectives (waveform and spectrogram) are used, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram: dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract; the constriction is narrower than with vowels.
System: the location of the constriction made by the tongue (at the back, center, or front of the oral tract) or at the teeth or lips determines which sound is produced.
Spectrogram: noise-like, with energy concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of a voiced fricative, the output at the lips due to the periodic source is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n] u[n])$$
Example 3.4
We assume in this simplified model that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored: u[n] is modified by the oral cavity; $x_q[n]$ can be influenced by the back cavity; and there are sources of nonlinear effects (distributed sources due to traveling vortices).
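The simplified voiced-fricative model of Example 3.4 can be simulated directly. The sketch below is hedged accordingly: the glottal pulse `g` (a Hann-shaped open phase), the two resonator responses `h` and `hf`, their frequencies, the sampling rate, and the random seed are all assumptions chosen only to make the model concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, P = 8000, 80                     # assumed sampling rate and pitch period
N = 20 * P

p = np.zeros(N)
p[::P] = 1.0                         # impulse train with pitch period P
g = np.hanning(41)                   # assumed glottal pulse; zero outside the open phase
u = np.convolve(g, p)[:N]            # periodic glottal flow u[n] = g[n] * p[n]

def resonator(freq_hz, r, length=200):
    """Impulse response of a two-pole resonance (hypothetical cavity model)."""
    w0 = 2 * np.pi * freq_hz / fs
    m = np.arange(length)
    return r ** m * np.sin((m + 1) * w0) / np.sin(w0)

h = resonator(500.0, 0.97)           # whole vocal tract, excited by u[n]
hf = resonator(3000.0, 0.90)         # front cavity, excited by the modulated noise

q = rng.standard_normal(N)           # white noise source at the constriction
x_g = np.convolve(h, u)[:N]          # periodic component
x_q = np.convolve(hf, q * u)[:N]     # noise component, modulated by the glottal flow
x = x_g + x_q                        # the two outputs are assumed to add
```

Because the noise is multiplied by u[n], the excitation `q * u` vanishes during the closed phase of each cycle, so the noise arrives in bursts at the pitch rate on top of the periodic component.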
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar". After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, waveform
Elements of a Language: Plosives, spectrogram
Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.
h[n, m] and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m] \, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \, \delta[n - m]$$
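The generalized convolution of Example 3.5 can be written as a direct double loop. This is only a sketch: the function name, the storage of h[n, m] as an (N, M) array with finite memory M, the formant glide, the front-cavity response, and the use of a bare impulse train for u[n] (that is, g[n] taken as a unit sample) are all assumptions made for illustration.

```python
import numpy as np

def time_varying_filter(h, u):
    """Compute y[n] = sum_m h[n, m] * u[n - m] for h stored as an (N, M) array."""
    N, M = h.shape
    y = np.zeros(N)
    for n in range(N):
        for m in range(min(M, n + 1)):       # causal input: u[n - m] = 0 for n - m < 0
            y[n] += h[n, m] * u[n - m]
    return y

fs, N, M, P = 8000, 400, 60, 80              # assumed rates and lengths
n_idx = np.arange(N)[:, None]
m_idx = np.arange(M)[None, :]

# Hypothetical vocal tract whose single formant glides from 1500 Hz down to 500 Hz
formant_hz = 1500.0 - 1000.0 * n_idx / (N - 1)
h = 0.95 ** m_idx * np.cos(2 * np.pi * formant_hz / fs * m_idx)

# Hypothetical front-cavity response, kept fixed in n for simplicity
hf = np.tile(0.8 ** m_idx * np.cos(2 * np.pi * 3000.0 / fs * m_idx), (N, 1))

u = np.zeros(N)
u[::P] = 1.0                                 # periodic source (g[n] assumed to be a unit sample)
burst = np.zeros(N)
burst[0] = 1.0                               # idealized burst delta[n] at n = 0

x = time_varying_filter(h, u) + time_varying_filter(hf, burst)
```

When h[n, m] does not depend on n, the double loop reduces to ordinary convolution, which gives a convenient sanity check on the implementation.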
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds, glides (w as in "we" and y as in "you") and liquids (r as in "read" and l as in "let").
Glides: compared to diphthongs, glides have a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.
Figure 3.28 - Liquids and Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced) and J as in "just" (voiced).
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g., "horse" vs. "horseshoe", "sweep" vs. "seep".
Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude or energy (loudness), and timing (articulation rate or rhythm).
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
Breathing: the arytenoid cartilages are held outward.
Voiced: the arytenoid cartilages are held close together.
Unvoiced: the arytenoid cartilages are held outward or partially closed.
The complex motion of the vocal folds is illustrated in Figure 3.4.
Nonlinear two-mass model of Flanagan et al. (Figure 3.5).
Dictionary: arytenoid (ar·y·te·noid). Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally ladle-shaped, from arytaina, ladle. Date: circa 1751. 1: relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. 2: relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Also: arytenoid, noun.
Anatomy and Physiology of Speech Production: Flanagan et al. model
Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:
Closed phase: the folds are closed and no flow occurs.
Open phase: the folds are open and the flow increases up to a maximum.
Return phase: the time interval from the maximum airflow until glottal closure.
The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.
The airflow velocity at the glottis is referred to as the glottal flow.
The time duration of one glottal cycle is referred to as the pitch period.
The reciprocal of the pitch period is referred to as the pitch, or the fundamental frequency.
Example 3.1
Consider a glottal flow waveform model of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P:
$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$
Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as
$$u[n, \tau] = w[n, \tau] \, (g[n] * p[n])$$
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:
$$U(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$
Example 3.1
$$U(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$
$$U(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k) W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of g[n], $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.
Effect of the harmonics of the glottal waveform on the spectrum
Figure 3.7
Example 3.1
A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
The magnitude of the spectral shaping function, i.e., of the glottal flow transform $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
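The inverse relation between pitch period and harmonic spacing is easy to tabulate. The following sketch converts a pitch period in samples to harmonic frequencies in Hz; the function name, the sampling rate, and the two example pitch periods are assumptions for illustration.

```python
import numpy as np

def harmonic_frequencies_hz(pitch_period_samples, fs, n_harmonics):
    """Harmonics k * fs / P in Hz, the counterpart of omega_k = (2 pi / P) k."""
    f0 = fs / pitch_period_samples           # pitch = reciprocal of the pitch period
    return f0 * np.arange(1, n_harmonics + 1)

fs = 8000                                             # assumed sampling rate in Hz
low_pitch = harmonic_frequencies_hz(80, fs, 5)        # P = 80 samples
high_pitch = harmonic_frequencies_hz(40, fs, 5)       # P = 40 samples
```

Halving the pitch period doubles both the fundamental frequency and the spacing between harmonics, so the same spectral envelope is sampled at half as many points.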
Anatomy and Physiology of Speech Production
- The Fourier transform of a periodic glottal waveform is characterized by harmonics.
- Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
- The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:
  - The pitch period can vary "randomly" over successive periods: pitch "jitter".
  - The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
- These variations are (perhaps) due to:
  - time-varying characteristics of the vocal tract and vocal folds,
  - nonlinear behavior in the speech anatomy, or
  - behavior that appears random while being the result of an underlying deterministic (chaotic) system.
- Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
- Voice character is determined by the extent of jitter and shimmer in the voice (e.g., hoarse voice).
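Jitter and shimmer can be illustrated by perturbing an otherwise ideal impulse train. This is only a sketch: the uniform random perturbation, the function name, and the 2% and 5% depths are assumptions chosen for illustration, not a model taken from the slides.

```python
import random

def excitation_pulses(num_periods, nominal_period, jitter=0.0, shimmer=0.0, seed=0):
    """Return (sample_index, amplitude) pairs for a glottal pulse train.

    jitter  : maximum relative deviation of each pitch period (assumed uniform)
    shimmer : maximum relative deviation of each pulse amplitude (assumed uniform)
    """
    rng = random.Random(seed)
    pulses = []
    position = 0
    for _ in range(num_periods):
        amplitude = 1.0 + shimmer * rng.uniform(-1.0, 1.0)
        pulses.append((position, amplitude))
        period = nominal_period * (1.0 + jitter * rng.uniform(-1.0, 1.0))
        position += max(1, round(period))
    return pulses

# Ideal (machine-like) train versus a slightly perturbed one.
ideal = excitation_pulses(5, nominal_period=80)
natural = excitation_pulses(5, nominal_period=80, jitter=0.02, shimmer=0.05)
```

With both parameters at zero the train reduces to the ideal periodic model of Example 3.1.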
Anatomy and Physiology of Speech Production: States of Vocal Folds
- Breathing
- Voicing
- Unvoicing: turbulence at the vocal folds (aspiration). Example: "he"; whispered sounds.
- Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while another part remains nearly fixed.
Anatomy and Physiology of Speech Production: Other Forms of Atypical Vocal Fold Movement
- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses within the closed phase (see Figure 3.9b and Figure 3.16).
Examples of atypical voice types
Vocal Tract
- Comprises the oral cavity, from the larynx to the lips, together with the nasal passage, which is coupled to the oral tract by way of the velum.
- The oral tract takes on many different lengths and cross-sections by moving the articulators: tongue, teeth, lips, and jaw.
- For an adult male, the average length is 17 cm, with a cross-sectional area of up to 20 cm^2.
- The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.
Spectral Shaping
- Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
- The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
- Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
- The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
- Consider a time-invariant, all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  - The frequency of the formant is $\omega_0$.
  - The bandwidth depends on the distance of the pole from the unit circle (set by $r_0$).
- Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:

\[ H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} \]
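To make the pole-formant relation concrete, the sketch below converts a pole location into a formant frequency and an approximate bandwidth. The bandwidth expression $B \approx -(f_s/\pi)\ln r_0$ is a standard narrow-bandwidth approximation that the slides do not state, and the sample rate and pole radii are hypothetical values.

```python
import cmath
import math

def pole_to_formant(pole, sample_rate_hz):
    """Map a pole r0*exp(j*w0) to (formant frequency, approximate 3-dB bandwidth) in Hz."""
    radius, angle = cmath.polar(pole)
    frequency_hz = angle * sample_rate_hz / (2.0 * math.pi)
    # Assumed approximation: bandwidth grows as the pole moves away from the unit circle.
    bandwidth_hz = -math.log(radius) * sample_rate_hz / math.pi
    return frequency_hz, bandwidth_hz

def resonance_magnitude(pole, omega):
    """|H| at frequency omega for one complex-conjugate pole pair with unit gain A = 1."""
    z_inv = cmath.exp(-1j * omega)
    return 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))

SAMPLE_RATE_HZ = 8000.0                                # hypothetical sampling rate
omega_0 = 2.0 * math.pi * 500.0 / SAMPLE_RATE_HZ       # formant placed at 500 Hz
near_pole = 0.97 * cmath.exp(1j * omega_0)             # close to the unit circle
far_pole = 0.90 * cmath.exp(1j * omega_0)              # farther from the unit circle
near_freq, near_bw = pole_to_formant(near_pole, SAMPLE_RATE_HZ)
far_freq, far_bw = pole_to_formant(far_pole, SAMPLE_RATE_HZ)
```

Both poles give the same formant frequency, while the pole nearer the unit circle yields the narrower resonance.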
Spectral Shaping
- Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
- In general, formant frequencies decrease as the vocal tract length increases:
  - Male speakers tend to have lower formants than female speakers.
  - Female speakers tend to have lower formants than children.
- Under the assumption of vocal tract linearity and time invariance, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input with the vocal tract impulse response.
Vowels
Example 3.2
Consider a periodic glottal flow source of the form

\[ u[n] = g[n] * p[n] \]

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

\[ x[n] = h[n] * (g[n] * p[n]) \]

A window centered at time $\tau$, $w[n,\tau]$, is applied to the vocal tract output to obtain the speech segment

\[ x[n,\tau] = w[n,\tau]\,\{\, h[n] * (g[n] * p[n]) \,\} \]

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained:

\[ X(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) \circledast \Bigl[\, \sum_{k=-\infty}^{\infty} H(\omega)G(\omega)\,\delta(\omega - \omega_k) \Bigr] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\,G(\omega_k)\, W(\omega - \omega_k, \tau) \]

where $W(\omega,\tau)$ is the Fourier transform of $w[n,\tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch. Figure 3.11 illustrates that the spectral shaping of the window transforms at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only a glottal contribution).
Example 3.2 (Figure 3.11)
Example 3.2 (cont.)
- The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
  - the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  - the manner in which the formant tails add.
- Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega,\tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
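The sparse sampling of the spectral envelope by the harmonics can be sketched with a single-resonance envelope. Everything specific here is an assumption for illustration: the 8 kHz sampling rate, the pole radius, the 500 Hz formant, and the two pitch values; the glottal contribution is ignored.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000.0   # hypothetical sampling rate

def envelope_magnitude(freq_hz, formant_hz, radius=0.97):
    """Magnitude of a single conjugate-pole-pair resonance evaluated at freq_hz."""
    omega = 2.0 * math.pi * freq_hz / SAMPLE_RATE_HZ
    pole = radius * cmath.exp(1j * 2.0 * math.pi * formant_hz / SAMPLE_RATE_HZ)
    z_inv = cmath.exp(-1j * omega)
    return 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))

def sample_envelope(pitch_hz, formant_hz, max_hz=4000.0):
    """Sample the envelope at the harmonics k*pitch_hz up to max_hz."""
    samples = []
    k = 1
    while k * pitch_hz <= max_hz:
        freq = k * pitch_hz
        samples.append((freq, envelope_magnitude(freq, formant_hz)))
        k += 1
    return samples

def strongest_harmonic(pitch_hz, formant_hz):
    """Frequency of the harmonic at which the sampled envelope is largest."""
    return max(sample_envelope(pitch_hz, formant_hz), key=lambda pair: pair[1])[0]

low_pitch_peak = strongest_harmonic(100.0, 500.0)    # dense harmonic sampling
high_pitch_peak = strongest_harmonic(400.0, 500.0)   # sparse harmonic sampling
```

With a 100 Hz pitch a harmonic lands on the formant, whereas with a 400 Hz pitch the nearest harmonic is 100 Hz away, so the apparent spectral peak no longer marks the formant location.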
Spectral Shaping
- The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  - A formant corresponds to a vocal tract pole (resonant frequency).
  - Harmonics arise from the periodicity of the glottal source (pitch).
- In signal processing algorithms that require formants, the sparse sampling of the spectral envelope by the harmonics can be a detriment to formant estimation.
- On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
- A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
- To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
Figure 3.12
Nasal Sounds
Spectral Shaping
- The nasal and oral components of the vocal tract are coupled by the velum.
- When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
- The resulting sounds have a spectrum dominated by the low-frequency formants of the large-volume nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "nose"
Spectral Shaping: "mouse"
Spectral Shaping
- Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
- The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
- Two dominant effects characterize nasalization:
  - broadening of the formant bandwidths of the oral tract because of energy loss through the nasal passage, and
  - introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
- The previous section discussed the effect of the vocal tract shape on sound production.
- Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive): pressure builds up behind the closure and is then abruptly released.
Source Generation: Plosives, "drop" (figure annotations: VOT, aspiration)
Fricatives
Source Generation
- Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).
- As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  - There is no harmonic structure with these types of inputs.
  - The source spectrum is shaped at all frequencies by $|H(\omega)|$.
- Note that the noise source was idealized as having a flat spectrum. In reality, these sources themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
- Another class of source is generated within the vocal tract, but it is less well understood than the noisy and impulsive sources at oral tract constrictions.
- This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  - Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  - Plosives: created with an impulsive source within the oral tract. Example: "top".
  - Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
- However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
  - Fricatives: "zebra" vs. "sheba".
  - Plosives: "bin" vs. "pin".
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
- The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
- A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
- Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.
- The window $w[n,\tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
- Example: the Hamming window

\[ w[n,\tau] = 0.54 - 0.46\cos\!\left[\frac{2\pi(n-\tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1 \]

- The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
- The short-time Fourier transform is

\[ X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n} \]

where $x[n,\tau] = w[n,\tau]\,x[n]$ represents the windowed speech segment as a function of the window position $\tau$.
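The Hamming window and one frame of the short-time Fourier transform can be sketched directly from the definitions above. The function names, the 64-sample window, the test tone at $\omega = \pi/4$, and the 64-point frequency grid are illustrative assumptions; a practical implementation would use an FFT rather than this direct sum.

```python
import cmath
import math

def hamming_window(length):
    """w[m] = 0.54 - 0.46*cos(2*pi*m/(N_w - 1)) for m = 0 .. N_w - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (length - 1)) for m in range(length)]

def stft_frame(signal, tau, window, omega):
    """X(omega, tau) = sum_n w[n - tau] * x[n] * exp(-j*omega*n), window starting at tau."""
    total = 0j
    for m, w in enumerate(window):
        n = tau + m
        if 0 <= n < len(signal):
            total += w * signal[n] * cmath.exp(-1j * omega * n)
    return total

window = hamming_window(64)
tone = [math.cos(math.pi / 4.0 * n) for n in range(256)]   # test tone at omega = pi/4
# Evaluate the frame at omega_k = pi*k/64 for k = 0 .. 63 (0 up to just below pi).
magnitudes = [abs(stft_frame(tone, 32, window, math.pi * k / 64.0)) for k in range(64)]
peak_bin = magnitudes.index(max(magnitudes))
```

The largest magnitude falls at the bin corresponding to $\omega = \pi/4$; squaring the magnitudes of successive frames gives the columns of a spectrogram.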
Spectrographic Analysis of Speech
- The spectrogram is the graphical display of

\[ S(\omega,\tau) = |X(\omega,\tau)|^2 \]

- $S(\omega,\tau)$ is a two-dimensional representation of the "energy density" of the signal.
- For each window position $\tau$, one could plot $S(\omega,\tau)$. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_k \delta[n - kP]$:

\[ x[n,\tau] = w[n,\tau]\,\{\, (p[n] * g[n]) * h[n] \,\} = w[n,\tau]\,\bigl( p[n] * \hat{h}[n] \bigr) \]

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

\[ S(\omega,\tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2 \]

where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.
Spectrographic Analysis of Speech
- The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n,\tau]$.
- Narrowband spectrogram:
  - Uses a "long" window with a duration of typically at least two pitch periods.
  - Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the expression for $S(\omega,\tau)$ on the previous slide leads to the following approximation (Exercise 3.8):

\[ S(\omega,\tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \bigl|\hat{H}(\omega_k)\bigr|^2 \bigl|W(\omega - \omega_k, \tau)\bigr|^2 \]
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
- The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
- For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

\[ S(\omega,\tau) \approx \beta\, \bigl|\hat{H}(\omega)\bigr|^2 E[\tau] \]

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,

\[ E[\tau] = \sum_{n=-\infty}^{\infty} \bigl|x[n,\tau]\bigr|^2 \]
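The energy under a short sliding window fluctuates with the pitch period, which is what produces vertical striations in a wideband spectrogram. The sketch below uses a rectangular window and a toy waveform (a decaying response restarted every pitch period); both are assumptions made for illustration.

```python
def short_time_energy(signal, tau, window_length):
    """E[tau]: sum of squared samples under a rectangular window starting at tau."""
    segment = signal[tau:tau + window_length]
    return sum(sample * sample for sample in segment)

PITCH_PERIOD = 80      # hypothetical pitch period, in samples
WINDOW_LENGTH = 20     # shorter than one pitch period (wideband case)

# Toy voiced waveform: a decaying response restarted every pitch period.
waveform = [0.9 ** (n % PITCH_PERIOD) for n in range(4 * PITCH_PERIOD)]
energy = [short_time_energy(waveform, tau, WINDOW_LENGTH)
          for tau in range(len(waveform) - WINDOW_LENGTH)]
```

The energy is largest when the window sits on a pitch pulse, drops between pulses, and repeats every pitch period.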
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
- Figure 3.15 compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
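Whether a given window resolves the harmonics can be estimated from its main-lobe width. The sketch below uses the common approximation that a Hamming window of $N$ samples has a main lobe about $8\pi/N$ radians wide (that is, $4 f_s / N$ Hz); this approximation, the 16 kHz sampling rate, and the 250 Hz pitch are assumptions, not values from the slides.

```python
def window_length_samples(duration_ms, sample_rate_hz):
    """Number of samples in a window of the given duration."""
    return round(sample_rate_hz * duration_ms / 1000.0)

def hamming_main_lobe_width_hz(num_samples, sample_rate_hz):
    """Approximate full main-lobe width of a Hamming window: 8*pi/N rad, i.e. 4*fs/N Hz."""
    return 4.0 * sample_rate_hz / num_samples

def resolves_harmonics(duration_ms, pitch_hz, sample_rate_hz):
    """True when main lobes centered on adjacent harmonics do not overlap."""
    num_samples = window_length_samples(duration_ms, sample_rate_hz)
    return hamming_main_lobe_width_hz(num_samples, sample_rate_hz) <= pitch_hz

SAMPLE_RATE_HZ = 16000.0   # hypothetical sampling rate
PITCH_HZ = 250.0           # hypothetical fundamental frequency

narrowband_width = hamming_main_lobe_width_hz(
    window_length_samples(20, SAMPLE_RATE_HZ), SAMPLE_RATE_HZ)
wideband_width = hamming_main_lobe_width_hz(
    window_length_samples(4, SAMPLE_RATE_HZ), SAMPLE_RATE_HZ)
narrowband_resolves = resolves_harmonics(20, PITCH_HZ, SAMPLE_RATE_HZ)
wideband_resolves = resolves_harmonics(4, PITCH_HZ, SAMPLE_RATE_HZ)
```

Under these assumptions the 20-ms window has a main lobe narrower than the harmonic spacing, while the 4-ms window spans several harmonics and smears them into the spectral envelope.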
Figure 3.15
Figure 3.16
MATLAB
[Figure: SPECTROGRAM. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]
[Figure: waveform (Normalized Magnitude vs. Time [s]) above its SPECTROGRAM (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3), with word and phone labels for the utterance "She had your dark suit in greasy wash water all year". Phone labels by word: she: sh iy; had: hv ae dcl jh; your: ax-h; dark: dcl d aa r kcl k; suit: s ux tcl; in: ax-h n; greasy: gcl g r iy s ix; wash: w aa sh; water: w ao dx axr; all: ao l; year: y ih axr.]
Categorization of Speech Sounds
- A sound source can be created with either the vocal folds or a constriction in the vocal tract.
- Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract, described by the place and manner of articulation: the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
- Phoneme: a fundamental distinctive unit of a language.
  - To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
- Different languages contain different phoneme sets:
  - Syllables contain one or more phonemes.
  - Words are formed from one or more syllables.
  - Phrases are concatenations of words.
- If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
- If the last two descriptors (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.
Elements of a Language
- One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
- This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
- Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open.
  - Tongue position and height: front, central, or back along the palate.
  - Constriction: partial or complete.
  - Velum state: nasal sound or not a nasal sound.
Elements of a Language
- In English, the combinations of features give about 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
- The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are normally not allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
- The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
- Motor theory of perception: uses articulatory features derived from the speech waveform, along with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
- Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
- In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences between speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram:
  - Is dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  - The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
- There are two broad classes of fricatives: voiced and unvoiced.
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
- System: the location of the constriction made by the tongue (at the back, center, or front of the oral tract) or by the teeth or lips determines which sound is produced.
- Spectrogram: noise-like, with energy concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as

\[ u[n] = g[n] * p[n] \]

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

\[ x_g[n] = h[n] * (g[n] * p[n]) \]

where $h[n]$ is the impulse response of the linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component, i.e., the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ and assumed to be white noise. The glottal flow $u[n]$ modulates the noise $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

\[ x_q[n] = h_f[n] * (q[n]\,u[n]) \]

Example 3.4 (cont.)
In the simplified model, the outputs due to the two airflow sources are assumed to add:

\[ x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n]) \]

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
- $u[n]$ is modified by the oral cavity.
- $x_q[n]$ can be influenced by the back cavity.
- Sources of nonlinear effects (distributed sources due to traveling vortices).
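The simplified voiced-fricative model of Example 3.4 can be sketched as follows. The glottal pulse shape, the impulse responses, the Gaussian white noise, and all function names are hypothetical choices for illustration only.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_val in enumerate(a):
        for j, b_val in enumerate(b):
            out[i + j] += a_val * b_val
    return out

def impulse_train(length, period):
    """p[n]: unit samples spaced `period` samples apart."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def voiced_fricative(g, h, h_f, period, length, seed=0):
    """Return (x, x_g, x_q) with x = h*u + h_f*(q.u), where u = g*p and * is convolution."""
    rng = random.Random(seed)
    u = convolve(g, impulse_train(length, period))[:length]   # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]          # white noise (assumed Gaussian)
    modulated = [q_n * u_n for q_n, u_n in zip(q, u)]          # glottal flow modulates the noise
    x_g = convolve(h, u)[:length]
    x_q = convolve(h_f, modulated)[:length]
    x = [a + b for a, b in zip(x_g, x_q)]
    return x, x_g, x_q

# Hypothetical pulse shape and impulse responses.
x, x_g, x_q = voiced_fricative(g=[0.0, 0.5, 1.0, 0.5], h=[1.0, 0.6, 0.3],
                               h_f=[1.0, -0.5], period=8, length=32)
```

Because the noise is multiplied by $u[n]$, the noise component vanishes wherever the glottal flow is zero, so it is pulsed at the pitch rate.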
Elements of a Language: Fricatives
- Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
- Waveform: an unvoiced fricative contains only noise, whereas a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
- Whisper forms a class of its own under the general category of consonants.
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24: Fricatives
Figure 3.23
Elements of a Language: Plosives
- Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
- As with fricatives, plosives can be voiced or unvoiced.
- System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
- Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
- With voiced plosives, the vocal folds vibrate for the duration of all four steps:
  - While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  - After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration.
  - The delay between the burst and the voicing of the vowel onset is much shorter.
- Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive
- A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

\[ u[n] = g[n] * p[n] \]

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
- Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during the transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
- In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n,m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response $h_f[n,m]$.
- $h[n,m]$ and $h_f[n,m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
- Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:

\[ x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m] \]
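The generalized convolution of Example 3.5 can be sketched by representing each time-varying impulse response as a function of $(n, m)$. The sketch assumes causal responses and excitations that start at $n = 0$, so the sum runs over $m = 0, \ldots, n$; the function names and the example responses are hypothetical.

```python
def time_varying_filter(excitation, impulse_response):
    """x[n] = sum_m h(n, m) * e[n - m], with h supplied as a function h(n, m).

    Assumes a causal response and an excitation that starts at n = 0.
    """
    output = []
    for n in range(len(excitation)):
        output.append(sum(impulse_response(n, m) * excitation[n - m]
                          for m in range(n + 1)))
    return output

def voiced_plosive(glottal_flow, h, h_f):
    """Add the vocal tract response to u[n] and the front-cavity response to a burst at n = 0."""
    burst = [1.0] + [0.0] * (len(glottal_flow) - 1)    # idealized burst delta[n]
    voiced = time_varying_filter(glottal_flow, h)
    released = time_varying_filter(burst, h_f)
    return [a + b for a, b in zip(voiced, released)]

# Hypothetical response whose decay rate drifts with time n.
drifting = time_varying_filter([1.0, 0.0, 0.0, 0.0], lambda n, m: (0.5 + 0.1 * n) ** m)
```

When the response does not depend on $n$, the sum reduces to ordinary convolution, which gives a simple check on the implementation.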
Elements of a Language: Transitional Speech Sounds
- Diphthongs:
  - Have a vowel-like nature, with vibrating vocal folds.
  - Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  - Are characterized by movement from one vowel target to another.
  - Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
Elements of a Language: Transitional Speech Sounds
- Semi-vowels: two categories of vowel-like sounds:
  - Glides (w as in "we" and y as in "you"), and
  - Liquids (r as in "read" and l as in "let").
- Compared with diphthongs, glides have a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.
Figure 3.28: Liquids & Glides
Elements of a Language: Transitional Speech Sounds
- Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
- The difference from fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
- Examples: tS as in "chew" (unvoiced) and J as in "just" (voiced).
Coarticulation
- The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
- Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
- Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
- Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  - Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  - Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
- The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  - intonation (change in pitch),
  - amplitude/energy (loudness), and
  - timing (articulation rate or rhythm).
- These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29: Prosody
Figure 3.30: Global Coarticulation
Anatomy and Physiology of Speech Production
Flanagan et al. model
Anatomy and Physiology of Speech Production
- If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6:
  - Closed phase: the folds are closed and no flow occurs.
  - Open phase: the folds are open and the flow increases up to a maximum.
  - Return phase: the time interval from the maximum airflow until glottal closure.
- The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.
- The glottal airflow is referred to as the glottal flow.
- The time duration of one glottal cycle is referred to as the pitch period.
- The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
Example 3.1
Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$ as illustrated in Figure 3.7, and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.
Effect of the harmonics of the glottal waveform on the spectrum
Figure 3.7
Example 3.1
- A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
- The first harmonic is also the fundamental frequency.
- At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
The magnitude of the spectral shaping function, i.e., of the glottal flow transform $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
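The harmonic structure described in Example 3.1 can be checked numerically. The following is a minimal NumPy sketch and is not part of the original slides: the raised-cosine pulse used for `g`, the pitch period `P`, and the Hamming window spanning several periods are all illustrative assumptions.

```python
import numpy as np

def windowed_glottal_spectrum(P=80, n_periods=8, n_fft=None):
    """Magnitude spectrum of a Hamming-windowed periodic flow u[n] = g[n] * p[n]."""
    # Assumed single-cycle pulse g[n]: a raised cosine over half the period.
    half = P // 2
    g = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(half) / half))
    n_total = P * n_periods
    p = np.zeros(n_total)
    p[::P] = 1.0                          # impulse train with spacing P
    u = np.convolve(p, g)[:n_total]       # periodic glottal flow
    w = np.hamming(n_total)               # analysis window
    n_fft = n_fft or n_total
    return np.abs(np.fft.rfft(w * u, n_fft))

def harmonic_bins(P, n_fft, count):
    """DFT bins closest to the harmonic frequencies omega_k = (2*pi/P)*k."""
    return [round(k * n_fft / P) for k in range(1, count + 1)]
```

With `P = 80` and a 640-point transform, the harmonics fall on bins 8, 16, 24, and so on, and the magnitude between harmonics is set by the window side lobes rather than by the pulse shape.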
Anatomy and Physiology of Speech Production
The Fourier transform of a periodic glottal waveform is characterized by harmonics.
- Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
- The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
- The pitch period can vary "randomly" over successive periods: pitch "jitter".
- The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
Those variations are (perhaps) due to:
- time-varying characteristics of the vocal tract and vocal folds,
- nonlinear behavior in the speech anatomy, or
- an underlying deterministic (chaotic) system whose output appears random.
Jitter and shimmer are one component that gives vowels their naturalness.
- In contrast, a monotone pitch and a fixed amplitude result in a machine-like sound.
- Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
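The slides do not give formulas for jitter and shimmer. As a rough illustration, the sketch below uses one common "local" measure (mean absolute difference between consecutive cycles, relative to the mean); this definition and the function names are assumptions, not the author's method.

```python
import numpy as np

def jitter_percent(periods):
    """Mean absolute difference between consecutive pitch periods,
    relative to the mean period, in percent."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer_percent(amplitudes):
    """Mean absolute difference between consecutive cycle peak amplitudes,
    relative to the mean amplitude, in percent."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
```

A perfectly periodic source with fixed amplitude gives 0 for both measures, which corresponds to the machine-like sound mentioned above.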
Anatomy and Physiology of Speech Production
States of the vocal folds:
- Breathing
- Voicing
- Unvoicing
  - Turbulence at the vocal folds: aspiration. Example: "he"; whispered sounds.
  - Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while another part is nearly fixed.
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch.
  - Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse.
  - Result of coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Examples of atypical voice types
Vocal Tract
The vocal tract comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators:
- tongue
- teeth
- lips
- jaw
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
The purpose of the vocal tract is to:
- spectrally "color" the source, and
- generate new sources for sound production.
Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
- the frequency of the formant is $\omega_0$;
- the bandwidth depends on the distance of the pole from the unit circle ($r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
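To make the link between pole position and formant concrete, here is a small sketch that maps a pole $z_0 = r_0 e^{j\omega_0}$ to a center frequency and an approximate bandwidth in Hz. The bandwidth formula $B \approx -\ln(r_0)\, f_s / \pi$ is the usual narrow-band approximation for a pole close to the unit circle; it is not stated on the slides, and the sampling rate `fs` and function names are assumptions.

```python
import math

def pole_to_formant(r0, omega0, fs):
    """Map a pole z0 = r0*exp(j*omega0) to (center frequency in Hz,
    approximate 3-dB bandwidth in Hz)."""
    if not 0.0 < r0 < 1.0:
        raise ValueError("a stable resonance needs 0 < r0 < 1")
    freq_hz = omega0 * fs / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * fs / math.pi   # narrow-band approximation
    return freq_hz, bandwidth_hz

def formant_to_pole(freq_hz, bandwidth_hz, fs):
    """Inverse mapping under the same approximation."""
    r0 = math.exp(-math.pi * bandwidth_hz / fs)
    omega0 = 2.0 * math.pi * freq_hz / fs
    return r0, omega0
```

Moving the pole toward the unit circle (larger `r0`) narrows the bandwidth, which matches the statement that bandwidth depends on the distance from the unit circle.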
Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
- male speakers tend to have lower formants than female speakers;
- female speakers tend to have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Vowels
Example 3.2
Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained:

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} H(\omega) G(\omega)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.
Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which had only a glottal contribution).
Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
- the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
- the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
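The idea that the harmonics sample the envelope $H(\omega)G(\omega)$ at $\omega_k = (2\pi/P)k$ can be verified with a short NumPy sketch. This is an illustration under stated assumptions, not the slides' own computation: the raised-cosine pulse, the two-pole resonator used for `h`, and all parameter values are hypothetical. The sketch takes the DFT of exactly one steady-state period of the output, so no analysis window and no $1/P$ factor appear.

```python
import numpy as np

def dtft(x, omegas):
    """DTFT of a finite sequence x[n], n = 0..len(x)-1, at the given frequencies."""
    n = np.arange(len(x))
    return np.array([np.sum(x * np.exp(-1j * w * n)) for w in omegas])

def harmonic_samples(P=50, r0=0.9, omega0=0.3 * np.pi, n_periods=40):
    """Compare the DFT of one steady-state period of x[n] = h[n] * g[n] * p[n]
    with samples of H(omega) G(omega) taken at the harmonic frequencies."""
    # Assumed glottal pulse: raised cosine over 60% of the period.
    L = int(0.6 * P)
    g = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(L) / L))
    # Assumed vocal tract: impulse response of a single two-pole resonator,
    # truncated to 400 samples.
    n = np.arange(400)
    h = r0 ** n * np.sin((n + 1) * omega0) / np.sin(omega0)
    p = np.zeros(P * n_periods)
    p[::P] = 1.0                                  # unit sample train
    x = np.convolve(np.convolve(p, g), h)         # vocal tract output
    start = P * (n_periods - 2)                   # well inside the steady state
    X_period = np.fft.fft(x[start:start + P])
    omegas = 2.0 * np.pi * np.arange(P) / P       # harmonic frequencies omega_k
    envelope = dtft(h, omegas) * dtft(g, omegas)
    return X_period, envelope
```

The two returned arrays agree to numerical precision: with a pitch period of `P` samples, only `P` samples of the envelope are available per period, which is why a higher pitch (smaller `P`) reveals less of the formant structure.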
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
- a formant corresponds to a vocal tract pole (resonant frequency);
- harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency, $\omega_1$) is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "nose"
Spectral Shaping: "mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
- broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage;
- introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
- build-up of pressure behind the palate, and
- abrupt release of the pressure.
Source Generation: plosives, "drop" (figure labels: VOT, aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow); the constriction generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source):
- there is no harmonic structure with these types of inputs;
- the source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the spectrum of the noise was idealized by assuming it to be flat. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
- This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source.
There is a variety of unvoiced sounds:
- Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
- Plosives: created with an impulsive source within the oral tract. Example: "top".
- Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
- "zebra" vs. "sheba": fricatives
- "bin" vs. "pin": plosives
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Consider the example of the word "to":
- A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
- This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
- Example: the Hamming window,

$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$

- The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
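The sliding-window analysis can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the code behind the slides' figures: the function names, the `hop` parameter (the frame interval in samples), and the choice to reference each frame's phase to the start of its own segment are assumptions. The last choice changes only the phase, not the magnitude, relative to the definition above.

```python
import numpy as np

def stft(x, win_len, hop, n_fft=None):
    """Short-time Fourier transform with a Hamming window.

    Returns an array of shape (n_frames, n_fft // 2 + 1); row i is the
    transform of the windowed segment that starts at sample i * hop.
    Requires len(x) >= win_len.
    """
    x = np.asarray(x, dtype=float)
    n_fft = n_fft or win_len
    w = np.hamming(win_len)    # 0.54 - 0.46*cos(2*pi*n/(win_len - 1))
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * w for i in range(n_frames)])
    return np.fft.rfft(frames, n_fft, axis=1)

def spectrogram(x, win_len, hop, n_fft=None):
    """Squared STFT magnitude, one row per window position."""
    return np.abs(stft(x, win_len, hop, n_fft)) ** 2
```

For example, a 1000 Hz tone sampled at 8000 Hz and analyzed with a 256-sample window peaks at bin 32 in every frame, since each bin is 8000/256 = 31.25 Hz wide.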
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
- For each window position $\tau$, one could plot $S(\omega, \tau)$.
- A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
- Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
- Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response have been combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
- Uses a "long" window with a duration of typically at least two pitch periods.
- Under the conditions that the main lobes of the shifted window Fourier transforms do not overlap and that the corresponding transform side lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
Figure 3.15
Figure 3.16
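The contrast between the 20-ms and 4-ms Hamming windows can be reproduced on an idealized source. The sketch below is an assumption-laden illustration, not the computation behind Figure 3.15: it uses a bare impulse train with an assumed 200 Hz pitch at an assumed 8 kHz sampling rate, so that the long window spans four pitch periods and the short window spans less than one, and it centers the window on a pulse.

```python
import numpy as np

def harmonic_ripple_db(win_ms, pitch_hz=200.0, fs=8000, n_fft=800):
    """Peak-to-valley ratio (dB) of the spectrum of a windowed impulse train.

    A large value means the harmonics are resolved (narrowband behaviour);
    a value near 0 dB means the window sees a single pulse (wideband behaviour).
    """
    P = int(round(fs / pitch_hz))             # pitch period in samples
    win_len = int(round(win_ms * 1e-3 * fs))  # window length in samples
    x = np.zeros(20 * P)
    x[::P] = 1.0                              # idealized periodic source
    center = 10 * P                           # centre the window on a pulse
    start = center - win_len // 2
    seg = x[start:start + win_len] * np.hamming(win_len)
    mag = np.abs(np.fft.rfft(seg, n_fft))
    return 20.0 * np.log10(mag.max() / max(mag.min(), 1e-12))
```

With the 20-ms window the spectrum shows deep valleys between harmonics (horizontal striations in a spectrogram); with the 4-ms window the magnitude is flat in frequency, and the pitch shows up instead as energy that rises and falls as the window slides in time.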
MATLAB
Figure: spectrogram (frequency 0 to 8000 Hz versus time 0 to 3 s).
MATLAB
Figure: waveform (normalized magnitude versus time) and spectrogram (frequency 0 to 8000 Hz versus time 0 to 3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with time-aligned phone and word labels.
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   - the place of the tongue hump along the oral tract, and
   - the degree of constriction of the hump.
   The shape is also determined by the possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
- To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
- Different languages contain different phoneme sets.
- Syllables contain one or more phonemes.
- Words are formed from one or more syllables.
- Phrases are concatenations of words.
If the first two descriptors (the nature of the source and the shape of the vocal tract) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (the time-domain waveform and the time-varying spectral characteristics) are used, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of:
- vowels
- consonants
- diphthongs
- affricates
- semi-vowels
This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
- Vocal fold state: vibrating or open
- Tongue position and height: front, central, or back along the palate
- Constriction: partial or complete
- Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
- 11 in Polynesian
- 141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
- adjacent phonemes,
- speaking rate,
- emphasis in speaking, and
- the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source
- Quasi-periodic.
- Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
System
- Each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram
- The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform
- Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
- Articulatory differences among speakers are one cause of allophonic variations.
- The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source
- Quasi-periodic airflow puffs from the vibrating vocal folds.
System
- The velum is lowered and the air flows mainly through the nasal cavity.
- Because the oral tract is constricted, the sound is radiated at the nostrils.
- Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram
- Is dominated by the low resonance of the large volume of the nasal cavity.
- The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
- These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
- Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source
- For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
- For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
- Noise is generated by turbulent airflow at some point of constriction along the oral tract.
- The constriction is narrower than with vowels.
System
- The location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram
- Noise-like.
- Energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
In a simplified model of the voiced fricative, the glottal flow component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component, the turbulent airflow velocity at the constriction, is denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

In this simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
- $u[n]$ is modified by the oral cavity.
- $x_q[n]$ can be influenced by the back cavity.
- Sources of nonlinear effects (distributed sources due to traveling vortices).
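The simplified voiced-fricative model of Example 3.4 translates directly into a few lines of NumPy. The following is only a sketch: the function name, the Gaussian choice for the white noise, `noise_gain`, and the truncation of every signal to a fixed length are assumptions made for illustration.

```python
import numpy as np

def voiced_fricative(g, P, h, hf, n_periods=20, noise_gain=0.3, seed=0):
    """x[n] = h[n] * u[n] + hf[n] * (q[n] u[n]), with u[n] = g[n] * p[n].

    Returns (x, x_g, x_q, u), each truncated to P * n_periods samples.
    """
    rng = np.random.default_rng(seed)
    n_total = P * n_periods
    p = np.zeros(n_total)
    p[::P] = 1.0                                    # impulse train with pitch period P
    u = np.convolve(p, g)[:n_total]                 # periodic glottal flow
    q = noise_gain * rng.standard_normal(n_total)   # white noise (assumed Gaussian)
    x_g = np.convolve(u, h)[:n_total]               # periodic component
    x_q = np.convolve(q * u, hf)[:n_total]          # noise modulated by the glottal flow
    return x_g + x_q, x_g, x_q, u
```

Because the noise is multiplied by `u`, the noise source is silent whenever the glottis is closed (`u == 0`), so the noise component pulses at the pitch rate.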
Elements of a Language: Fricatives
Spectrogram
- Unvoiced fricatives are characterized by a "noisy" spectrum.
- Voiced fricatives often show both noise and harmonics.
Waveform
- An unvoiced fricative contains only noise.
- A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
- Forms a class of its own under the general category of consonants.
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System
- The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
- While the oral tract is closed, we hear a low-frequency vibration due to the propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
- After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration.
- There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (waveform)
Elements of a Language: Plosives (spectrogram)
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
- This implies that the vocal tract output cannot be obtained with the convolution operator.
- The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.
- $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
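The generalized convolution in Example 3.5 can be written as a direct double loop. The sketch below assumes a causal system with finite memory, stored as a 2-D array in which row `n` holds $h[n, m]$ for $m = 0, \ldots, M-1$; this storage layout and the function names are illustrative assumptions.

```python
import numpy as np

def time_varying_output(h, u):
    """x[n] = sum_m h[n, m] u[n - m] for a causal, finite-memory system.

    h has shape (N, M): row n holds the response at time n to a unit
    sample applied m samples earlier (m = 0..M-1). u has length N.
    """
    h = np.asarray(h, dtype=float)
    u = np.asarray(u, dtype=float)
    N, M = h.shape
    x = np.zeros(N)
    for n in range(N):
        for m in range(min(M, n + 1)):   # u[n - m] is taken as 0 for n - m < 0
            x[n] += h[n, m] * u[n - m]
    return x

def voiced_plosive(h, hf, u):
    """Sum of the glottal-flow response and the burst response,
    with the burst idealized as a unit sample at n = 0."""
    burst = np.zeros(len(u))
    burst[0] = 1.0
    return time_varying_output(h, u) + time_varying_output(hf, burst)
```

When every row of `h` is the same, the sum reduces to ordinary convolution, which gives a quick sanity check; for a unit-sample burst the output at time `n` is simply `hf[n, n]`.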
Elements of a Language: Transitional Speech Sounds
Diphthongs
- Have a vowel-like nature, with vibrating vocal folds.
- Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
- Are characterized by movement from one vowel target to another.
Examples (word, symbol): hide /Y/, out /W/, boy /O/, new /JU/
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
- Glides (/w/ as in "we" and /y/ as in "you")
- Liquids (/r/ as in "read" and /l/ as in "let")
Glides, compared to diphthongs, have:
- greater constriction of the oral tract during the transition, and
- greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
- /tS/ as in "chew" (unvoiced)
- /J/ as in "just" (voiced)
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
- Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
- Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, and so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
- Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
- Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that define changes in speech extending over more than one phoneme:
- Intonation (change in pitch)
- Amplitude/energy (loudness)
- Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation

Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:
  Closed phase: the folds are closed and no flow occurs.
  Open phase: the folds are open and the flow increases up to a maximum.
  Return phase: the time interval from the maximum airflow until glottal closure.
The specific flow shape can change with
  the speaker,
  the speaking style, and
  the specific speech sound.
The airflow at the glottis is referred to as the glottal flow.
The time duration of one glottal cycle is referred to as the pitch period.
The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
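The relation between pitch period and pitch can be checked numerically. The sketch below assumes the pitch period is measured in samples at a known sampling rate; the function names and the example values (80 samples at 8000 Hz) are illustrative only and do not come from the slides.

```python
import math


def pitch_hz(period_samples: float, fs_hz: float) -> float:
    """Pitch (fundamental frequency) in Hz for a pitch period given in samples."""
    if period_samples <= 0:
        raise ValueError("pitch period must be positive")
    return fs_hz / period_samples


def fundamental_rad(period_samples: float) -> float:
    """Fundamental frequency 2*pi/P in radians per sample for a period of P samples."""
    return 2.0 * math.pi / period_samples


if __name__ == "__main__":
    # Hypothetical values: a period of 80 samples at an 8 kHz sampling rate.
    print(pitch_hz(80, 8000.0), fundamental_rad(80))
```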

Example 3.1
Consider a glottal flow waveform model of the form
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:
$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$
Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as
$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$
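
Example 3.1 (cont.)
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = \frac{2\pi}{P}k$, where $\frac{2\pi}{P}$ is the fundamental frequency or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.
Figure 3.7 also shows the effect of the harmonics of the glottal waveform on the spectrum.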
Figure 3.7

Example 3.1 (cont.)
A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = \frac{2\pi}{P}k$.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
The magnitude of the spectral shaping function, i.e., of the glottal flow, $|G(\omega_k)|$, is referred to as the spectral envelope of the harmonics.
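The harmonic structure of Example 3.1 can be reproduced with a small numerical experiment. The sketch below is only an illustration under stated assumptions: the four-sample glottal pulse, the period of 32 samples, the 256-sample Hamming window and all function names are hypothetical choices, and a direct DFT stands in for the Fourier transform. With these values the harmonics fall on DFT bins 8, 16, 24, and so on.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def glottal_train(g, period, length):
    """u[n] = g[n] * p[n]: one copy of the pulse g every `period` samples."""
    u = [0.0] * length
    for start in range(0, length, period):
        for i, g_i in enumerate(g):
            if start + i < length:
                u[start + i] += g_i
    return u


def dft_magnitude(x):
    """Magnitude of the DFT of x for bins 0 .. len(x)//2 (direct O(N^2) sum)."""
    n_fft = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft)))
        for k in range(n_fft // 2 + 1)
    ]


def windowed_harmonic_spectrum(g, period, length):
    """|U| of the windowed segment w[n] * (g[n] * p[n])."""
    u = glottal_train(g, period, length)
    w = hamming(length)
    return dft_magnitude([w_n * u_n for w_n, u_n in zip(w, u)])


if __name__ == "__main__":
    mag = windowed_harmonic_spectrum([1.0, 0.8, 0.5, 0.2], period=32, length=256)
    # Bins 8 and 16 are harmonics; bins 4 and 12 lie between harmonics.
    print([round(mag[k], 3) for k in (4, 8, 12, 16)])
```

Halving the period in this sketch doubles the bin spacing between the peaks, which is the inverse relation between pitch period and harmonic spacing stated above.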

Anatomy and Physiology of Speech Production
The Fourier transform of the periodic glottal waveform is characterized by harmonics.
  Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  The pitch period can vary “randomly” over successive periods: pitch “jitter”.
  The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude “shimmer”.
These variations are (perhaps) due to
  time-varying characteristics of the vocal tract and vocal folds,
  nonlinear behavior in the speech anatomy, or
  an underlying deterministic (chaotic) system whose output appears random.
Jitter and shimmer are one component that gives vowels their naturalness.
  In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
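Jitter and shimmer can be quantified from a sequence of measured pitch periods and per-cycle peak amplitudes. The slides do not give a formula, so the sketch below assumes one common definition: the mean absolute difference between consecutive cycles, divided by the mean value. The function name and the sample numbers are hypothetical.

```python
def local_variation(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value.

    Applied to pitch periods this gives a jitter measure; applied to
    per-cycle peak amplitudes it gives a shimmer measure (assumed definition).
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_abs_diff / mean_value


if __name__ == "__main__":
    periods = [100, 102, 98, 100]        # hypothetical pitch periods in samples
    amplitudes = [1.00, 0.95, 1.05, 1.00]  # hypothetical per-cycle peak amplitudes
    print("jitter:", local_variation(periods))
    print("shimmer:", local_variation(amplitudes))
```

A perfectly periodic, constant-amplitude source gives zero for both measures, which corresponds to the machine-like sound described above.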

Anatomy and Physiology of Speech Production
States of the vocal folds:
  Breathing
  Voicing
  Unvoicing: turbulence at the vocal folds (aspiration). Example: “he”; whispered sounds.
Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while the rest is nearly fixed.

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has
    a high pitch and
    an irregular pitch.
  Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with
    an abnormally low pitch and
    an irregular pitch.
    It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, and is the result of coupling of the false vocal folds with the true vocal folds.
  Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Examples of atypical voice types

Vocal Tract
Comprises the oral cavity,
  from the larynx
  to the lips,
  including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators:
  tongue, teeth, lips, jaw.
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
The purpose of the vocal tract is to
  spectrally “color” the source and
  generate new sources for sound production.

Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants (resonance frequencies) change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10

Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  The frequency of the formant is $\omega_0$.
  The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
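The link between a pole and a formant can be made concrete. The sketch below converts a pole $z_0 = r_0 e^{j\omega_0}$ into a formant frequency and an approximate 3-dB bandwidth in Hz. The bandwidth expression $B \approx -\ln(r_0)\, f_s / \pi$ is a standard approximation for poles close to the unit circle and is not stated on the slide; the sampling rate and the pole values are hypothetical.

```python
import cmath
import math


def pole_to_formant(pole: complex, fs_hz: float):
    """Return (formant frequency in Hz, approximate bandwidth in Hz) for one pole.

    Assumes a stable pole (0 < |pole| < 1). The bandwidth uses the
    near-unit-circle approximation B = -ln(r0) * fs / pi.
    """
    r0 = abs(pole)
    if not 0.0 < r0 < 1.0:
        raise ValueError("pole must lie strictly inside the unit circle")
    omega0 = abs(cmath.phase(pole))          # pole angle in radians per sample
    frequency_hz = omega0 * fs_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * fs_hz / math.pi
    return frequency_hz, bandwidth_hz


if __name__ == "__main__":
    fs = 8000.0  # hypothetical sampling rate
    for radius in (0.95, 0.99):
        pole = cmath.rect(radius, 2.0 * math.pi * 500.0 / fs)
        print(radius, pole_to_formant(pole, fs))
```

Moving the pole toward the unit circle leaves the formant frequency unchanged and narrows the bandwidth, which is the dependence on $r_0$ described above.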

Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis,
  the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
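The inverse relation between vocal tract length and formant frequency can be illustrated with an idealized model. The sketch below assumes a uniform lossless tube, closed at the glottis and open at the lips, whose resonances are the odd quarter-wavelength frequencies $F_k = (2k-1)c/(4L)$. This tube model and the speed of sound of 340 m/s are assumptions brought in for illustration; they are not taken from the slides.

```python
def uniform_tube_formants(length_m: float, count: int = 3, speed_of_sound: float = 340.0):
    """Resonances (Hz) of an idealized uniform tube closed at one end.

    F_k = (2k - 1) * c / (4 * L), k = 1 .. count (assumed model).
    """
    if length_m <= 0:
        raise ValueError("tube length must be positive")
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m) for k in range(1, count + 1)]


if __name__ == "__main__":
    print("17 cm tract:", uniform_tube_formants(0.17))
    print("14 cm tract:", uniform_tube_formants(0.14))  # hypothetical shorter tract
```

In this idealization every formant scales as 1/L, so a shorter tract shifts all resonances upward.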
Vowels

Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained.
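
Example 3.2 (cont.)
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).
Example 3.2 (Figure 3.11)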

Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and
  the manner in which the formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
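The sparse sampling of the spectral envelope at high pitch is easy to quantify. The sketch below counts how many harmonics fall below a band limit and finds the harmonic nearest to a formant; the 4 kHz band limit, the 500 Hz formant and the two pitch values are hypothetical numbers chosen for illustration.

```python
def harmonics_below(f0_hz, f_max_hz):
    """Number of harmonics k * f0 (k >= 1) that lie at or below f_max."""
    return int(f_max_hz // f0_hz)


def nearest_harmonic(f0_hz, formant_hz):
    """Frequency of the harmonic (k >= 1) closest to the given formant."""
    k = max(1, round(formant_hz / f0_hz))
    return k * f0_hz


if __name__ == "__main__":
    for f0 in (100, 400):  # hypothetical low-pitched and high-pitched voices
        print(f0, harmonics_below(f0, 4000), nearest_harmonic(f0, 500))
```

With a 100 Hz pitch the envelope is sampled 40 times below 4 kHz and one harmonic lands on the 500 Hz formant; with a 400 Hz pitch there are only 10 samples and the nearest harmonic misses the formant by 100 Hz.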

Spectral Shaping
The previous example is important because it illustrates the difference between
  a formant (a resonance frequency of the vocal tract) and
  a harmonic frequency.
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be detrimental to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
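The benefit of moving the first formant onto the first harmonic can be sketched with a single-resonance all-pole model. Everything numerical here is an assumption for illustration: a two-pole resonator stands in for the first formant, the pole radius is derived from an assumed 80 Hz bandwidth via $r = e^{-\pi B / f_s}$, and the pitch, formant and sampling-rate values are hypothetical.

```python
import cmath
import math


def resonator_gain(freq_hz, formant_hz, bandwidth_hz, fs_hz):
    """Magnitude response at freq_hz of 1 / ((1 - c z^-1)(1 - c* z^-1))."""
    radius = math.exp(-math.pi * bandwidth_hz / fs_hz)
    c = cmath.rect(radius, 2.0 * math.pi * formant_hz / fs_hz)
    z_inv = cmath.exp(-1j * 2.0 * math.pi * freq_hz / fs_hz)
    return 1.0 / abs((1.0 - c * z_inv) * (1.0 - c.conjugate() * z_inv))


if __name__ == "__main__":
    fs, f0 = 16000.0, 800.0  # hypothetical sampling rate and sung fundamental
    unmatched = resonator_gain(f0, formant_hz=300.0, bandwidth_hz=80.0, fs_hz=fs)
    matched = resonator_gain(f0, formant_hz=800.0, bandwidth_hz=80.0, fs_hz=fs)
    print("gain gain at first harmonic, F1 = 300 Hz:", unmatched)
    print("gain at first harmonic, F1 = 800 Hz:", matched)
```

In this toy model the gain at the first harmonic rises sharply once the formant is tuned to it, which mirrors the louder sound described above.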
Figure 3.12

Nasal Sounds

Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips,
  sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds.
  Examples: “nose” and “mouse”
Spectral Shaping: “Nose”
Spectral Shaping: “Mouse”

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
There are two dominant effects that characterize nasalization:
  broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage, and
  introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation
In the previous section, the effect of the vocal tract shape on sound production was discussed.
In Figure 3.10(b), a complete closure of the tract (the tongue pressing against the palate) is depicted. This closure is required when making an impulsive sound (plosive):
  build-up of pressure behind the palate and
  abrupt release of pressure.
Source Generation: Plosives, “Drop” (figure annotations: VOT, aspiration)

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow), which generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source):
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, “NASA”

Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.

Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”
  Plosives: created with an impulsive source within the oral tract. Example: “top”
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”
However, these source types do not relate exclusively to unvoiced sounds. That is, the vocal folds can be vibrating simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
  “zebra” vs. “sheba” (fricatives)
  “bin” vs. “pin” (plosives)
Categorization of Sound By Source
Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word “to”
  A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: the Hamming window
  $$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n \le N_w - 1$$
The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The Fourier transform of the windowed speech waveform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau]\, x[n]$$
represents the windowed speech segments as a function of the window center at time $\tau$.
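A sliding-window analysis of this kind can be sketched in a few lines. The code below is a minimal illustration, not the course's implementation: it indexes each frame by its start sample rather than its center, uses a direct DFT instead of an FFT, and returns the squared magnitude of each windowed frame. The function names, window length, hop size and test tone are hypothetical.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window: 0.54 - 0.46 cos(2 pi n / (N_w - 1)), 0 <= n <= N_w - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def spectrogram(x, n_w, hop, n_fft):
    """Squared-magnitude short-time spectra, one row per window position.

    Each row holds |X|^2 for DFT bins 0 .. n_fft // 2 of the segment
    w[n] * x[start + n]; the window advances `hop` samples per frame.
    """
    w = hamming(n_w)
    frames = []
    for start in range(0, len(x) - n_w + 1, hop):
        segment = [w[n] * x[start + n] for n in range(n_w)]
        row = [
            abs(sum(segment[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_w))) ** 2
            for k in range(n_fft // 2 + 1)
        ]
        frames.append(row)
    return frames


if __name__ == "__main__":
    # Hypothetical test signal: a cosine centered on DFT bin 8 of a 64-point analysis.
    tone = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
    s = spectrogram(tone, n_w=64, hop=32, n_fft=64)
    print(len(s), [max(range(len(row)), key=row.__getitem__) for row in s])
```

The hop size plays the role of the frame interval: a smaller hop gives more rows per second at a proportionally higher computational cost.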

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the “energy density” of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$.
  A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  This display is illustrated (as a caricature) in Figure 3.14.
Figure 3.14 also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a “long” window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping and
    the corresponding transform side lobes are negligible,
  the following approximation to the spectrogram expression of the previous slide holds (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  Widening of the window transform causes the transforms at neighboring harmonics to overlap and add, thus smearing out the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
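The energy under a short sliding window fluctuates with the pitch period, whereas a long window averages the fluctuation out. The sketch below illustrates this with a synthetic periodic waveform; the decaying-exponential cycle shape, the period of 40 samples, the window lengths of 16 and 120 samples and the function names are all hypothetical choices.

```python
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def short_time_energy(x, n_w):
    """Energy of the windowed segment for every window start position."""
    w = hamming(n_w)
    return [
        sum((w[n] * x[start + n]) ** 2 for n in range(n_w))
        for start in range(len(x) - n_w + 1)
    ]


if __name__ == "__main__":
    period = 40
    # Hypothetical steady voiced sound: a decaying pulse repeated every pitch period.
    x = [0.9 ** (n % period) for n in range(400)]
    e_short = short_time_energy(x, 16)    # shorter than one pitch period
    e_long = short_time_energy(x, 120)    # three pitch periods
    print("short window max/min:", max(e_short) / min(e_short))
    print("long window max/min:", max(e_long) / min(e_long))
```

The large max/min ratio for the short window is the energy fluctuation that shows up as pitch-synchronous structure in a wideband display.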

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
The wideband spectrogram
  shows the formants of the vocal tract in frequency, and also
  gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 (next slide) compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
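The two window durations can be related to the pitch with a short calculation. The sketch below classifies a window as narrowband when it spans at least two pitch periods and as wideband when it spans less than one, following the durations quoted in these slides. The main-lobe width uses the usual Hamming approximation of $8\pi/N$ radians (about $4/T$ Hz for a window of duration $T$), which is an added assumption, as are the 10 kHz sampling rate and the 200 Hz pitch.

```python
def describe_window(duration_s, fs_hz, pitch_hz):
    """Summarize an analysis window relative to the pitch of the voice."""
    n_samples = round(duration_s * fs_hz)
    periods_covered = duration_s * pitch_hz
    mainlobe_hz = 4.0 / duration_s  # approximate null-to-null Hamming main lobe
    if periods_covered >= 2.0:
        kind = "narrowband"
    elif periods_covered < 1.0:
        kind = "wideband"
    else:
        kind = "intermediate"
    return {
        "samples": n_samples,
        "periods": periods_covered,
        "mainlobe_hz": mainlobe_hz,
        "kind": kind,
    }


if __name__ == "__main__":
    for duration in (0.020, 0.004):  # the 20-ms and 4-ms windows of Figure 3.15
        print(describe_window(duration, fs_hz=10000.0, pitch_hz=200.0))
```

For the hypothetical 200 Hz voice, the 20-ms window has a main lobe no wider than the harmonic spacing, while the 4-ms window's main lobe spans several harmonics.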
Figure 3.15
Figure 3.16
MATLAB
Figure (MATLAB): spectrogram of the utterance “She had your dark suit in greasy wash water all year”. Axes: time [s], 0 to 3; frequency [Hz], 0 to 8000.
Figure (MATLAB): waveform (normalized magnitude vs. time [s]) and spectrogram (frequency [Hz], 0 to 8000, vs. time [s]) of the same utterance, with word and phone labels: h; she (sh iy); had (hv ae dcl jh); your (ax-h); dark (dcl d aa r kcl k); suit (s ux tcl); in (ax-h n); greasy (gcl g r iy s ix); wash (w aa sh); water (w ao dx axr); all (ao l); year (y ih axr); h.

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Classification of speech sounds can also be done from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
     the place of the tongue hump along the oral tract and
     the degree of the constriction of the hump.
     The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
    Syllables contain one or more phonemes.
    Words are formed from one or more syllables.
    Phrases are concatenations of words.
If the first two of the four descriptors (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (the waveform and the spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is done in terms of
  vowels,
  consonants,
  diphthongs,
  affricates, and
  semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the “click” language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not normally allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: “butter”, “but”, and “to”, where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
Vowels
  Source: quasi-periodic.
    Pitch is not important for categorizing a sound in English; however, in Mandarin Chinese some sounds are interpreted based on the pitch (tonal languages).
  System: each vowel phoneme corresponds to a different vocal tract configuration.
  Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations:
    the place and degree of constriction of the tongue hump, and
    the cross-section and length of the vocal tract.
  Therefore the vocal tract formants will vary with the speaker.
Figure 3.18
Figure 3.19

Elements of a Language: Nasals
Nasals
  Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  System:
    The velum is lowered and the air flows mainly through the nasal cavity.
    Because the oral tract is constricted, the sound is radiated at the nostrils.
    Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  Spectrogram:
    It is dominated by the low resonance of the large volume of the nasal cavity.
    The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
    The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract.
    Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
  Source:
    For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
    For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
    Noise is generated by turbulent airflow at some point of constriction along the oral tract.
    The constriction is narrower than with vowels.
  System: the location of the constriction made by the tongue or lips determines which sound is produced:
    the back, center, or front of the oral tract, as well as
    the teeth or lips.
  Spectrogram:
    Noise-like
    Energy is concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
  $g[n]$ is the glottal flow over one cycle.
  $p[n]$ is an impulse train with pitch period $P$.
Simplified model of the voiced fricative output at the lips due to the periodic source:
$$x_g[n] = h[n] * (g[n] * p[n])$$
  $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is modeled through the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, which has impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4 (cont.)
In the simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
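The simplified voiced fricative model can be simulated directly. The sketch below is an illustration under stated assumptions: the glottal pulse, the two impulse responses, the pitch period and the Gaussian white-noise generator (seeded for repeatability) are hypothetical, and all sequences are truncated to a common length.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def voiced_fricative(g, period, n_periods, h, h_f, seed=0):
    """Return (x, x_g, x_q) with x = h * u + h_f * (q u) and u = g * p."""
    length = period * n_periods
    p = [1.0 if n % period == 0 else 0.0 for n in range(length)]
    u = convolve(g, p)[:length]                      # periodic glottal flow
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]  # white noise at the constriction
    x_g = convolve(h, u)[:length]                     # periodic component
    modulated = [q[n] * u[n] for n in range(length)]  # noise modulated by glottal flow
    x_q = convolve(h_f, modulated)[:length]           # noise component
    x = [a + b for a, b in zip(x_g, x_q)]
    return x, x_g, x_q


if __name__ == "__main__":
    x, x_g, x_q = voiced_fricative(
        g=[0.5, 1.0, 0.5], period=8, n_periods=4, h=[1.0, 0.5], h_f=[1.0]
    )
    print([round(v, 3) for v in x[:8]])
```

Because the noise is multiplied by $u[n]$, the noise component in this sketch vanishes wherever the glottal flow is zero, so the noise bursts are synchronized with the pitch.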

Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a “noisy” spectrum, while
  voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24: Fricatives
Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
  System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short time duration
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.
  After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the onset of voicing of the vowel.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
  $g[n]$ is the glottal flow over one cycle.
  $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator.
  The vocal tract output must therefore be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response denoted by $h_f[n, m]$.
  $h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
We have assumed that the two outputs can be linearly combined.
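The generalized convolution of Example 3.5 can be evaluated sample by sample. The sketch below makes several assumptions for illustration: the system is causal with a finite memory of `memory` samples, the impulse responses are supplied as Python callables `h(n, m)` and `h_f(n, m)`, and the example responses are hypothetical decaying exponentials. Since $\delta[n - m]$ is nonzero only at $m = n$, the burst term reduces to $h_f[n, n]$.

```python
def time_varying_output(u, h, h_f, memory):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) delta[n - m].

    h(n, m) and h_f(n, m) give the response at time n to a unit sample
    applied m samples earlier; only 0 <= m < memory is used (assumed
    causal, finite-memory system).
    """
    x = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m] for m in range(min(memory, n + 1)))
        burst = h_f(n, n) if n < memory else 0.0  # delta[n - m] selects m = n
        x.append(periodic + burst)
    return x


if __name__ == "__main__":
    # Hypothetical responses whose decay rate changes slowly with time n.
    def h(n, m):
        return (0.5 + 0.01 * n) ** m

    def h_f(n, m):
        return 0.3 ** m

    u = [1.0 if n % 4 == 0 else 0.0 for n in range(12)]  # toy periodic source
    print([round(v, 4) for v in time_varying_output(u, h, h_f, memory=8)])
```

When `h(n, m)` does not depend on `n`, the first sum reduces to an ordinary convolution, which is a useful sanity check for this kind of model.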

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in “hide”, /W/ as in “out”, /O/ as in “boy”, /JU/ as in “new”

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (/w/ as in “we” and /y/ as in “you”) and
  Liquids (/r/ as in “read” and /l/ as in “let”)
Glides, compared to diphthongs, have
  a greater constriction of the oral tract during the transition and
  a greater speed of oral tract movement.
Figure 3.28: Liquids & Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language Transitional Speech Sounds Affricates are the counterpart of
diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that
the affricates have A fricative portion preceded by a complete
constriction of the oral cavity Formed at the same place as for the plosive
Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs "horseshoe", "sweep" vs "seep".
  Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
Prosody of a language is defined by the rules that define changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/Energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different Meaning, Stress, and Emotion.
Figure 3.29: Prosody
Figure 3.30: Global Coarticulation
Anatomy and Physiology of Speech Production
Example 3.1
Consider a glottal flow waveform model of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P:
$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$
Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by w[n, τ], is centered at time τ, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as
$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega - \omega_k)\right]$$
Example 3.1 (cont.)
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega - \omega_k)\right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where W(ω, τ) is the Fourier transform of w[n, τ], G(ω) is the Fourier transform of g[n], ω_k = (2π/P)k, and 2π/P is the fundamental frequency, or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0, with lower surrounding side lobes.
Effect of the harmonics of the glottal waveform on the spectrum:
Figure 3.7
Example 3.1 (cont.)
A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform W(ω - ω_k, τ), weighted by G(ω_k).
The magnitude of the spectral shaping function, i.e., of the glottal flow, |G(ω)|, is referred to as the spectral envelope of the harmonics.
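A small numerical check can make the harmonic structure of Example 3.1 concrete. The sketch below is only an illustration under assumed values: the pitch period `P`, the window length `N`, and the toy pulse shape `g` are not taken from the slides. It builds u[n] = g[n] * p[n], applies a Hamming window, and evaluates the DFT magnitude, which should peak at the bins corresponding to ω_k = (2π/P)k.

```python
import cmath
import math

P = 16                      # pitch period in samples (assumed value)
N = 128                     # window length: 8 pitch periods (assumed value)
g = [1.0, 0.8, 0.5, 0.2]    # toy one-cycle glottal pulse (assumed shape)

# u[n] = g[n] * p[n]: convolving with the impulse train places a copy of g every P samples
u = [0.0] * N
for start in range(0, N, P):
    for i, gi in enumerate(g):
        u[start + i] += gi

# Hamming analysis window and the windowed segment
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
segment = [wn * un for wn, un in zip(w, u)]


def dft_magnitude(x, k):
    """Magnitude of the len(x)-point DFT of x at bin k."""
    n_fft = len(x)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * n / n_fft)
                   for n, v in enumerate(x)))


mag = [dft_magnitude(segment, k) for k in range(N // 2 + 1)]

# Harmonic k of the pitch falls on DFT bin k * N / P
harmonic_bins = [k * N // P for k in range(1, 4)]
for b in harmonic_bins:
    midway = b + N // (2 * P)   # bin halfway to the next harmonic
    print(b, round(mag[b], 3), round(mag[midway], 3))
```

Each printed row compares the magnitude at a harmonic bin with the magnitude halfway to the next harmonic, where only window side lobes contribute.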
Anatomy and Physiology of Speech Production
The Fourier transform of a periodic glottal waveform is characterized by harmonics.
Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:
  The pitch period can "randomly" vary over successive periods (pitch "jitter").
  The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods (amplitude "shimmer").
These variations are (perhaps) due to:
  Time-varying characteristics of the vocal tract and vocal folds,
  Nonlinear behavior in the speech anatomy, or
  Behavior that appears random while being the result of an underlying deterministic (chaotic) system.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
Voice character is determined by the extent of jitter and shimmer in the voice (e.g., hoarse voice).
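The slides do not give formulas for jitter and shimmer. As a rough illustration, the sketch below uses one commonly used "local" definition: the mean absolute difference between consecutive cycles, divided by the mean value. That definition, the helper name `mean_relative_perturbation`, and the measurements are all assumptions made for this example.

```python
def mean_relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical measurements from four consecutive glottal cycles
periods_ms = [10.0, 10.2, 9.8, 10.0]          # pitch period of each cycle
peak_amplitudes = [1.00, 0.95, 1.05, 1.00]    # peak airflow amplitude of each cycle

jitter = mean_relative_perturbation(periods_ms)
shimmer = mean_relative_perturbation(peak_amplitudes)
print(f"jitter = {jitter:.4f}, shimmer = {shimmer:.4f}")
```

A perfectly periodic source with constant amplitude would give zero for both measures, which corresponds to the machine-like sound described above.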
Anatomy and Physiology of Speech Production
States of the Vocal Folds:
  Breathing
  Voicing
  Unvoicing: turbulence at the vocal folds (aspiration). Example: "he"; whispered sounds.
Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates and part is nearly fixed.
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
  Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
  Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Anatomy and Physiology of Speech Production
Examples of atypical voice types
Vocal Tract
Comprised of the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: Tongue, Teeth, Lips, Jaw.
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
The purpose of the vocal tract is to:
  Spectrally "color" the source, and
  Generate new sources for sound production.
Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract, a pole at z0 = r0 e^{jω0} corresponds approximately to a vocal tract formant:
  The frequency of the formant is ω0.
  The bandwidth depends on the distance of the pole from the unit circle (r0).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
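To make the link between a pole and a formant concrete, the sketch below evaluates a single conjugate pole pair and locates the peak of its magnitude response. The sampling rate, formant frequency, and pole radius are assumed values, and the bandwidth expression B ≈ -(fs/π) ln r0 is a standard approximation for poles close to the unit circle rather than something stated in the slides.

```python
import cmath
import math

fs = 8000.0    # sampling rate in Hz (assumed)
f0 = 500.0     # intended formant frequency in Hz (assumed)
r0 = 0.97      # pole radius, inside the unit circle (assumed)
w0 = 2 * math.pi * f0 / fs
c = r0 * cmath.exp(1j * w0)   # pole z0 = r0 e^{j w0}; its conjugate is the second pole


def magnitude_response(w):
    """|H(e^{jw})| for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1))."""
    z_inv = cmath.exp(-1j * w)
    return abs(1.0 / ((1 - c * z_inv) * (1 - c.conjugate() * z_inv)))


# Locate the spectral peak on a dense grid over [0, pi] (1 Hz spacing here)
grid = [k * math.pi / 4000 for k in range(4001)]
peak_w = max(grid, key=magnitude_response)
peak_hz = peak_w * fs / (2 * math.pi)

# Approximate 3-dB bandwidth of a pole close to the unit circle
bandwidth_hz = -math.log(r0) * fs / math.pi

print(f"peak near {peak_hz:.1f} Hz, bandwidth about {bandwidth_hz:.1f} Hz")
```

The peak lands close to, but not exactly at, the pole angle, which is consistent with the statement that spectral peaks correspond only approximately to formants. Moving r0 toward 1 narrows the bandwidth.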
Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers have lower formants than children.
Under the vocal tract's linearity and time-invariance assumption, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
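The dependence of formants on vocal tract length can be illustrated with a textbook idealization that is not part of these slides: a uniform lossless tube, closed at the glottis and open at the lips, resonates at F_k = (2k - 1)c / (4L). The sketch below assumes c = 340 m/s and compares the 17 cm adult male tract with a hypothetical 14 cm tract.

```python
SPEED_OF_SOUND = 340.0   # m/s, approximate value in air (assumed)


def uniform_tube_formants(length_m, count=3):
    """Resonances F_k = (2k - 1) c / (4 L) of a uniform tube closed at one end."""
    return [(2 * k - 1) * SPEED_OF_SOUND / (4.0 * length_m)
            for k in range(1, count + 1)]


adult_male = uniform_tube_formants(0.17)      # 17 cm vocal tract
shorter_tract = uniform_tube_formants(0.14)   # hypothetical shorter vocal tract

print([round(f) for f in adult_male])
print([round(f) for f in shorter_tract])
```

Every resonance scales inversely with L in this model, so the shorter tube has uniformly higher formants. A real vocal tract is not uniform, so these values are only a rough guide.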
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained:
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[\sum_{k=-\infty}^{\infty} H(\omega)\,G(\omega)\,\delta(\omega - \omega_k)\right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\,G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency, or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, consisting of glottal and vocal tract contributions (unlike Example 3.1, which has only the glottal contribution).
Example 3.2 (Figure 3.11)
Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
  The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  The manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because of the sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.
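The source-filter structure of Example 3.2 can be sketched with plain discrete convolution. The code below is a minimal illustration with made-up sequences (`g`, `h`, and the pitch period `P` are assumptions). It checks that filtering the periodic source, h[n] * (g[n] * p[n]), gives the same output as exciting the combined response g[n] * h[n] with the impulse train, which is why the spectral envelope is the product |H(ω)G(ω)|.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


P = 20                                                   # pitch period in samples (assumed)
p = [1.0 if n % P == 0 else 0.0 for n in range(100)]     # impulse train p[n]
g = [0.2, 0.6, 1.0, 0.4]                                 # toy glottal pulse g[n] (assumed)
h = [0.9 ** n for n in range(30)]                        # toy vocal tract response h[n] (assumed)

x_source_first = convolve(h, convolve(g, p))   # x[n] = h[n] * (g[n] * p[n])
h_combined = convolve(g, h)                    # combined response g[n] * h[n]
x_combined = convolve(h_combined, p)           # same output via (g * h) * p

max_diff = max(abs(a - b) for a, b in zip(x_source_first, x_combined))
print(len(x_source_first), len(x_combined), max_diff)
```

Away from the start and end of the sequence, the output repeats every P samples, since it is a sum of shifted copies of the combined response.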
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
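The effect in Example 3.3 can be approximated with a single two-pole resonator standing in for the first formant. The sketch below is only a rough model: the sampling rate, pole radius, sung pitch of 880 Hz, and untuned F1 of 500 Hz are all assumed values. It compares the gain applied to the first harmonic before and after F1 is moved onto it.

```python
import cmath
import math

FS = 16000.0   # sampling rate in Hz (assumed)


def formant_gain(formant_hz, freq_hz, radius=0.97):
    """Gain at freq_hz of a two-pole resonator with a pole pair at formant_hz."""
    c = radius * cmath.exp(2j * math.pi * formant_hz / FS)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / FS)
    return abs(1.0 / ((1 - c * z_inv) * (1 - c.conjugate() * z_inv)))


first_harmonic_hz = 880.0                            # sung pitch (assumed)
untuned = formant_gain(500.0, first_harmonic_hz)     # F1 left at 500 Hz (assumed)
tuned = formant_gain(880.0, first_harmonic_hz)       # F1 raised to the first harmonic

print(f"gain with F1 = 500 Hz: {untuned:.1f}")
print(f"gain with F1 = 880 Hz: {tuned:.1f}")
print(f"level change: {20 * math.log10(tuned / untuned):.1f} dB")
```

A real vocal tract has several formants and a non-flat source spectrum, so the size of the boost printed here is illustrative only.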
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
In the previous section, the effect of the vocal tract shape on sound production was discussed.
In Figure 3.10(b), a complete closure of the tract (the tongue pressing against the palate) is depicted. This closure is required when making an impulsive sound (plosive):
  Build-up of pressure behind the palate, and
  Abrupt release of pressure.
Source Generation: Plosives ("Drop")
[Figure annotations: VOT (voice onset time), Aspiration]
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (without completely blocking the airflow); the resulting turbulence acts as a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by |H(ω)|.
Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives ("NASA")
Source Generation
There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source.
There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, noisy and impulsive sources are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus, the above subcategories may also exist for voiced sounds. Examples:
  "zebra" vs "sheba" (fricatives)
  "bin" vs "pin" (plosives)
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window
$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$
The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau]\, x[n]$$
represents the windowed speech segments as a function of the window center at time τ.
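The Hamming window and the short-time Fourier transform above can be sketched directly from their definitions. The code below is a minimal, unoptimized illustration: the function names, the test signal, the window length, and the hop of 32 samples are all assumptions. It samples X(ω, τ) at ω = 2πk/n_fft for a window whose support starts at time τ.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2 pi n / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft_frame(x, tau, n_w, n_fft):
    """Samples of X(w, tau) at w = 2 pi k / n_fft for the window starting at time tau."""
    w = hamming(n_w)
    frame = []
    for k in range(n_fft):
        omega = 2 * math.pi * k / n_fft
        frame.append(sum(w[i] * x[tau + i] * cmath.exp(-1j * omega * (tau + i))
                         for i in range(n_w)))
    return frame


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda k: values[k])


# Test signal: a cosine that completes 10 cycles every 64 samples (assumed)
x = [math.cos(2 * math.pi * 10 * n / 64) for n in range(256)]

hop = 32   # frame interval in samples (assumed)
frames = [stft_frame(x, tau, n_w=64, n_fft=64)
          for tau in range(0, len(x) - 64 + 1, hop)]
magnitude_squared = [[abs(v) ** 2 for v in frame[:33]] for frame in frames]
peak_bins = [argmax(row) for row in magnitude_squared]
print(len(frames), peak_bins)
```

The squared magnitudes computed here are what a spectrogram displays. In practice an FFT routine would replace the explicit sums.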
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
S(ω, τ) is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position τ, one could plot S(ω, τ).
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \tilde{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\tilde{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n, τ].
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping, and that the corresponding transform side lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
Spectrographic Analysis of Speech
Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
This widening causes the neighboring window transforms at the harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window,
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 (next slide) compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
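The contrast between the 20-ms and 4-ms windows can be checked numerically on an idealized source. The sketch below assumes a sampling rate of 8 kHz and a pitch of 200 Hz (a 5-ms pitch period), and uses a bare impulse train in place of real speech. For each window length it compares the spectral magnitude on a harmonic with the magnitude midway between two harmonics.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def spectrum_magnitude(x, start, n_w, freq_hz, fs):
    """|X(w, tau)| at one frequency for a Hamming window covering x[start : start + n_w]."""
    w = hamming(n_w)
    omega = 2 * math.pi * freq_hz / fs
    return abs(sum(w[i] * x[start + i] * cmath.exp(-1j * omega * (start + i))
                   for i in range(n_w)))


fs = 8000                      # sampling rate in Hz (assumed)
f0 = 200                       # pitch in Hz (assumed), so the pitch period is 5 ms
P = fs // f0                   # pitch period in samples
x = [1.0 if n % P == 0 else 0.0 for n in range(400)]   # idealized periodic excitation


def harmonic_contrast(n_w, start):
    """Ratio of the magnitude at the 2nd harmonic to the magnitude midway to the 3rd."""
    on_harmonic = spectrum_magnitude(x, start, n_w, 2.0 * f0, fs)
    between = spectrum_magnitude(x, start, n_w, 2.5 * f0, fs)
    return on_harmonic / between


narrowband = harmonic_contrast(round(0.020 * fs), start=0)    # 20 ms: four pitch periods
wideband = harmonic_contrast(round(0.004 * fs), start=30)     # 4 ms: less than one period

print(f"narrowband contrast: {narrowband:.1f}")
print(f"wideband contrast:   {wideband:.3f}")
```

With the long window, the harmonic stands far above the valley between harmonics, so harmonic lines are resolved. With the short window, only one pulse falls under the window and the magnitude is the same at both frequencies, so no harmonic structure is visible.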
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram. Axes: Time [s] (0 to 3), Frequency [Hz] (0 to 8000).]
MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word-level and phone-level labels.]
MATLAB
[Figure: spectrogram. Axes: Time [s] (0 to 3), Frequency [Hz] (0 to 8000).]
MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word-level and phone-level labels.]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Classification of speech sounds can also be done from the following perspectives:
1. The nature of the source: Periodic, Noisy, Impulsive, or a Combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors (the nature of the source and the shape of the vocal tract) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (the time-domain waveform and the spectrogram) are used, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of: Vowels, Consonants, Diphthongs, Affricates, and Semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: Vibrating or Open
  Tongue position and height: Front, Central, or Back along the palate
  Constriction: Partial or Complete
  Velum state: Nasal sound or Not a nasal sound
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
Rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are generally not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by: Adjacent phonemes, Speaking rate, Emphasis in speaking, and the Time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
Articulatory differences among speakers are one cause of allophonic variations:
  The place and degree of constriction of the tongue hump, and
  The cross-section and length of the vocal tract.
Therefore, the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: Voiced and Unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System:
  The location of the constriction by the tongue or lips determines which sound is produced: the Back, Center, or Front of the oral tract, as well as the Teeth or Lips.
Spectrogram:
  Noise-like.
  Energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the periodic signal u[n] excites a linear time-invariant vocal tract with impulse response h[n], giving the output at the lips
$$x_g[n] = h[n] * (g[n] * p[n])$$
The noise source component, i.e., the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity, which has impulse response hf[n]:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
Example 3.4 (cont.)
In the simplified model, we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  x_q[n] can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
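The simplified model of Example 3.4 translates almost line by line into code. The sketch below is illustrative only: the pulse shape `g`, the impulse responses `h` and `h_f`, the pitch period, and the use of seeded Gaussian noise for q[n] are all assumptions. It forms x[n] = h[n] * u[n] + hf[n] * (q[n] u[n]).

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
P = 40                                 # pitch period in samples (assumed)
n_samples = 200

g = [0.3, 0.8, 1.0, 0.6, 0.2]          # toy glottal pulse g[n] (assumed)
p = [1.0 if n % P == 0 else 0.0 for n in range(n_samples)]
u = convolve(g, p)[:n_samples]         # periodic glottal flow u[n] = g[n] * p[n]

q = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]     # white noise source q[n]
modulated_noise = [qn * un for qn, un in zip(q, u)]     # q[n] u[n]: noise gated by the flow

h = [0.95 ** n for n in range(40)]     # toy vocal tract response h[n] (assumed)
h_f = [0.5 ** n for n in range(10)]    # toy front-cavity response hf[n] (assumed)

x_g = convolve(h, u)                   # periodic component at the lips
x_q = convolve(h_f, modulated_noise)   # noise component at the lips
x = [x_g[n] + x_q[n] for n in range(n_samples)]   # the two components are assumed to add

print(len(x), sum(1 for v in modulated_noise if v != 0.0))
```

Because the noise is multiplied by u[n], it is present only while the glottal flow is nonzero, so the noise component comes in bursts at the pitch rate.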
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Forms a class of its own under the general category of Consonants.
Turbulent flow is produced at the glottis, rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be Voiced or Unvoiced.
System: the constriction can occur at the Front, Center, or Back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel, about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps:
  During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the onset of voicing of the vowel.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (Waveform)
Elements of a Language: Plosives (Spectrogram)
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction, with impulse response hf[n, m].
h[n, m] and hf[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n-m]$$
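The generalized convolution of Example 3.5 can be sketched by treating each time-varying impulse response as a function of (n, m). The code below is a hedged illustration: the particular forms of `h` and `h_f`, the pulse shape, and the pitch period are invented for the example, and the systems are assumed causal with signals starting at n = 0, so the sum over m runs from 0 to n.

```python
def time_varying_output(h, u, n_samples):
    """x[n] = sum_m h(n, m) u[n - m], where h(n, m) is the response at time n
    to a unit sample applied m samples earlier (causal, signals start at n = 0)."""
    return [sum(h(n, m) * u[n - m] for m in range(n + 1)) for n in range(n_samples)]


N = 60                       # number of output samples (assumed)
P = 15                       # pitch period in samples (assumed)
g = [0.5, 1.0, 0.5]          # toy glottal pulse (assumed)

# Periodic source u[n] = g[n] * p[n]
u = [0.0] * N
for start in range(0, N, P):
    for i, gi in enumerate(g):
        if start + i < N:
            u[start + i] += gi

delta = [1.0] + [0.0] * (N - 1)   # idealized burst at n = 0


def h(n, m):
    """Toy vocal tract response whose decay rate drifts with time n (assumed form)."""
    return (0.6 + 0.3 * n / N) ** m


def h_f(n, m):
    """Toy front-cavity response, also slowly varying (assumed form)."""
    return (0.4 + 0.1 * n / N) ** m


voiced = time_varying_output(h, u, N)        # sum_m h[n, m] u[n - m]
burst = time_varying_output(h_f, delta, N)   # sum_m hf[n, m] delta[n - m] = hf[n, n]
x = [a + b for a, b in zip(voiced, burst)]   # the two outputs are linearly combined

print(round(x[0], 3), round(x[N - 1], 6))
```

If h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is the time-invariant case used in the earlier examples.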
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
Two categories of vowel-like sounds:
  Glides (w as in "we" and y as in "you") and
  Liquids (r as in "read" and l as in "let")
Glides, compared to diphthongs, have
  Greater constriction of the oral tract during the transition and
  Greater speed of the oral tract movement
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have
  A fricative portion preceded by a complete constriction of the oral cavity
  The constriction formed at the same place as for the plosive
Examples: tS as in "chew" (unvoiced), J as in "just" (voiced)
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local - articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: "horse" vs. "horseshoe", "sweep" vs. "seep"
  Global - articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/Energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Example 3.1
Consider a glottal flow waveform model of the form
$u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow waveform over a single cycle and $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$ is an impulse train with spacing $P$.
Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as
$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$
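Example 3.1
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.
Effect of the harmonics of the glottal waveform on the spectrum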
Figure 3.7
Example 3.1
A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
The magnitude of the spectral shaping function, i.e., the glottal flow $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
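The relation between pitch period and harmonic spacing can be checked numerically. The sketch below assumes an 8 kHz sampling rate and two illustrative pitch periods (none of these values come from the slides) and lists the harmonic frequencies $k \cdot f_s / P$, the Hz equivalent of $\omega_k = (2\pi/P)k$.

```python
def harmonic_frequencies_hz(pitch_period_samples, fs_hz, max_hz):
    """Return the harmonic frequencies k * fs / P (in Hz) up to max_hz, k = 1, 2, ..."""
    f0 = fs_hz / pitch_period_samples      # fundamental: omega_1 = 2*pi/P rad/sample
    count = int(max_hz // f0)
    return [k * f0 for k in range(1, count + 1)]


# Assumed values for illustration: fs = 8000 Hz, P = 80 and P = 40 samples.
low_pitch = harmonic_frequencies_hz(80, 8000.0, 1000.0)    # f0 = 100 Hz
high_pitch = harmonic_frequencies_hz(40, 8000.0, 1000.0)   # f0 = 200 Hz
print(low_pitch[:3])
print(high_pitch[:3])
```

Halving the pitch period doubles the spacing, so half as many harmonics fall below 1 kHz.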
Anatomy and Physiology of Speech Production
The Fourier transform of the periodic glottal waveform is characterized by harmonics.
Typically the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  The pitch period can "randomly" vary over successive periods: pitch "jitter".
  The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
These variations are (perhaps) due to
  Time-varying characteristics of the vocal tract and vocal folds,
  Nonlinear behavior in the speech anatomy, or
  An underlying deterministic (chaotic) system whose output only appears random.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
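As a rough illustration of jitter and shimmer, the sketch below perturbs the ideal impulse train $p[n]$ of Example 3.1. The uniform random perturbation and the parameter names `jitter` and `shimmer` are assumptions made for this example, not a model given in the slides.

```python
import random


def jittered_impulse_train(num_periods, pitch_period, jitter, shimmer, seed=0):
    """Impulse train whose period and amplitude vary randomly from cycle to cycle.

    jitter  : maximum fractional deviation of each period (0 gives a fixed period)
    shimmer : maximum fractional deviation of each pulse amplitude
    """
    rng = random.Random(seed)
    positions, amplitudes = [], []
    n = 0
    for _ in range(num_periods):
        positions.append(n)
        amplitudes.append(1.0 + shimmer * rng.uniform(-1.0, 1.0))
        period = pitch_period * (1.0 + jitter * rng.uniform(-1.0, 1.0))
        n += max(1, round(period))
    p = [0.0] * n
    for pos, amp in zip(positions, amplitudes):
        p[pos] = amp
    return p, positions


# Ideal train versus a slightly irregular one (2% jitter, 5% shimmer, assumed values).
ideal, ideal_pos = jittered_impulse_train(5, 80, jitter=0.0, shimmer=0.0)
natural, natural_pos = jittered_impulse_train(5, 80, jitter=0.02, shimmer=0.05)
print(ideal_pos)
print(natural_pos)
```

Convolving either train with a glottal pulse $g[n]$ gives the corresponding source $u[n]$; the ideal train corresponds to the monotone, machine-like case.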
Anatomy and Physiology of Speech Production
States of the vocal folds:
  Breathing
  Voicing
  Unvoicing - turbulence at the vocal folds (aspiration). Example: "he"; whispered sounds.
Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
Creaky voice - very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has
  High pitch and
  Irregular pitch
Vocal fry - vocal folds are massy and relaxed, resulting in a voice with an abnormally
  Low pitch and
  Irregular pitch
  Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse; a result of coupling of the false vocal folds with the true vocal folds
Diplophonic voice - secondary glottal pulses occur between the primary pulses within the closed phase (see Figure 3.9b and Figure 3.16)
Anatomy and Physiology of Speech Production
Examples of atypical voice types
Vocal Tract
Comprises the oral cavity
  From the larynx
  To the lips
  Including the nasal passage, which is coupled to the oral tract by way of the velum
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators:
  Tongue
  Teeth
  Lips
  Jaw
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
The purpose of the vocal tract is to
  Spectrally "color" the source and
  Generate new sources for sound production
Spectral Shaping
Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
  The frequency of the formant is $\omega_0$.
  The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
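To make the pole-to-formant relation concrete, the sketch below converts a pole $r_0 e^{j\omega_0}$ into a frequency and bandwidth in Hz. The bandwidth formula $B \approx -\ln(r_0)\, f_s / \pi$ is a common approximation for a pole close to the unit circle and is an assumption here; the slides state only that the bandwidth depends on $r_0$. The 8 kHz sampling rate and the pole values are illustrative.

```python
import cmath
import math


def pole_to_formant(pole, fs_hz):
    """Map a z-plane pole r0 * exp(j * w0) to (frequency, approximate bandwidth) in Hz."""
    r0 = abs(pole)
    w0 = abs(cmath.phase(pole))                       # radians per sample
    freq_hz = w0 * fs_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * fs_hz / math.pi    # narrow-band approximation
    return freq_hz, bandwidth_hz


fs = 8000.0                                           # assumed sampling rate
pole = cmath.rect(0.98, 2.0 * math.pi * 500.0 / fs)   # pole aimed at a 500 Hz formant
freq, bw = pole_to_formant(pole, fs)
print(freq, bw)
```

Moving the pole closer to the unit circle (larger $r_0$) narrows the bandwidth, giving a sharper resonance peak.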
Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.
Under the vocal tract linearity and time-invariance assumptions, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
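The convolution model can be sketched in a few lines of Python. Everything numeric below is an assumption chosen for illustration: the 8 kHz sampling rate, the five-sample stand-in for one glottal cycle, and a vocal tract reduced to a single two-pole resonator at 500 Hz.

```python
import math


def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def resonator_impulse_response(formant_hz, r, fs_hz, length):
    """h[n] = r^n sin((n + 1) w0) / sin(w0): two-pole resonator with poles r e^{+-j w0}."""
    w0 = 2.0 * math.pi * formant_hz / fs_hz
    return [r ** n * math.sin((n + 1) * w0) / math.sin(w0) for n in range(length)]


fs = 8000.0
P = 80                                                   # pitch period in samples (100 Hz)
p = [1.0 if n % P == 0 else 0.0 for n in range(4 * P)]   # impulse train p[n]
g = [0.25, 0.5, 1.0, 0.5, 0.25]                          # crude stand-in for one glottal cycle
h = resonator_impulse_response(500.0, 0.97, fs, 200)

u = convolve(g, p)                                       # glottal flow u[n] = g[n] * p[n]
x = convolve(h, u)                                       # speech x[n] = h[n] * u[n]
print(len(u), len(x))
```

A realistic vocal tract has several formants; cascading more resonators (convolving their impulse responses) extends the same sketch.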
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$u[n] = g[n] * p[n]$
where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by
$x[n] = h[n] * (g[n] * p[n])$
A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment
$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained, giving its frequency-domain representation.
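Example 3.2
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega) G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).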
Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by
  The manner in which formant tails add
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of sparse sampling of the spectral envelope $|H(\omega) G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
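The sparse-sampling effect can be illustrated by asking how far a formant lies from the nearest harmonic. The formant and pitch values below are assumptions chosen for illustration only.

```python
def nearest_harmonic_offset_hz(formant_hz, f0_hz):
    """Distance in Hz from a formant to the closest harmonic k * f0 (k = 1, 2, ...)."""
    k = max(1, round(formant_hz / f0_hz))
    candidates = [k_i * f0_hz for k_i in (k - 1, k, k + 1) if k_i >= 1]
    return min(abs(c - formant_hz) for c in candidates)


formant = 500.0                      # assumed first-formant frequency in Hz
for f0 in (100.0, 350.0):            # a low and a high pitch, both assumed
    print(f0, nearest_harmonic_offset_hz(formant, f0))
```

With a 100 Hz pitch a harmonic lands exactly on the 500 Hz formant, whereas with a 350 Hz pitch the closest harmonic is 150 Hz away, so the peak of the envelope is never sampled.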
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can perhaps be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12 (next slide), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
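The gain from matching F1 to the first harmonic can be estimated with a single two-pole resonator. The sampling rate, pole radius, 700 Hz sung pitch, and the two F1 values are all assumptions for illustration; a real vocal tract has several formants.

```python
import cmath
import math


def resonator_gain(freq_hz, formant_hz, r, fs_hz):
    """|H| at freq_hz for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)), with c = r e^{j w0}."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    w0 = 2.0 * math.pi * formant_hz / fs_hz
    z_inv = cmath.exp(-1j * w)
    c = cmath.rect(r, w0)
    return 1.0 / abs((1.0 - c * z_inv) * (1.0 - c.conjugate() * z_inv))


fs = 8000.0
first_harmonic = 700.0                                        # assumed sung pitch in Hz
unmatched = resonator_gain(first_harmonic, 400.0, 0.97, fs)   # F1 well below the harmonic
matched = resonator_gain(first_harmonic, 700.0, 0.97, fs)     # F1 raised to the harmonic
print(unmatched, matched)
```

In this toy setting the matched configuration passes the first harmonic with several times the gain of the unmatched one.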
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage
Plosives
Source Generation
In the previous section, the effect of vocal tract shape on sound production was discussed.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosives):
  Build-up of pressure behind the palate and
  Abrupt release of pressure
Source Generation: Plosives, "Drop" (figure labels: VOT, Aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the spectrum of the noise was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives - sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin"
  Plosives - created with an impulsive source within the oral tract. Example: "top"
  Whispers - a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"
However, fricatives and plosives are not exclusively unvoiced: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
  "zebra" vs. "sheba" - fricatives
  "bin" vs. "pin" - plosives
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to".
A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example - Hamming window:
$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n \le N_w - 1$
The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$x[n, \tau] = w[n, \tau]\, x[n]$
represents the windowed speech segments as a function of the window center at time $\tau$.
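A small pure-Python sketch of the sliding-window analysis follows. It is an illustration under stated assumptions rather than the slides' implementation: the frequency axis is sampled with a DFT of `num_freqs` points, the phase is measured from the start of each window (which leaves the magnitude unchanged), and the function names are hypothetical.

```python
import cmath
import math


def hamming(num_samples):
    """Hamming window 0.54 - 0.46 cos(2 pi n / (Nw - 1)) for 0 <= n <= Nw - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def stft_frame(x, start, window, num_freqs):
    """DFT of the windowed segment x[start : start + len(window)]."""
    segment = [w * x[start + i] for i, w in enumerate(window)]
    return [sum(s * cmath.exp(-2j * math.pi * k * i / num_freqs)
                for i, s in enumerate(segment))
            for k in range(num_freqs)]


def spectrogram(x, window, hop, num_freqs):
    """Squared STFT magnitude per frame, for bins 0 .. num_freqs / 2."""
    frames = []
    for start in range(0, len(x) - len(window) + 1, hop):
        spectrum = stft_frame(x, start, window, num_freqs)
        frames.append([abs(v) ** 2 for v in spectrum[: num_freqs // 2 + 1]])
    return frames


# Test signal: a sinusoid sitting exactly on DFT bin 8 of a 64-point analysis.
x = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
w = hamming(64)
S = spectrogram(x, w, hop=32, num_freqs=64)
peak_bins = [frame.index(max(frame)) for frame in S]
print(len(S), peak_bins)
```

The `hop` argument plays the role of the frame interval: a smaller hop gives a higher frame rate at the cost of more computation.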
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband - gives good spectral resolution: a good view of the frequency content of sine waves with closely spaced frequencies
  Wideband - gives good temporal resolution: a good view of the temporal content of impulses closely spaced in time
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$
$x[n, \tau] = w[n, \tau]\,(p[n] * \tilde{h}[n])$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\tilde{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
  The main lobes of the shifted window Fourier transforms are non-overlapping and
  The corresponding transform side lobes are negligible,
the following approximation follows from the equation on the previous slide (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
Widening the window transform causes the transforms at neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and where $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 (next slide) compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
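The effect of the two window lengths can be estimated with a rule of thumb that is an assumption here, not a statement from the slides: the null-to-null main-lobe width of a Hamming window is roughly 4 divided by its duration. The 200 Hz pitch is also an assumed value.

```python
def hamming_main_lobe_hz(window_ms):
    """Approximate null-to-null main-lobe width of a Hamming window: about 4 / duration."""
    return 4000.0 / window_ms


def harmonics_resolved(window_ms, pitch_hz):
    """True when main lobes centered on adjacent harmonics do not overlap."""
    return hamming_main_lobe_hz(window_ms) <= pitch_hz


pitch_hz = 200.0                       # assumed pitch; the pitch period is 5 ms
for window_ms in (20.0, 4.0):          # the two window lengths quoted for Figure 3.15
    print(window_ms, hamming_main_lobe_hz(window_ms), harmonics_resolved(window_ms, pitch_hz))
```

Under this strict non-overlap criterion a Hamming window needs about four pitch periods, so the 20-ms window just resolves 200 Hz harmonics, while the 4-ms window (shorter than one period) merges them into the envelope.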
Figure 3.15
Figure 3.16
MATLAB
[Figure: SPECTROGRAM; Time [s] from 0 to 3, Frequency [Hz] from 0 to 8000]
MATLAB
[Figure: waveform (Normalized Magnitude) and SPECTROGRAM (Frequency [Hz] from 0 to 8000) versus Time [s] from 0 to 3 for the sentence "She had your dark suit in greasy wash water all year", annotated with word and phone labels]
MATLAB
[Figure: SPECTROGRAM; Time [s] from 0 to 3, Frequency [Hz] from 0 to 8000]
MATLAB
[Figure: waveform (Normalized Magnitude) and SPECTROGRAM (Frequency [Hz] from 0 to 8000) versus Time [s] from 0 to 3 for the sentence "She had your dark suit in greasy wash water all year", annotated with word and phone labels]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
2. The shape of the vocal tract (place and manner of articulation):
  The place of the tongue hump along the oral tract and
  The degree of constriction of the hump
  The shape is also determined by a possible connection to the nasal passage by way of the velum
3. The time-domain waveform, which gives the pressure change with time at the lips output
4. The time-varying spectral characteristics revealed through the spectrogram
Elements of a Language
Phoneme - a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets:
  Syllables contain one or more phonemes
  Words are formed from one or more syllables
  Phrases are concatenations of words
If the first two classification perspectives (the source and the vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two (the waveform and the spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of
  Vowels
  Consonants
  Diphthongs
  Affricates and
  Semi-vowels
This classification is illustrated in Figure 3.17 (next slide).
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not normally allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking and
  The time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception - uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic
  Pitch is not important for categorizing a sound in English; however, in Mandarin Chinese some sounds are interpreted based on the pitch (tonal languages).
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variations:
  The place and degree of constriction of the tongue hump and
  The cross-section and length of the vocal tract
  Therefore the vocal tract formants will vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
    Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract.
  Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  Vocal folds are relaxed and not vibrating for unvoiced fricatives.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System: the location of the constriction by the tongue or lips determines which sound is produced:
  Back, center, or front of the oral tract, as well as
  The teeth or lips
Spectrogram: noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
A simplified model of the voiced-fricative output at the lips due to the periodic source is
$x_g[n] = h[n] * (g[n] * p[n])$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, with impulse response $h_f[n]$:
$x_q[n] = h_f[n] * (q[n]\, u[n])$
Example 3.4
We assume in the simplified model that the results of the two airflow sources add:
$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity
  $x_q[n]$ can be influenced by the back cavity
  Sources of nonlinear effects (distributed sources due to traveling vortices)
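The two-source model of Example 3.4 can be sketched directly. The Gaussian white noise, the short impulse responses, and the five-sample glottal pulse below are all assumed for illustration; only the structure $x[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$ comes from the example.

```python
import random


def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def voiced_fricative(u, h, hf, seed=0):
    """x[n] = h[n] * u[n] + hf[n] * (q[n] u[n]), with q[n] white Gaussian noise."""
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in u]                  # noise source at the constriction
    modulated = [q_n * u_n for q_n, u_n in zip(q, u)]     # glottal flow modulates the noise
    xg = convolve(h, u)                                   # periodic component
    xq = convolve(hf, modulated)                          # noise component
    length = max(len(xg), len(xq))
    xg += [0.0] * (length - len(xg))
    xq += [0.0] * (length - len(xq))
    return [a + b for a, b in zip(xg, xq)]


# Three glottal cycles with pitch period P = 40 samples (made-up pulse shape).
P = 40
g = [0.2, 0.6, 1.0, 0.6, 0.2]
u = [0.0] * (3 * P)
for start in range(0, 3 * P, P):
    for i, g_i in enumerate(g):
        u[start + i] = g_i
x = voiced_fricative(u, h=[1.0, 0.5, 0.25], hf=[1.0, -0.9])
print(len(x))
```

Because the noise is multiplied by $u[n]$, it is present only while the glottis is open, which gives the noise component a pitch-synchronous envelope.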
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while
  Voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language Plosives Waveform
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during the transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary (time-invariant) convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction with impulse response $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n-m]$$
We have assumed that the two outputs can be linearly combined.
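A minimal sketch of the time-varying superposition sum in Example 3.5 may help. The callables `h` and `hf`, the drifting one-pole responses, and all numeric values are hypothetical choices for illustration only; they are not taken from the lecture.

```python
def time_varying_output(h, hf, u, burst, n_len):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) burst[n-m].

    h(n, m) and hf(n, m) return the response at time n to a unit sample
    applied m samples earlier (hypothetical interface for this sketch).
    Inputs are assumed to be zero before n = 0.
    """
    x = []
    for n in range(n_len):
        acc = 0.0
        for m in range(n + 1):
            acc += h(n, m) * u[n - m] + hf(n, m) * burst[n - m]
        x.append(acc)
    return x


def h(n, m):
    # Toy vocal tract: one-pole decay whose pole radius drifts with time n
    # (made-up values standing in for a changing vocal tract shape).
    return (0.5 + 0.004 * n) ** m


def hf(n, m):
    # Toy front-cavity response, kept fixed for simplicity (made-up value).
    return 0.3 ** m


N, P = 40, 10
u = [1.0 if n % P == 0 else 0.0 for n in range(N)]  # impulse train standing in for g[n] * p[n]
burst = [1.0] + [0.0] * (N - 1)                     # idealized burst, delta[n]
x = time_varying_output(h, hf, u, burst, N)
```

When `h(n, m)` does not depend on n, the inner sum reduces to ordinary convolution, which is the time-invariant special case.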
Elements of a Language: Transitional Speech Sounds
Diphthongs
- Vowel-like nature, with vibrating vocal folds
- Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations
- Characterized by movement from one vowel target to another
- Examples: "hide" /Y/, "out" /W/, "boy" /O/, "new" /JU/
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
- Two categories of vowel-like sounds:
  - Glides (/w/ as in "we" and /y/ as in "you")
  - Liquids (/r/ as in "read" and /l/ as in "let")
- Glides, compared to diphthongs, have
  - Greater constriction of the oral tract during the transition
  - Greater speed of the oral tract movement
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates
- The counterpart of diphthongs, consisting of consonant plosive-fricative combinations
- Unlike fricatives, affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive
- Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced)
Coarticulation
- Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
- Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
- Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
- Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  - Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep"
  - Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes
Prosody: The Melody of Speech
- The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  - Intonation (change in pitch)
  - Amplitude/energy (loudness)
  - Timing (articulation rate or rhythm)
- These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
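Example 3.1
$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of g[n], and $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency or pitch.
As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$, with lower surrounding side lobes.
Effect of the harmonics of the glottal waveform on the spectrum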
Figure 3.7
Example 3.1
- A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
- The first harmonic is also the fundamental frequency.
- At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$, weighted by $G(\omega_k)$.
- The magnitude of the spectral shaping function, i.e. of the glottal flow, $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
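The relation between pitch period and harmonic spacing can be checked numerically. The sketch below assumes a pitch period given in samples and a sampling rate `fs_hz`; both the function name and the example values are illustrative assumptions, not part of the lecture.

```python
def harmonic_frequencies_hz(pitch_period_samples, fs_hz):
    """Harmonics k * f0 up to fs/2, with f0 = fs / P (i.e. omega_k = 2*pi*k/P)."""
    f0 = fs_hz / pitch_period_samples
    n_harmonics = int((fs_hz / 2) // f0)
    return [k * f0 for k in range(1, n_harmonics + 1)]


long_period = harmonic_frequencies_hz(80, 8000.0)   # P = 80 samples: f0 = 100 Hz
short_period = harmonic_frequencies_hz(40, 8000.0)  # P = 40 samples: f0 = 200 Hz
```

Halving the pitch period doubles the spacing between harmonics, so fewer harmonics fall below half the sampling rate.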
Anatomy and Physiology of Speech Production
- The Fourier transform of the periodic glottal waveform is characterized by harmonics.
- Typically the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  - The rolloff depends on the nature of the airflow and on speaker characteristics.
  - See Exercise 3.18 for further details.
- The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  - The pitch period can vary "randomly" over successive periods: pitch "jitter"
  - The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer"
- Those variations are (perhaps) due to
  - Time-varying characteristics of the vocal tract and vocal folds
  - Nonlinear behavior in the speech anatomy, or
  - An underlying deterministic (chaotic) system whose output only appears random
- Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
- Voice character is influenced by the extent of jitter and shimmer in the voice (e.g., hoarse voice).
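One simple way to put numbers on jitter and shimmer is the mean absolute difference between consecutive cycles, relative to the mean. This particular measure is an assumption chosen for illustration (the lecture does not define one), and the pitch periods and peak amplitudes below are made-up values.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by the mean value."""
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]       # made-up pitch periods per glottal cycle
peaks = [1.00, 0.95, 1.02, 0.97, 1.01]       # made-up peak amplitudes per glottal cycle

jitter = relative_perturbation(periods_ms)   # cycle-to-cycle pitch period variation
shimmer = relative_perturbation(peaks)       # cycle-to-cycle amplitude variation
steady = relative_perturbation([8.0] * 5)    # idealized, perfectly periodic source
```

The idealized source of Example 3.1 corresponds to the `steady` case, where the measure is exactly zero.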
Anatomy and Physiology of Speech Production
States of the vocal folds:
- Breathing
- Voicing
- Unvoicing
  - Turbulence at the vocal folds: aspiration
  - Example: "he"; whispered sounds
  - Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part remains nearly fixed
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with
  - High pitch
  - Irregular pitch
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with
  - Abnormally low pitch
  - Irregular pitch
  - Secondary glottal pulses close to and overlapping the primary glottal pulse
  - Result of coupling of the false vocal folds with the true vocal folds
- Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16)
Anatomy and Physiology of Speech Production
Examples of atypical voice types
Vocal Tract
- Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum
- The oral tract takes on many different lengths and cross-sections; this is accomplished by moving the articulators: tongue, teeth, lips, jaw
- Average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2
- The purpose of the vocal tract is to
  - Spectrally "color" the source, and
  - Generate new sources for sound production
Spectral Shaping
- Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
- The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
- Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
- The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
- For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
  - The frequency of the formant is $\omega_0$
  - The bandwidth depends on the distance of the pole from the unit circle (i.e., on $r_0$)
- Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
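As a rough illustration of how a pole maps to a formant, the sketch below converts a pole $z_0 = r_0 e^{j\omega_0}$ to a center frequency and an approximate 3-dB bandwidth in Hz. The bandwidth formula $B \approx -\ln(r_0)\, f_s / \pi$ is a common approximation for poles close to the unit circle and is an added assumption here, as are the sampling rate and pole values.

```python
import cmath
import math


def pole_to_formant(pole, fs_hz):
    """Return (center frequency in Hz, approximate 3-dB bandwidth in Hz) for one pole."""
    r0 = abs(pole)
    w0 = abs(cmath.phase(pole))               # pole angle in radians/sample
    freq_hz = w0 * fs_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * fs_hz / math.pi
    return freq_hz, bandwidth_hz


def conjugate_pair_gain(pole, w):
    """|H(e^{jw})| for one conjugate pole pair with unit numerator."""
    z_inv = cmath.exp(-1j * w)
    return 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))


fs = 10000.0                                               # assumed sampling rate
pole = 0.95 * cmath.exp(1j * 2.0 * math.pi * 500.0 / fs)   # made-up pole near 500 Hz
freq, bw = pole_to_formant(pole, fs)
w0 = 2.0 * math.pi * 500.0 / fs
gain_at_formant = conjugate_pair_gain(pole, w0)
gain_off_formant = conjugate_pair_gain(pole, 2.0 * w0)
```

Moving the pole closer to the unit circle (larger r0) narrows the bandwidth and sharpens the spectral peak.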
Spectral Shaping
- Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
- In general, the formant frequencies decrease as the vocal tract length increases:
  - Male speakers tend to have lower formants than female speakers
  - Female speakers tend to have lower formants than children
- Under the vocal tract linearity and time-invariance assumption, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{ h[n] * (g[n] * p[n]) \}$$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment, i.e. its frequency-domain representation, is obtained.
Example 3.2
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} H(\omega)\, G(\omega)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.
Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).
Example 3.2 (Figure 3.11)
Example 3.2
- The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by
  - The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  - The manner in which formant tails add
- Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
Spectral Shaping
- The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
  - A formant corresponds to a vocal tract pole (resonant frequency)
  - Harmonics arise from the periodicity of the glottal source (pitch)
- In developing signal processing algorithms that require formants, the sparsity of spectral information can be a detriment to formant estimation.
- On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
- A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
- To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
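The effect in Example 3.3 can be sketched with a single-resonance vocal tract sampled at the sung fundamental. The two-pole resonator, the pole radius, and the frequencies below are made-up illustrative values, not measurements from the lecture.

```python
import cmath
import math


def resonance_gain(formant_hz, radius, f_hz, fs_hz):
    """Magnitude at f_hz of a conjugate pole pair placed at formant_hz."""
    pole = radius * cmath.exp(1j * 2.0 * math.pi * formant_hz / fs_hz)
    z_inv = cmath.exp(-1j * 2.0 * math.pi * f_hz / fs_hz)
    return 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))


fs = 16000.0   # assumed sampling rate
f0 = 800.0     # sung fundamental, well above a typical F1 (made-up value)

untuned = resonance_gain(400.0, 0.97, f0, fs)  # F1 far below the first harmonic
tuned = resonance_gain(800.0, 0.97, f0, fs)    # F1 raised to match the first harmonic
```

Because the first harmonic is the lowest frequency present in the source, the resonance only contributes to loudness when it sits at or near a harmonic, which is what the tuned case shows.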
Figure 3.12
Nasal Sounds
Spectral Shaping
- The nasal and oral components of the vocal tract are coupled by the velum.
- When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
- The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
- Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
- The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
- Two dominant effects characterize nasalization:
  - Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage
  - Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage
Plosives
Source Generation
- The previous section discussed the effect of vocal tract shape on sound production.
- Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  - Build-up of pressure behind the palate, and
  - Abrupt release of pressure
Source Generation: Plosives, "Drop" (figure annotated with VOT and aspiration)
Fricatives
Source Generation
- Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
- As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  - There is no harmonic structure with these types of inputs
  - The source spectrum is shaped at all frequencies by $|H(\omega)|$
- Note that the spectrum of the noise was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
- There is another class of source generated within the vocal tract; however, it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  - This source arises from the interaction of vortices with vocal tract boundaries, such as the false vocal folds, teeth, or occlusions in the oral tract.
  - A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
- Voiced: speech sounds generated with a periodic glottal source
- Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  - Fricatives: generated from the friction of the moving air against an oral tract constriction. Example: "thin"
  - Plosives: created with an impulsive source within the oral tract. Example: "top"
  - Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"
- However, these source categories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples:
  - Fricatives: "zebra" vs. "sheba"
  - Plosives: "bin" vs. "pin"
Categorization of Sound by Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
- The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
- A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
- In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
- This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
- Example: Hamming window
$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$
- The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
- The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau]\, x[n]$$
represents the windowed speech segments as a function of the window center at time $\tau$.
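The windowing and short-time Fourier transform described above can be sketched in a few lines. This is a direct, unoptimized evaluation on a uniform frequency grid; the function names, the hop size, and the test sinusoid are assumptions made for the illustration, and the window is taken to occupy samples tau through tau + N_w - 1.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2 pi n / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft_frame(x, tau, n_w, n_freq):
    """X(w_k, tau) at w_k = 2 pi k / n_freq, k = 0..n_freq-1, for the frame starting at tau."""
    w = hamming(n_w)
    segment = [w[i] * x[tau + i] for i in range(n_w)]   # x[n, tau] = w[n, tau] x[n]
    return [
        sum(segment[i] * cmath.exp(-1j * 2.0 * math.pi * k * (tau + i) / n_freq)
            for i in range(n_w))
        for k in range(n_freq)
    ]


def spectrogram(x, n_w, hop, n_freq):
    """S(w_k, tau) = |X(w_k, tau)|^2, one row per frame, frames spaced hop samples apart."""
    return [
        [abs(value) ** 2 for value in stft_frame(x, tau, n_w, n_freq)]
        for tau in range(0, len(x) - n_w + 1, hop)
    ]


# Test signal: a sinusoid sitting exactly on frequency bin 8 of a 64-point grid.
x = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
S = spectrogram(x, n_w=64, hop=32, n_freq=64)
peak_bin = max(range(33), key=lambda k: S[0][k])
```

The hop size plays the role of the frame interval: a smaller hop gives more frames per second at higher computational cost.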
Spectrographic Analysis of Speech
- The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
- $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
- For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- This display is illustrated (in caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies
  - Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
- Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{ (p[n] * g[n]) * h[n] \} = w[n, \tau]\,( p[n] * \tilde{h}[n] )$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$.
- From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\tilde{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
- The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
- Uses a "long" window with a duration of typically at least two pitch periods.
- Under the conditions that
  - the main lobes of the shifted window Fourier transforms are non-overlapping, and
  - the corresponding transform side lobes are negligible,
  the following approximation follows from the equation in the previous slide (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": horizontal striations in the time-frequency plane of the spectrogram.
- A long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
- The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  - The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
- Figure 3.15 (next slide) compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
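A back-of-the-envelope check of the two window lengths may be useful. The sketch assumes the usual approximation that a Hamming window of length $N_w$ has a null-to-null main-lobe width of about $8\pi/N_w$ radians per sample, i.e. a half-width of about 2/T Hz for a window of duration T. The "resolved" test (half-width no larger than the harmonic spacing, equivalent to a window of at least two pitch periods) is a rough rule of thumb adopted here, and the 100 Hz pitch is a made-up value.

```python
def hamming_mainlobe_halfwidth_hz(window_ms):
    """Approximate main-lobe half-width (peak to first null) of a Hamming window, in Hz."""
    return 2000.0 / window_ms   # 2 / T with T in seconds


def harmonics_resolved(window_ms, pitch_hz):
    """Rough criterion: neighboring harmonic peaks separate if the half-width <= spacing."""
    return hamming_mainlobe_halfwidth_hz(window_ms) <= pitch_hz


pitch_hz = 100.0                                        # made-up pitch (10-ms period)
narrow_halfwidth = hamming_mainlobe_halfwidth_hz(20.0)  # 20-ms window
wide_halfwidth = hamming_mainlobe_halfwidth_hz(4.0)     # 4-ms window
narrow_resolves = harmonics_resolved(20.0, pitch_hz)
wide_resolves = harmonics_resolved(4.0, pitch_hz)
```

Under these assumptions the 20-ms window spans two pitch periods and just separates the harmonics, while the 4-ms window is shorter than one pitch period and smears them into the spectral envelope.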
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram; axes: Time [s] (0 to 3) vs. Frequency [Hz] (0 to 8000)]
MATLAB
[Figure: normalized-magnitude waveform and spectrogram (Time [s] vs. Frequency [Hz]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word-level and phone-level labels]
Categorization of Speech Sounds
- A sound source can be created with either the vocal folds or a constriction in the vocal tract.
- Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
2. The shape of the vocal tract (place and manner of articulation):
   - The place of the tongue hump along the oral tract, and
   - The degree of constriction of the hump
   - The shape is also determined by a possible connection to the nasal passage by way of the velum
3. The time-domain waveform, which gives the pressure change with time at the lips output
4. The time-varying spectral characteristics revealed through the spectrogram
Elements of a Language
- Phoneme: a fundamental distinctive unit of a language
  - To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme
- Different languages contain different phoneme sets:
  - Syllables contain one or more phonemes
  - Words are formed from one or more syllables
  - Phrases are concatenations of words
- If the first two factors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
- If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
- One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
- This classification is illustrated in Figure 3.17 (next slide).
Figure 3.17
Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
- Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open
  - Tongue position and height: front, central, or back along the palate
  - Constriction: partial or complete
  - Velum state: nasal sound or not a nasal sound
Elements of a Language
- In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
- The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not normally allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
- The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
- Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
- Source: quasi-periodic
  - Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch
- System: each vowel phoneme corresponds to a different vocal tract configuration
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram)
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19)
- In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  - Articulatory differences among speakers are one cause of allophonic variations
  - The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, vary with the speaker; therefore the vocal tract formants also vary with the speaker
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity
  - Because the oral tract is constricted, the sound is radiated at the nostrils
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20)
- Spectrogram:
  - Dominated by the low resonance of the large volume of the nasal cavity
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue
  - These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract
  - Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
- There are two broad classes of fricatives: voiced and unvoiced
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract
  - The constriction is narrower than with vowels
- System: the location of the constriction, made by the tongue (at the back, center, or front of the oral tract) or by the teeth or lips, determines which sound is produced
- Spectrogram: noise-like, with energy concentrated at higher frequencies
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the component of the output at the lips due to the periodic source is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\, u[n])$$
Example 3.4
- In the simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$
- See Exercise 3.10 for special characteristics of x[n].
- Issues that have been ignored:
  - u[n] is modified by the oral cavity
  - $x_q[n]$ can be influenced by the back cavity
  - Sources of nonlinear effects (distributed sources due to traveling vortices)
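The simplified voiced fricative model of Example 3.4 can be sketched directly from its equation. The glottal pulse shape, the impulse responses, and the Gaussian noise generator below are made-up stand-ins; only the structure (periodic branch plus noise modulated by the glottal flow) follows the example.

```python
import random


def convolve(a, b):
    """Plain linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_val in enumerate(a):
        for j, b_val in enumerate(b):
            out[i + j] += a_val * b_val
    return out


def voiced_fricative(g, pitch_period, n_periods, h, hf, seed=0):
    """x[n] = h[n] * u[n] + hf[n] * (q[n] u[n]), with u[n] = g[n] * p[n]."""
    rng = random.Random(seed)
    n_len = pitch_period * n_periods
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_len)]
    u = convolve(g, p)[:n_len]                        # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(n_len)]   # white noise at the constriction
    xg = convolve(h, u)[:n_len]                       # periodic branch
    xq = convolve(hf, [qn * un for qn, un in zip(q, u)])[:n_len]  # modulated noise branch
    return [a + b for a, b in zip(xg, xq)], u


g = [0.0, 0.5, 1.0, 0.5]   # made-up glottal pulse, shorter than the pitch period
h = [1.0, 0.6, 0.2]        # made-up vocal tract impulse response
hf = [1.0, -0.5]           # made-up front-cavity impulse response
x, u = voiced_fricative(g, pitch_period=8, n_periods=3, h=h, hf=hf)
x_again, _ = voiced_fricative(g, pitch_period=8, n_periods=3, h=h, hf=hf)
```

Because the noise is multiplied by u[n], it is switched off wherever the glottal flow is zero, which gives the noise component the same periodicity as the voicing.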
Elements of a Language: Fricatives
- Spectrogram:
  - Unvoiced fricatives are characterized by a "noisy" spectrum
  - Voiced fricatives often show both noise and harmonics
- Waveform:
  - An unvoiced fricative contains only noise
  - A voiced fricative contains noise superimposed on a quasi-periodic signal
Elements of a Language: Whisper
- Forms a class of its own under the general category of consonants
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
- Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
- System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24)
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all 4 steps.
- While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
- After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
- There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (Waveform)
Elements of a Language: Plosives (Spectrogram)
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$.
The glottal flow velocity model for the periodic source component is
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ represent the responses at time $n$ to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
$$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$
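The generalized convolution of Example 3.5 can be tried numerically. The following is a minimal Python sketch rather than anything taken from the slides: the one-pole responses `h` and `hf`, their decay values, and the simplification $g[n] = \delta[n]$ (so that $u[n] = p[n]$) are all illustrative assumptions.

```python
def time_varying_output(u, burst, h, hf):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) burst[n-m], for causal inputs."""
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):  # lags that reach back to sample 0
            acc += h(n, m) * u[n - m] + hf(n, m) * burst[n - m]
        x.append(acc)
    return x

# Hypothetical responses: h(n, m) is the output at time n for a unit sample
# applied m samples earlier. The decay factor of h drifts with n, which is
# what makes the system time-varying.
def h(n, m):
    a = 0.5 + 0.004 * min(n, 100)  # slowly changing "vocal tract"
    return a ** m

def hf(n, m):
    return 0.3 ** m  # "front cavity", kept fixed here for simplicity

N, P = 40, 8
u = [1.0 if n % P == 0 else 0.0 for n in range(N)]  # impulse train, period P
burst = [1.0] + [0.0] * (N - 1)                     # idealized burst delta[n]
x = time_varying_output(u, burst, h, hf)
```

If `h` and `hf` did not depend on `n`, the double loop would reduce to two ordinary convolutions, which is a convenient sanity check on the indexing.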
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.
They do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels
Two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").
Compared to diphthongs, glides have a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.
Figure 3.28: Liquids and glides
Elements of a Language: Transitional Speech Sounds
Affricates
Affricates are the counterpart of diphthongs: they consist of consonant plosive-fricative combinations.
The difference compared to fricatives is that an affricate has a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).
These rules are followed to convey differences in meaning, stress, and emotion.
Figure 3.29: Prosody
Figure 3.30: Global coarticulation
Figure 3.7
Example 3.1
A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
The magnitude of the spectral shaping function (i.e. of the glottal flow), $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
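The inverse relation between pitch period and harmonic spacing in Example 3.1 is easy to check with numbers. This is a small sketch under one assumption the slides do not state: a sampling rate `fs_hz` is used to convert $\omega_k = (2\pi/P)k$ (radians per sample, with $P$ in samples) into $f_k = k \cdot f_s / P$ in hertz.

```python
def harmonic_frequencies_hz(pitch_period_samples, fs_hz, max_hz):
    """Harmonics f_k = k * fs / P up to max_hz, for a pitch period of P samples."""
    f0 = fs_hz / pitch_period_samples  # fundamental = first harmonic
    harmonics = []
    k = 1
    while k * f0 <= max_hz:
        harmonics.append(k * f0)
        k += 1
    return harmonics

fs_hz = 10000                                              # assumed sampling rate
long_period = harmonic_frequencies_hz(100, fs_hz, 1000)    # P = 100 samples
short_period = harmonic_frequencies_hz(50, fs_hz, 1000)    # P = 50 samples
```

Halving the pitch period doubles the spacing between harmonics, so half as many harmonics fall below 1 kHz.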
Anatomy and Physiology of Speech Production
The Fourier transform of the periodic glottal waveform is characterized by harmonics.
Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
The pitch period can vary "randomly" over successive periods: pitch "jitter".
The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
These variations are (perhaps) due to time-varying characteristics of the vocal tract and vocal folds, or to nonlinear behavior in the speech anatomy, or they may appear random while being the result of an underlying deterministic (chaotic) system.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
Voice character is determined by the extent of jitter and shimmer in the voice (e.g. a hoarse voice).
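One way to put numbers on jitter and shimmer is the mean absolute difference between consecutive cycles, divided by the mean value. This definition is an assumption for illustration (it is one common convention, not one given in the slides), and the period and amplitude values below are made up.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]   # hypothetical pitch periods
amplitudes = [1.0, 0.9, 1.1, 1.0, 0.95]  # hypothetical per-cycle peak amplitudes

jitter = relative_perturbation(periods_ms)    # pitch-period perturbation
shimmer = relative_perturbation(amplitudes)   # amplitude perturbation
machine_like = relative_perturbation([8.0] * 5)  # perfectly periodic source
```

A perfectly periodic source gives zero for both measures, which corresponds to the machine-like sound described above.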
Anatomy and Physiology of Speech Production
States of the vocal folds: breathing, voicing, and unvoicing.
Unvoicing: turbulence at the vocal folds produces aspiration. Example: "he"; whispered sounds.
Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Examples of atypical voice types
Vocal Tract
The vocal tract comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections; this is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.
Spectral Shaping
Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
The frequency of the formant is $\omega_0$.
The bandwidth depends on the distance of the pole from the unit circle (i.e. on $r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
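To make the link between a pole and a formant concrete, the sketch below converts a pole $z_0 = r_0 e^{j\omega_0}$ into a formant frequency and an approximate bandwidth in hertz. The sampling rate, the pole values, and the bandwidth formula $B \approx -\ln(r_0)\, f_s / \pi$ (a standard approximation for poles close to the unit circle) are assumptions added here, not taken from the slides.

```python
import cmath
import math

def pole_to_formant(pole, fs_hz):
    """Return (frequency_hz, approx_bandwidth_hz) for a pole z0 = r0 * exp(j*w0)."""
    r0 = abs(pole)
    w0 = cmath.phase(pole)                       # pole angle, radians per sample
    frequency_hz = w0 * fs_hz / (2 * math.pi)
    bandwidth_hz = -math.log(r0) * fs_hz / math.pi  # valid for r0 close to 1
    return frequency_hz, bandwidth_hz

fs_hz = 8000.0                                    # assumed sampling rate
pole = cmath.rect(0.97, 2 * math.pi * 500.0 / fs_hz)  # hypothetical pole
freq_hz, bw_hz = pole_to_formant(pole, fs_hz)    # 500 Hz, bandwidth about 78 Hz
```

Moving the pole toward the unit circle (larger `r0`) narrows the bandwidth and sharpens the resonance; changing its angle moves the formant frequency.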
Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases: male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
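The trend of lower formants for longer vocal tracts can be illustrated with an idealization that is not part of these slides: a lossless uniform tube, closed at the glottis and open at the lips, whose resonances are $F_k = (2k - 1)c / (4L)$. The tube model, the speed of sound of 340 m/s, and the 14 cm comparison length are assumptions for this sketch; only the 17 cm length comes from the text above.

```python
def uniform_tube_formants_hz(length_m, n_formants=3, speed_of_sound=340.0):
    """Resonances of a lossless uniform tube, closed at one end and open at the other."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, n_formants + 1)]

adult_male = uniform_tube_formants_hz(0.17)     # about 500, 1500, 2500 Hz
shorter_tract = uniform_tube_formants_hz(0.14)  # every resonance moves up
```

Real vocal tracts are not uniform, so actual formants deviate from these values, but the inverse dependence on length carries over.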
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is
$$x[n] = h[n] * (g[n] * p[n])$$
A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.
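Example 3.2 (cont.)
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big]$$
$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\circledast$ denotes convolution in frequency, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only a glottal contribution).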
Figure 3.11
Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g. a gradual or abrupt closing) and by the manner in which the formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ sparsely. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency.
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
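The effect in Example 3.3 can be imitated with a single two-pole resonance evaluated at the first harmonic. Everything numerical here is a hypothetical choice for illustration (700 Hz fundamental, first formant at 350 Hz versus 700 Hz, pole radius 0.97, 16 kHz sampling rate); a real vocal tract has several formants and a glottal contribution as well.

```python
import cmath
import math

def resonator_gain(f_hz, formant_hz, fs_hz, r0=0.97):
    """|H| at f_hz for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)), c = r0 e^{j w_formant}."""
    c = cmath.rect(r0, 2 * math.pi * formant_hz / fs_hz)
    z_inv = cmath.exp(-2j * math.pi * f_hz / fs_hz)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

fs_hz = 16000.0
f0_hz = 700.0                                   # first harmonic of the sung tone
gain_low_f1 = resonator_gain(f0_hz, 350.0, fs_hz)   # F1 well below the harmonic
gain_tuned_f1 = resonator_gain(f0_hz, 700.0, fs_hz) # F1 raised to match it
gain_ratio_db = 20 * math.log10(gain_tuned_f1 / gain_low_f1)
```

With the formant moved onto the harmonic, the harmonic samples the peak of the resonance instead of its skirt, which is the louder configuration the example describes.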
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
Two dominant effects characterize nasalization:
Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function), due to absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive): pressure builds up behind the palate and is then abruptly released.
Source Generation: Plosives, "Drop" (figure labels: VOT, aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow), which generates turbulence and thus a noise source (e.g. fricatives).
As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e. impulse or noise source).
There is no harmonic structure with these types of input; the source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise spectrum was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source generated within the vocal tract; it is less understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives: generated from the friction of moving air against an oral tract constriction. Example: "thin".
Plosives: created with an impulsive source within the oral tract. Example: "top".
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
However, these sound categories do not relate exclusively to an unvoiced source: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.
The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example, the Hamming window:
$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n - \tau \le N_w - 1$$
The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
$S(\omega, \tau)$ is a two-dimensional representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$ separately. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrogram:
Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.
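The windowing, short-time Fourier transform, and squared-magnitude steps described above can be sketched directly. This is a deliberately slow, standard-library-only illustration, not the method used to produce the figures in these slides; the function names, the test sinusoid, the frame interval `hop`, and the frequency grid $\omega_k = \pi k / N$ are assumptions made for the example. The window is taken to cover samples $\tau$ through $\tau + N_w - 1$, matching the Hamming formula.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2 pi i / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_w - 1)) for i in range(n_w)]

def stft_value(x, tau, n_w, omega):
    """X(omega, tau) for a Hamming window covering samples tau .. tau + n_w - 1."""
    w = hamming(n_w)
    total = 0j
    for i in range(n_w):
        n = tau + i
        if 0 <= n < len(x):
            total += w[i] * x[n] * cmath.exp(-1j * omega * n)
    return total

def spectrogram(x, n_w, hop, n_freqs):
    """S[frame][k] = |X(omega_k, tau)|^2 with omega_k = pi * k / n_freqs."""
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):  # window advances by hop samples
        frames.append([abs(stft_value(x, tau, n_w, math.pi * k / n_freqs)) ** 2
                       for k in range(n_freqs)])
    return frames

omega0 = 2 * math.pi * 0.125                        # test sinusoid at pi/4 rad/sample
x = [math.cos(omega0 * n) for n in range(256)]
S = spectrogram(x, n_w=64, hop=32, n_freqs=16)
peak_bin = max(range(16), key=lambda k: S[0][k])    # bin with the most energy
```

For the steady sinusoid every frame peaks at the same frequency bin; for speech, the peak pattern changes from frame to frame, which is what the spectrogram display reveals.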
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \tilde{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response have been combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\tilde{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and the wideband spectrogram lies in the length of the analysis window $w[n, \tau]$.
Narrowband spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window is shorter than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
The wideband spectrogram shows the formants of the vocal tract in frequency.
It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
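A rough way to see why a 20-ms window behaves as narrowband and a 4-ms window as wideband is to compare the width of the window's main lobe with the harmonic spacing. The sketch below assumes the standard null-to-null main-lobe width of a Hamming window, about $4/T$ Hz for a window of duration $T$ seconds, and a hypothetical 220 Hz pitch; neither number is taken from the slides, and the non-overlap test is a conservative criterion.

```python
def hamming_mainlobe_width_hz(window_ms):
    """Approximate null-to-null main-lobe width of a Hamming window, 4 / T."""
    return 4000.0 / window_ms

def resolves_harmonics(window_ms, f0_hz):
    """True if main lobes centered on adjacent harmonics do not overlap."""
    return hamming_mainlobe_width_hz(window_ms) <= f0_hz

f0_hz = 220.0                        # hypothetical pitch; period about 4.5 ms
pitch_period_ms = 1000.0 / f0_hz

narrowband = resolves_harmonics(20.0, f0_hz)   # 200 Hz main lobe: harmonics separate
wideband = resolves_harmonics(4.0, f0_hz)      # 1000 Hz main lobe: harmonics merge
```

The 4-ms window is also shorter than the assumed pitch period, so it can follow the energy fluctuation within each period, which is the source of the vertical striations.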
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram; frequency 0-8000 Hz versus time 0-3 s]
[Figure: waveform (normalized magnitude) and spectrogram (frequency 0-8000 Hz) versus time 0-3 s for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets. Syllables contain one or more phonemes; words are formed from one or more syllables; phrases are concatenations of words.
If the first two descriptors are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors are used to study speech sounds, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, on the slide after next).
In spite of the specific properties of different vowels, there is much variability in vowel characteristics among speakers.
Articulatory differences between speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump and the cross-section and length of the vocal tract differ, and therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract; the constriction is narrower than with vowels.
System: the location of the constriction made by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram: noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
A simplified model of the voiced-fricative output at the lips due to the periodic source is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is the turbulent airflow velocity at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\, u[n])$$
Example 3.4 (cont.)
In the simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
$u[n]$ is modified by the oral cavity.
$x_q[n]$ can be influenced by the back cavity.
Sources of nonlinear effects (distributed sources due to traveling vortices).
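The two-source model of Example 3.4 can be simulated in a few lines. The sketch below is illustrative only: the glottal pulse `g`, the impulse responses `h` and `hf`, the pitch period, and the fixed random seed are all hypothetical choices, and `random.gauss` stands in for the white noise $q[n]$.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

random.seed(0)                                    # fixed seed for repeatability
P, n_periods = 20, 5
g = [0.0, 0.5, 1.0, 0.5]                          # hypothetical glottal pulse
p = [1.0 if n % P == 0 else 0.0 for n in range(P * n_periods)]
u = convolve(g, p)[:len(p)]                       # periodic glottal flow u = g * p

q = [random.gauss(0.0, 1.0) for _ in u]           # white noise at the constriction
h = [0.9 ** n for n in range(30)]                 # hypothetical vocal tract response
hf = [(-0.5) ** n for n in range(10)]             # hypothetical front-cavity response

x_g = convolve(h, u)[:len(u)]                             # periodic component
x_q = convolve(hf, [qn * un for qn, un in zip(q, u)])[:len(u)]  # gated noise component
x = [a + b for a, b in zip(x_g, x_q)]                     # the two outputs add
```

Because the noise is multiplied by `u`, it is present only while the glottal flow is nonzero, so the noise component is pulsed at the pitch rate rather than being continuous.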
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24: Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is
u[n] = g[n] * p[n]
where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m]. Here h[n, m] and h_f[n, m] are the responses at time n to a unit sample applied m samples earlier, at time n - m.
Assuming that the two outputs can be linearly combined, the output can be written with a generalization of the convolution operator:
\[ x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m] \]
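The generalized convolution in Example 3.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-pole responses `h_tract` and `h_front`, the pole trajectory, and the pitch period are hypothetical, the glottal pulse g[n] is idealized as a unit sample, and the sums are truncated to causal, finite-length signals.

```python
def time_varying_filter(u, h):
    """Compute y[n] = sum_m h(n, m) * u[n - m] for a causal input u.

    h(n, m) is the response at time n to a unit sample applied
    m samples earlier (at time n - m).
    """
    y = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):          # u[n - m] is zero for m > n
            acc += h(n, m) * u[n - m]
        y.append(acc)
    return y


def h_tract(n, m):
    # Hypothetical vocal tract: a one-pole decay whose pole drifts from
    # 0.5 to 0.9 over the first 40 samples (tract opening into the vowel).
    pole = 0.5 + 0.4 * min(n, 40) / 40.0
    return pole ** m


def h_front(n, m):
    # Hypothetical front cavity: a fixed, fast one-pole decay.
    return 0.3 ** m


N = 60
P = 20                                                  # assumed pitch period
u = [1.0 if n % P == 0 else 0.0 for n in range(N)]      # periodic source
burst = [1.0] + [0.0] * (N - 1)                         # impulse at n = 0

x = [a + b for a, b in zip(time_varying_filter(u, h_tract),
                           time_varying_filter(burst, h_front))]
```

When `h(n, m)` does not depend on `n`, the function reduces to ordinary convolution, which is a quick way to check it.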

Elements of a Language: Transitional Speech Sounds
Diphthongs
- Vowel-like in nature, with vibrating vocal folds.
- Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
- Characterized by movement from one vowel target to another.
- Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds
Semi-vowels
- Two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").
- Compared to diphthongs, glides have a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates
- Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
- The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
- Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced).

Coarticulation
- The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
- Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
- Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
- Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  - Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  - Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
- Intonation (change in pitch)
- Amplitude/energy (loudness)
- Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Example 3.1
- A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k.
- The first harmonic is also the fundamental frequency.
- At each harmonic frequency there is a translated window Fourier transform W(ω - ω_k) weighted by G(ω_k).
- The magnitude of the spectral shaping function, i.e. of the glottal flow transform |G(ω)|, is referred to as the spectral envelope of the harmonics.
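The relation ω_k = (2π/P)k can be checked numerically. The sketch below assumes the pitch period P is given in samples and converts the harmonic frequencies to Hz using a sampling rate; the function name and the 10 kHz rate are illustrative choices, not values from the slides.

```python
import math


def harmonic_frequencies(pitch_period, sample_rate, count):
    """Return the first `count` harmonics in Hz for a pitch period in samples.

    omega_k = (2*pi/P)*k radians per sample corresponds to k * fs / P Hz.
    """
    return [k * sample_rate / pitch_period for k in range(1, count + 1)]


def harmonic_radian_frequencies(pitch_period, count):
    """Return omega_k = (2*pi/P)*k in radians per sample."""
    return [2.0 * math.pi * k / pitch_period for k in range(1, count + 1)]


fs = 10000                      # assumed sampling rate in Hz
for P in (100, 50):             # halving the pitch period ...
    # ... doubles the spacing between harmonics
    print(P, harmonic_frequencies(P, fs, 3))
```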

Anatomy and Physiology of Speech Production
- The Fourier transform of the periodic glottal waveform is characterized by harmonics.
- Typically the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
- The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:
  - The pitch period can vary "randomly" over successive periods: pitch "jitter".
  - The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
- These variations are (perhaps) due to time-varying characteristics of the vocal tract and vocal folds, or to nonlinear behavior in the speech anatomy, or they appear random while being the result of an underlying deterministic (chaotic) system.
- Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
- Voice character is determined by the extent of jitter and shimmer in the voice (e.g. a hoarse voice).
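Jitter and shimmer can be quantified once per-cycle pitch periods and amplitudes are available. The sketch below uses one common relative measure (mean absolute difference between consecutive cycles divided by the mean); this particular definition and the sample values are assumptions for illustration and are not taken from the slides.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical measurements from consecutive glottal cycles
periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]          # pitch periods
peak_amplitudes = [1.00, 0.95, 1.02, 0.97, 1.01]  # peak airflow per cycle

jitter = relative_perturbation(periods_ms)        # period variation
shimmer = relative_perturbation(peak_amplitudes)  # amplitude variation
print(f"jitter = {jitter:.3%}, shimmer = {shimmer:.3%}")
```

A perfectly periodic source gives zero for both measures, which corresponds to the machine-like sound mentioned above.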

Anatomy and Physiology of Speech Production
States of the Vocal Folds
- Three states: breathing, voicing, and unvoicing.
- Unvoicing: turbulence at the vocal folds produces aspiration. Examples: "he", whispered sounds.
- Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement
- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of the coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract
- Comprises the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
- The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
- The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
- The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

Spectral Shaping
- Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
- The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
- Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping
- The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
- Consider a time-invariant all-pole linear system model of the vocal tract with a pole at z_0 = r_0 e^{jω_0} that corresponds approximately to a vocal tract formant:
  - The frequency of the formant is ω_0.
  - The bandwidth depends on the distance of the pole from the unit circle (r_0).
- Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form, with c_k and c_k^* a complex-conjugate pole pair:
\[ H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} \]
\[ H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} \]
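The link between a pole z_0 = r_0 e^{jω_0} and a formant can be made concrete with a short sketch. The frequency follows directly from the pole angle; for the bandwidth the code uses the common approximation B ≈ -ln(r_0) f_s / π, which is an assumption added here and holds best for poles close to the unit circle. Function names and the 8 kHz sampling rate are illustrative.

```python
import cmath
import math


def pole_to_formant(pole, sample_rate):
    """Return (frequency_hz, approximate_bandwidth_hz) for one pole."""
    radius = abs(pole)
    if not 0.0 < radius < 1.0:
        raise ValueError("pole must lie strictly inside the unit circle")
    frequency_hz = abs(cmath.phase(pole)) * sample_rate / (2.0 * math.pi)
    bandwidth_hz = -math.log(radius) * sample_rate / math.pi
    return frequency_hz, bandwidth_hz


def formant_to_pole(frequency_hz, bandwidth_hz, sample_rate):
    """Inverse mapping under the same bandwidth approximation."""
    radius = math.exp(-math.pi * bandwidth_hz / sample_rate)
    angle = 2.0 * math.pi * frequency_hz / sample_rate
    return cmath.rect(radius, angle)


fs = 8000.0                                   # assumed sampling rate in Hz
pole = formant_to_pole(500.0, 50.0, fs)       # hypothetical first formant
print(abs(pole), pole_to_formant(pole, fs))
```

Moving the pole closer to the unit circle (larger r_0) narrows the bandwidth and sharpens the resonance peak.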

Spectral Shaping
- The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
- In general, the formant frequencies decrease as the vocal tract length increases:
  - Male speakers tend to have lower formants than female speakers.
  - Female speakers have lower formants than children.
- Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
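The convolution statement above can be sketched directly in the time domain. The glottal pulse `g`, the vocal tract response `h`, and the pitch period below are short hypothetical sequences chosen only to keep the example readable; they are not measured data.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


pitch_period = 8
p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(32)]  # impulse train
g = [0.0, 0.5, 1.0, 0.5]          # hypothetical glottal flow over one cycle
h = [1.0, 0.6, 0.36, 0.216]       # hypothetical vocal tract impulse response

u = convolve(g, p)                # glottal flow input u[n] = g[n] * p[n]
x = convolve(h, u)                # speech waveform x[n] = h[n] * u[n]

# Convolution is associative, so combining g and h first gives the same output.
x_alt = convolve(convolve(h, g), p)
```

Because the combined pulse `h * g` here is shorter than the pitch period, the output is simply that pulse repeated every `pitch_period` samples.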

Vowels

Example 3.2
Consider a periodic glottal flow source of the form
u[n] = g[n] * p[n]
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is
x[n] = h[n] * (g[n] * p[n])
A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment
x[n, τ] = w[n, τ]{h[n] * (g[n] * p[n])}
Using the multiplication and convolution theorems, the Fourier transform of the speech segment, i.e. its frequency-domain representation, is obtained.
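
Example 3.2 (cont.)
\[ X(\omega,\tau) = \frac{1}{P}\,W(\omega,\tau) \circledast \left[ H(\omega)G(\omega)\sum_{k=-\infty}^{\infty}\delta(\omega-\omega_k) \right] = \frac{1}{P}\sum_{k=-\infty}^{\infty} H(\omega_k)\,G(\omega_k)\,W(\omega-\omega_k,\tau) \]
where W(ω, τ) is the Fourier transform of w[n, τ], ⊛ denotes convolution in frequency, ω_k = (2π/P)k, and 2π/P is the fundamental frequency, or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, which consists of glottal and vocal tract contributions (unlike Example 3.1, where it consists only of the glottal contribution).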
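The harmonic structure predicted by Example 3.2 can be observed numerically. The sketch below makes several simplifying assumptions that are not in the slides: the glottal pulse is idealized as a unit sample (so G(ω) = 1), the vocal tract is a single two-pole resonance, and the sampling rate, pitch, formant, and bandwidth values are hypothetical.

```python
import cmath
import math


def resonator_output(excitation, formant_hz, bandwidth_hz, sample_rate):
    """Filter the excitation with a two-pole resonator (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    w0 = 2.0 * math.pi * formant_hz / sample_rate
    a1, a2 = 2.0 * r * math.cos(w0), -r * r
    y = []
    for n, value in enumerate(excitation):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(value + a1 * y1 + a2 * y2)
    return y


def hamming(length):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def dtft_magnitude(segment, freq_hz, sample_rate):
    """Magnitude of the Fourier transform of a finite segment at one frequency."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    return abs(sum(s * cmath.exp(-1j * w * n) for n, s in enumerate(segment)))


fs, P = 8000, 80                                   # 100 Hz pitch (assumed)
u = [1.0 if n % P == 0 else 0.0 for n in range(800)]
x = resonator_output(u, 500.0, 100.0, fs)

# Window five pitch periods taken after the start-up transient has decayed.
segment = [w * s for w, s in zip(hamming(400), x[400:])]

f0 = fs / P
at_harmonic = dtft_magnitude(segment, 5 * f0, fs)      # harmonic at the formant
between = dtft_magnitude(segment, 5.5 * f0, fs)        # midway between harmonics
far_harmonic = dtft_magnitude(segment, 15 * f0, fs)    # harmonic far from formant
print(at_harmonic, between, far_harmonic)
```

The magnitude is large at a harmonic and small between harmonics, and the harmonic near the formant is stronger than one far from it, which is the envelope sampling described above.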

Example 3.2 (cont.)

Example 3.2 (cont.)
- The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  - the nature of the glottal flow waveform over a cycle, e.g. a gradual or abrupt closing, and
  - the manner in which formant tails add.
- Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because the spectral envelope |H(ω)G(ω)| is only sparsely sampled by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping
- The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
  - A formant corresponds to a vocal tract pole (resonant frequency).
  - Harmonics arise from the periodicity of the glottal source (pitch).
- In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
- On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3
- A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
- To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
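The effect of moving F1 toward the first harmonic can be sketched with a single two-pole resonance. All numbers below (800 Hz fundamental, 500 Hz original F1, 100 Hz bandwidth, 16 kHz sampling rate) are hypothetical, and a real vocal tract has several formants, so this only illustrates the trend.

```python
import cmath
import math


def resonance_gain(freq_hz, formant_hz, bandwidth_hz, sample_rate):
    """Magnitude response of a two-pole resonator evaluated at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    w0 = 2.0 * math.pi * formant_hz / sample_rate
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate)
    denominator = 1.0 - 2.0 * r * math.cos(w0) * z_inv + r * r * z_inv ** 2
    return 1.0 / abs(denominator)


fs = 16000.0
fundamental = 800.0                      # assumed soprano first harmonic

untuned = resonance_gain(fundamental, 500.0, 100.0, fs)        # F1 below the harmonic
tuned = resonance_gain(fundamental, fundamental, 100.0, fs)    # F1 raised to match

print(f"gain at first harmonic: untuned {untuned:.1f}, tuned {tuned:.1f}")
```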

Figure 3.12

Nasal Sounds

Spectral Shaping
- The nasal and oral components of the vocal tract are coupled by the velum.
- When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
- The resulting sounds have a spectrum dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping
- Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
- The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
- Two dominant effects characterize nasalization:
  - Broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage.
  - Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
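The second effect, an anti-resonance, can be illustrated by adding a conjugate pair of zeros to a one-formant all-pole model. This sketch covers only the zero, not the bandwidth broadening; the formant and anti-resonance frequencies, the root radius, and the sampling rate are hypothetical values.

```python
import cmath
import math


def second_order_term(freq_hz, center_hz, radius, sample_rate):
    """|1 - 2 r cos(w_c) e^{-jw} + r^2 e^{-j2w}| for a conjugate root pair."""
    w_c = 2.0 * math.pi * center_hz / sample_rate
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate)
    return abs(1.0 - 2.0 * radius * math.cos(w_c) * z_inv + radius ** 2 * z_inv ** 2)


def vocal_tract_gain(freq_hz, formant_hz, sample_rate,
                     antiresonance_hz=None, radius=0.97):
    """Gain of one formant (pole pair), optionally with one zero pair."""
    gain = 1.0 / second_order_term(freq_hz, formant_hz, radius, sample_rate)
    if antiresonance_hz is not None:
        gain *= second_order_term(freq_hz, antiresonance_hz, radius, sample_rate)
    return gain


fs = 8000.0
oral = vocal_tract_gain(1000.0, 500.0, fs)
nasalized = vocal_tract_gain(1000.0, 500.0, fs, antiresonance_hz=1000.0)
print(f"gain at 1000 Hz: oral {oral:.3f}, nasalized {nasalized:.3f}")
```

The zero pair produces a dip in the magnitude response near its own frequency, which is how an anti-resonance appears in the spectrum.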

Plosives

Source Generation
- The previous section discussed the effect of the vocal tract shape on sound production.
- Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  - build-up of pressure behind the palate, and
  - abrupt release of the pressure.

Source Generation: Plosives - "Drop"
[Figure labels: VOT, Aspiration]

Fricatives

Source Generation
- Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded). It is used to generate turbulence and thus a noise source (e.g. fricatives).
- As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e. impulse or noise source):
  - There is no harmonic structure with these types of inputs.
  - The source spectrum is shaped at all frequencies by |H(ω)|.
- Note that the noise source was idealized as having a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives - "NASA"

Source Generation
- There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
- This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source
- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  - Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  - Plosives: created with an impulsive source within the oral tract. Example: "top".
  - Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
- However, these sources do not occur exclusively in unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).

Categorization of Sound by Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
- The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
- Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
- Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
- This window w[n, τ] is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
- Example: the Hamming window
  w[n, τ] = 0.54 - 0.46 cos[2π(n - τ)/(N_w - 1)] for 0 ≤ n - τ ≤ N_w - 1
- The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
- The short-time Fourier transform is
\[ X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\,e^{-j\omega n} \]
  where x[n, τ] = w[n, τ] x[n] represents the windowed speech segments as a function of the window position τ.
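The windowing and transform steps can be sketched with the standard library alone. This is a minimal illustration under assumptions: the function names, window length, frame interval, and test signal are hypothetical, the transform is evaluated on a uniform grid of `num_bins` frequencies, and each frame is indexed from its own start, which changes only the phase and not the magnitude.

```python
import cmath
import math


def hamming(length):
    """Hamming window: 0.54 - 0.46 cos(2*pi*n / (Nw - 1)), n = 0..Nw-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def stft(signal, window_length, frame_interval, num_bins):
    """Return one spectrum (list of complex values) per window position."""
    window = hamming(window_length)
    frames = []
    for start in range(0, len(signal) - window_length + 1, frame_interval):
        segment = [w * s for w, s in zip(window, signal[start:start + window_length])]
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / num_bins)
                for n in range(window_length))
            for k in range(num_bins)
        ]
        frames.append(spectrum)
    return frames


# Test signal: a cosine with exactly 4 cycles per 32 samples.
signal = [math.cos(2.0 * math.pi * 4 * n / 32) for n in range(64)]
frames = stft(signal, window_length=32, frame_interval=16, num_bins=32)

magnitudes = [abs(value) for value in frames[0][:17]]   # bins 0..16
peak_bin = magnitudes.index(max(magnitudes))
```

The frame interval of 16 samples here means consecutive windows overlap by half; a smaller interval gives a higher frame rate at a higher computational cost.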

Spectrographic Analysis of Speech
- The spectrogram is graphically displayed as
  S(ω, τ) = |X(ω, τ)|^2
- S(ω, τ) is a two-dimensional representation of the "energy density" of the signal.
- For each window position τ one could plot S(ω, τ). A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] whose input is the glottal flow, given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = Σ_k δ[n - kP]:
x[n, τ] = w[n, τ]{(p[n] * g[n]) * h[n]} = w[n, τ](p[n] * ĥ[n])
where the glottal waveform over a cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
\[ S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty}\hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2 \]
where \hat{H}(ω) = H(ω)G(ω), ω_k = (2π/P)k, and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech
- The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n, τ].
- Narrowband spectrogram:
  - Uses a "long" window with a duration of typically at least two pitch periods.
  - Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):
\[ S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty}\left|\hat{H}(\omega_k)\right|^2\left|W(\omega-\omega_k,\tau)\right|^2 \]

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
- The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window length is less than a pitch period, the window essentially "sees" pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
\[ S(\omega,\tau) \approx \beta\,\left|\hat{H}(\omega)\right|^2 E[\tau] \]
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:
\[ E[\tau] = \sum_{n=-\infty}^{\infty} \left|x[n,\tau]\right|^2 \]
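The sliding-window energy can be computed with a short sketch. The function name, the rectangular window, and the toy pulse train are assumptions chosen to keep the numbers easy to follow.

```python
def sliding_energy(signal, window, frame_interval):
    """Energy of the windowed segment at each window position.

    For a window starting at `start`, the energy is the sum over the
    window of (w[n] * x[start + n]) ** 2.
    """
    energies = []
    for start in range(0, len(signal) - len(window) + 1, frame_interval):
        segment = signal[start:start + len(window)]
        energies.append(sum((w * s) ** 2 for w, s in zip(window, segment)))
    return energies


# Toy "pitch pulses" every 4 samples and a 2-sample rectangular window,
# i.e. a window shorter than the pitch period.
pulses = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
energy = sliding_energy(pulses, [1.0, 1.0], frame_interval=1)
print(energy)
```

With a window shorter than the pitch period the energy rises whenever the window covers a pulse and falls in between, so it fluctuates at the pitch rate.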

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
- The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
- Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
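The trade-off between the 20-ms and 4-ms windows can be put in numbers. The sketch below assumes a 10 kHz sampling rate (not stated in the slides) and uses the standard null-to-null main-lobe width of a Hamming window, 8π/N_w radians per sample, i.e. 4 f_s / N_w in Hz.

```python
def window_summary(duration_ms, sample_rate):
    """Return (length in samples, Hamming main-lobe width in Hz)."""
    length = round(duration_ms * sample_rate / 1000.0)
    main_lobe_hz = 4.0 * sample_rate / length   # null-to-null width
    return length, main_lobe_hz


fs = 10000                                  # assumed sampling rate in Hz
for label, duration_ms in (("narrowband", 20), ("wideband", 4)):
    length, width = window_summary(duration_ms, fs)
    print(f"{label}: {length} samples, main lobe about {width:.0f} Hz wide")
```

The longer window has the narrower main lobe (better frequency resolution) but spans more of the signal in time (poorer time resolution), and the shorter window has the opposite properties.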

Figure 3.15

Figure 3.16

MATLAB
[Figure: SPECTROGRAM of an utterance. Axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000).]
[Figure: waveform (Normalized Magnitude vs. Time [s]) above its SPECTROGRAM (Frequency [Hz] vs. Time [s]) for the sentence "She had your dark suit in greasy wash water all year", annotated with word-level and phone-level labels.]

Categorization of Speech Sounds
- A sound source can be created with either the vocal folds or a constriction in the vocal tract.
- Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
- Phoneme: a fundamental distinctive unit of a language.
- To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
- Different languages contain different phoneme sets.
  - Syllables contain one or more phonemes.
  - Words are formed from one or more syllables.
  - Phrases are concatenations of words.
- If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
- If the last two descriptors (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language
- One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
- This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
- Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open.
  - Tongue position and height: front, central, or back along the palate.
  - Constriction: partial or complete.
  - Velum state: nasal sound or not a nasal sound.

Elements of a Language
- In English, the combinations of features yield 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
- The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
- The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
- Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels
- Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
- In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations:
  - the place and degree of constriction of the tongue hump, and
  - the cross-section and length of the vocal tract.
  The vocal tract formants will therefore vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram:
  - Dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  - Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
- There are two broad classes of fricatives: voiced and unvoiced.
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
- System: the location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
- Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
u[n] = g[n] * p[n]
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
A simplified model of the voiced fricative output at the lips due to the periodic signal u[n] is
x_g[n] = h[n] * (g[n] * p[n])
where h[n] is the impulse response of a linear time-invariant vocal tract.
The noise source component of the turbulent airflow velocity at the constriction is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
x_q[n] = h_f[n] * (q[n] u[n])

Example 3.4 (cont.)
- In the simplified model we assume that the results of the two airflow sources add:
  x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])
- See Exercise 3.10 for special characteristics of x[n].
- Issues that have been ignored:
  - u[n] is modified by the oral cavity.
  - x_q[n] can be influenced by the back cavity.
  - Sources of nonlinear effects (distributed sources due to traveling vortices).
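The additive model of Example 3.4 can be sketched as follows. Everything numeric here is hypothetical: the glottal pulse, the two impulse responses, the pitch period, and the Gaussian noise with a fixed seed standing in for the white noise q[n].

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(h, h_f, g, pitch_period, length, noise_scale, seed=0):
    """Return (x, x_g, x_q) with x = h*u + h_f*(q.u), truncated to `length`."""
    rng = random.Random(seed)
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]
    u = convolve(g, p)[:length]                       # periodic glottal flow
    q = [noise_scale * rng.gauss(0.0, 1.0) for _ in range(length)]
    modulated = [qn * un for qn, un in zip(q, u)]     # noise modulated by u[n]
    x_g = convolve(h, u)[:length]                     # periodic component
    x_q = convolve(h_f, modulated)[:length]           # noise component
    x = [a + b for a, b in zip(x_g, x_q)]
    return x, x_g, x_q


g = [0.0, 0.5, 1.0, 0.5]            # hypothetical glottal flow over one cycle
h = [1.0, 0.7, 0.49, 0.343]         # hypothetical whole vocal tract response
h_f = [1.0, -0.5]                   # hypothetical front-cavity response

x, x_g, x_q = voiced_fricative(h, h_f, g, pitch_period=10, length=50,
                               noise_scale=0.2)
```

Because the noise is multiplied by u[n], the noise component is present only while the glottal flow is non-zero, so its energy pulses at the pitch rate.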

Elements of a Language: Fricatives
- Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
- Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
- Whisper forms a class of its own under the general category of consonants.
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
- Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
- As with fricatives, plosives can be voiced or unvoiced.
- System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
- Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
- During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
- After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
- There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept
introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]
h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m
The output then can be written using generalization of the convolution operator
We have assumed that two outputs can be linearly combined
mnmnhmnumnhnxm
fm
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: "hide" /Y/, "out" /W/, "boy" /O/, "new" /JU/.
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds, glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").
Glides: compared to diphthongs, there is greater constriction of the oral tract during the transition and greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Anatomy and Physiology of Speech Production
The Fourier transform of the periodic glottal waveform is characterized by harmonics.
Typically the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:
The pitch period can vary "randomly" over successive periods (pitch "jitter").
The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods (amplitude "shimmer").
These variations are (perhaps) due to time-varying characteristics of the vocal tract and vocal folds, or to nonlinear behavior in the speech anatomy, or they appear random while being the result of an underlying deterministic (chaotic) system.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
Voice character is determined by the extent of jitter and shimmer in the voice (e.g., hoarse voice).
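To make jitter and shimmer concrete, the following sketch builds an impulse train whose pitch period and pulse amplitude are perturbed from one cycle to the next. The uniform perturbation model, the default 2% jitter and 5% shimmer, and the function name `jittered_pulse_train` are illustrative assumptions; the slides do not specify how the variations are distributed.

```python
import random


def jittered_pulse_train(n_pulses, period, jitter=0.02, shimmer=0.05, seed=0):
    """Impulse train with per-cycle pitch jitter and amplitude shimmer.

    period  : nominal pitch period in samples
    jitter  : maximum relative deviation of each period (assumed uniform)
    shimmer : maximum relative deviation of each amplitude (assumed uniform)
    """
    rng = random.Random(seed)
    positions, amplitudes = [], []
    t = 0.0
    for _ in range(n_pulses):
        positions.append(round(t))
        amplitudes.append(1.0 + rng.uniform(-shimmer, shimmer))
        t += period * (1.0 + rng.uniform(-jitter, jitter))
    signal = [0.0] * (positions[-1] + 1)
    for pos, amp in zip(positions, amplitudes):
        signal[pos] = amp
    return signal, positions, amplitudes


# Setting jitter and shimmer to zero gives the ideal ("machine-like") train.
ideal, ideal_pos, ideal_amp = jittered_pulse_train(5, 80, jitter=0.0, shimmer=0.0)
natural, natural_pos, natural_amp = jittered_pulse_train(50, 80)
```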
Anatomy and Physiology of Speech Production: States of Vocal Folds
Three states: breathing, voicing, and unvoicing.
Unvoicing: turbulence at the vocal folds produces aspiration. Example: "he"; whispered sounds.
Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part remains nearly fixed.
Anatomy and Physiology of Speech Production: Other forms of atypical vocal fold movement
Creaky voice: very tense vocal folds, with only a short portion of the folds oscillating, resulting in a voice with high and irregular pitch.
Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Anatomy and Physiology of Speech Production
Examples of atypical voice types
Vocal Tract
The vocal tract comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections; this is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.
Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
The frequency of the formant is $\omega_0$.
The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
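The relation between a pole $z_0 = r_0 e^{j\omega_0}$ and the formant it represents can be sketched numerically. In the example below the 8 kHz sampling rate and the pole values are assumed for illustration, and the bandwidth formula B ≈ -ln(r_0) f_s / π is a common approximation for poles near the unit circle rather than something stated on the slide, which only says that bandwidth depends on r_0.

```python
import cmath
import math


def pole_to_formant(pole, fs):
    """Return (formant frequency in Hz, approximate bandwidth in Hz) for one pole."""
    radius, angle = abs(pole), cmath.phase(pole)
    frequency_hz = angle * fs / (2 * math.pi)
    bandwidth_hz = -math.log(radius) * fs / math.pi  # approximation, assumes radius close to 1
    return frequency_hz, bandwidth_hz


def all_pole_magnitude(poles, omega, gain=1.0):
    """|H(e^{j omega})| for the product form, one conjugate pair per entry of `poles`."""
    z_inv = cmath.exp(-1j * omega)
    denominator = 1.0
    for c in poles:
        denominator *= (1 - c * z_inv) * (1 - c.conjugate() * z_inv)
    return abs(gain / denominator)


fs = 8000.0                                     # assumed sampling rate
omega_0 = 2 * math.pi * 500.0 / fs              # pole angle for a 500 Hz formant
pole = 0.95 * cmath.exp(1j * omega_0)

formant_hz, bandwidth_hz = pole_to_formant(pole, fs)
peak = all_pole_magnitude([pole], omega_0)       # response at the formant
off_peak = all_pole_magnitude([pole], 2 * omega_0)
```

Moving the pole closer to the unit circle (larger r_0) narrows the bandwidth and raises the spectral peak at the formant.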
Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases: male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the airflow over one glottal cycle, p[n] is the unit sample train with spacing P, and * denotes convolution. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained:
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of w[n, τ], $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency (pitch).
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).
Example 3.2 (Figure 3.11)
Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)| because of sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be detrimental to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
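The effect described in Examples 3.2 and 3.3, harmonics sampling the spectral envelope, can be illustrated with a single-resonance envelope. Everything numeric below (a 500 Hz first formant with 80 Hz bandwidth, a 16 kHz sampling rate, pitches of 100 Hz and 700 Hz) is an assumed toy setting, and the one-pole-pair envelope stands in for the full |H(ω)G(ω)| of the slides.

```python
import cmath
import math


def resonance_gain(freq_hz, formant_hz, bandwidth_hz, fs):
    """Magnitude response at freq_hz of one conjugate pole pair (one formant)."""
    radius = math.exp(-math.pi * bandwidth_hz / fs)
    pole = radius * cmath.exp(2j * math.pi * formant_hz / fs)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))


def harmonic_gains(f0_hz, formant_hz, bandwidth_hz=80.0, fs=16000.0, n_harmonics=5):
    """Envelope sampled at the first few harmonics k * f0: list of (frequency, gain)."""
    return [(k * f0_hz, resonance_gain(k * f0_hz, formant_hz, bandwidth_hz, fs))
            for k in range(1, n_harmonics + 1)]


low_pitch = harmonic_gains(100.0, 500.0)      # dense sampling: a harmonic lands on F1
soprano = harmonic_gains(700.0, 500.0)        # sparse sampling: F1 falls below the first harmonic
soprano_tuned = harmonic_gains(700.0, 700.0)  # F1 raised to match the first harmonic
```

Comparing the first entry of `soprano` with that of `soprano_tuned` shows the gain in loudness obtained when the first formant is moved onto the first harmonic.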
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum dominated by the low-frequency formants of the large-volume nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive): pressure builds up behind the palate and is then abruptly released.
Source Generation: Plosives, "Drop" (figure labels: VOT, Aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by |H(ω)|.
Note that the noise source was idealized as having a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract, although it is less understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
Plosives: created with an impulsive source within the oral tract. Example: "top".
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these subcategories are not exclusively unvoiced: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).
Categorization of Sound by Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window,
$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n \le N_w - 1$$
The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where x[n, τ] = w[n, τ] x[n] represents the windowed speech segments as a function of the window center at time τ.
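A minimal sketch of the windowing and short-time Fourier transform described above, written with the Python standard library only. Here `tau` marks the start of the window rather than its center, the frequency grid ω_k = πk / n_freqs and the test signal (a cosine at ω = π/4) are assumptions chosen for the example, and the squared magnitude |X(ω, τ)|^2 is returned for each frame.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w, indexed from the window start."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def spectrogram(x, window, hop, n_freqs):
    """Return S[frame][k] = |X(omega_k, tau)|^2 with omega_k = pi * k / n_freqs.

    The window slides by `hop` samples per frame (the frame interval).
    """
    n_w = len(window)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        segment = [window[i] * x[tau + i] for i in range(n_w)]  # windowed segment x[n, tau]
        row = []
        for k in range(n_freqs):
            omega = math.pi * k / n_freqs
            X = sum(s * cmath.exp(-1j * omega * (tau + i)) for i, s in enumerate(segment))
            row.append(abs(X) ** 2)
        frames.append(row)
    return frames


x = [math.cos(math.pi * n / 4) for n in range(256)]    # steady sinusoid at omega = pi/4
S = spectrogram(x, hamming(64), hop=32, n_freqs=32)
peak_bins = [row.index(max(row)) for row in S]          # strongest frequency bin per frame
```

For this steady sinusoid every frame peaks at the same bin; a direct sum is used instead of an FFT purely to keep the correspondence with the formula visible.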
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
S(ω, τ) is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position τ one could plot S(ω, τ). A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n, τ].
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved" and appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
This widening causes neighboring window transforms at the harmonics to overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
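The rule of thumb from the preceding slides (narrowband: window of at least two pitch periods; wideband: window shorter than one pitch period) can be checked for the 20-ms and 4-ms windows mentioned above. The 16 kHz sampling rate, the example pitches, and the label "intermediate" for windows that satisfy neither condition are assumptions introduced here.

```python
def window_length_samples(duration_s, fs):
    """Window length in samples for a given duration and sampling rate."""
    return int(round(duration_s * fs))


def spectrogram_type(window_duration_s, pitch_hz):
    """Classify a window relative to the pitch period, following the slide's criteria."""
    pitch_period_s = 1.0 / pitch_hz
    if window_duration_s >= 2 * pitch_period_s:
        return "narrowband"
    if window_duration_s < pitch_period_s:
        return "wideband"
    return "intermediate"  # neither criterion holds


fs = 16000                                        # assumed sampling rate
long_window = window_length_samples(0.020, fs)    # 20-ms window
short_window = window_length_samples(0.004, fs)   # 4-ms window

low_voice = (spectrogram_type(0.020, 100.0), spectrogram_type(0.004, 100.0))
high_voice = spectrogram_type(0.004, 300.0)
```

For a 100 Hz pitch the two windows fall cleanly into the narrowband and wideband cases, while for a 300 Hz pitch the 4-ms window is no longer shorter than one pitch period, which is one way to see why high-pitched speech is harder to analyze with fixed window lengths.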
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram; horizontal axis Time [s] (0 to 3), vertical axis Frequency [Hz] (0 to 8000)]
MATLAB
[Figure: waveform (Normalized Magnitude) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) for the utterance "She had your dark suit in greasy wash water all year", with phone and word labels: h sh iy (she), hv ae dcl jh (had), ax-h (your), dcl d aa r kcl k (dark), s ux tcl (suit), ax-h n (in), gcl g r iy s ix (greasy), w aa sh (wash), w ao dx axr (water), ao l (all), y ih axr (year), h]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of the constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets. Syllables contain one or more phonemes, words are formed from one or more syllables, and phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
Is dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction made by the tongue (at the back, center, or front of the oral tract) or by the teeth or lips determines which sound is produced.
Spectrogram: noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
u[n] = g[n] * p[n], where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
In a simplified model, the output at the lips due to the periodic source is
x_g[n] = h[n] * (g[n] * p[n]), where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component of the turbulent airflow velocity at the constriction is denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity, with impulse response h_f[n]:
x_q[n] = h_f[n] * (q[n] u[n])
Example 3.4 (cont.)
In the simplified model we assume that the results of the two airflow sources add:
x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
u[n] is modified by the oral cavity.
x_q[n] can be influenced by the back cavity.
Sources of non-linear effects (distributed sources due to traveling vortices).
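The voiced-fricative model of Example 3.4, x[n] = h[n] * u[n] + h_f[n] * (q[n] u[n]) with u[n] = g[n] * p[n], can be sketched as follows. The short sequences used for g[n], h[n], and h_f[n], the pitch period of 10 samples, and the Gaussian white noise for q[n] are placeholder assumptions chosen only to make the structure of the model visible.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, h, h_f, period, n_periods, seed=0):
    """Return (x, u): model output and periodic glottal flow, truncated to n samples."""
    n = period * n_periods
    rng = random.Random(seed)
    p = [1.0 if i % period == 0 else 0.0 for i in range(n)]  # impulse train p[n]
    u = convolve(g, p)[:n]                                   # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(n)]              # white noise q[n] (assumed Gaussian)
    x_g = convolve(h, u)[:n]                                 # periodic component
    x_q = convolve(h_f, [qi * ui for qi, ui in zip(q, u)])[:n]  # noise modulated by u[n]
    return [a + b for a, b in zip(x_g, x_q)], u


g = [0.0, 0.5, 1.0, 0.5]     # toy glottal pulse over one cycle
h = [1.0, 0.6, 0.36]         # toy vocal tract response
h_f = [1.0, -0.5]            # toy front-cavity response
x, u = voiced_fricative(g, h, h_f, period=10, n_periods=4)
```

Because the noise is multiplied by u[n], the noise component vanishes wherever the glottal flow is zero, so the turbulence is pulsed at the pitch rate.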
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
Anatomy and Physiology of Speech Production: States of the Vocal Folds
- Breathing
- Voicing
- Unvoicing: turbulence at the vocal folds (aspiration). Example: "he", whispered sounds.
- Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while the rest remains nearly fixed.
Anatomy and Physiology of Speech Production: Other Forms of Atypical Vocal Fold Movement
- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling between the false and true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Anatomy and Physiology of Speech Production
Examples of atypical voice types
Vocal Tract
- Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
- The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
- The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
- The purpose of the vocal tract is to:
  - spectrally "color" the source, and
  - generate new sources for sound production.
Spectral Shaping
- Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
- The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
- Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
- The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
- For a time-invariant, all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
  - the frequency of the formant is $\omega_0$;
  - the bandwidth depends on the distance of the pole from the unit circle ($r_0$).
- Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
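The relation between pole position and formant can be made concrete with a small sketch. It assumes a sampling rate `fs` (not given in the slides) to convert $\omega_0$ to Hz, and it uses the common approximation $B \approx -(f_s/\pi)\ln r_0$ for the bandwidth of a pole near the unit circle; that formula is a standard rule of thumb, not something stated in the slides. The function names are illustrative only.

```python
import cmath
import math

def pole_to_formant(r0, omega0, fs):
    """Return (frequency_hz, bandwidth_hz) for a pole z0 = r0 * exp(j * omega0)."""
    frequency_hz = omega0 * fs / (2.0 * math.pi)
    # Assumed approximation for a pole close to the unit circle.
    bandwidth_hz = -math.log(r0) * fs / math.pi
    return frequency_hz, bandwidth_hz

def resonator_magnitude(r0, omega0, omega):
    """|H| at z = exp(j * omega) for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1))."""
    c = cmath.rect(r0, omega0)
    z_inv = cmath.exp(-1j * omega)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

fs = 8000.0                                # assumed sampling rate in Hz
omega0 = 2.0 * math.pi * 500.0 / fs        # pole angle for a 500 Hz formant
for r0 in (0.90, 0.95, 0.99):
    f, bw = pole_to_formant(r0, omega0, fs)
    peak = resonator_magnitude(r0, omega0, omega0)
    print(f"r0={r0:.2f}  F={f:.0f} Hz  B={bw:.0f} Hz  |H(omega0)|={peak:.1f}")
```

As the pole moves toward the unit circle, the estimated bandwidth shrinks and the spectral peak at $\omega_0$ grows, which is the qualitative behavior the slide describes.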
Spectral Shaping
- Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
- In general, the formant frequencies decrease as the vocal tract length increases:
  - male speakers tend to have lower formants than female speakers;
  - female speakers tend to have lower formants than children.
- Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
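The dependence of formants on vocal tract length can be illustrated with a rough sketch. It assumes the idealized model of a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of $c/(4L)$; this model and the speed of sound value are assumptions introduced here, not taken from the slides.

```python
def uniform_tube_formants(length_m, count=3, speed_of_sound=350.0):
    """Resonances (Hz) of an idealized uniform tube: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m) for k in range(1, count + 1)]

# 17 cm is the average adult male length quoted above; 14 cm is a hypothetical shorter tract.
for length_cm in (17.0, 14.0):
    formants = uniform_tube_formants(length_cm / 100.0)
    print(length_cm, [round(f) for f in formants])
```

Under these assumptions the shorter tube yields uniformly higher resonances, consistent with the ordering of male, female, and child formants.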
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.
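Example 3.2
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only a glottal contribution).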
Example 3.2: Figure 3.11
Example 3.2
- The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  - the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  - the manner in which formant tails add.
- Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
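The sparse-sampling effect can be checked with simple arithmetic. The sketch below counts the harmonics that fall below an analysis bandwidth and measures how far the nearest harmonic lies from a formant; the 4 kHz bandwidth, the 550 Hz formant, and the two pitch values are illustrative assumptions, not values from the slides.

```python
def harmonic_frequencies(f0_hz, max_hz):
    """Harmonics k * f0 (k = 1, 2, ...) that do not exceed max_hz."""
    count = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]

def nearest_harmonic_offset(formant_hz, f0_hz, max_hz):
    """Distance in Hz from a formant to the closest source harmonic."""
    return min(abs(h - formant_hz) for h in harmonic_frequencies(f0_hz, max_hz))

for f0 in (100.0, 400.0):          # low-pitched vs high-pitched voice (assumed values)
    harmonics = harmonic_frequencies(f0, 4000.0)
    offset = nearest_harmonic_offset(550.0, f0, 4000.0)
    print(f"f0={f0:.0f} Hz: {len(harmonics)} harmonics, nearest to 550 Hz is {offset:.0f} Hz away")
```

With the higher pitch, far fewer harmonics sample the envelope and the closest one sits further from the formant peak, so the peak is harder to locate from $|X(\omega, \tau)|$.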
Spectral Shaping
- The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  - a formant corresponds to a vocal tract pole (resonant frequency);
  - harmonics arise from the periodicity of the glottal source (pitch).
- In signal processing algorithms that require formants, the sparsity of spectral information can be detrimental to formant estimation.
- On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
- A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
- To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
Figure 3.12
Nasal Sounds
Spectral Shaping
- The nasal and oral components of the vocal tract are coupled by the velum.
- When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
- The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
- Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
- The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
- Two dominant effects characterize nasalization:
  - broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage;
  - introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
- The previous section discussed the effect of vocal tract shape on sound production.
- Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  - build-up of pressure behind the closure at the palate, and
  - abrupt release of pressure.
Source Generation: Plosives, "Drop" (figure with the voice onset time, VOT, and the aspiration marked)
Fricatives
Source Generation
- Another sound source is created when the tongue is very close to the palate (without completely blocking the airflow); the constriction generates turbulence and thus a noise source (e.g., fricatives).
- As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  - there is no harmonic structure with these types of input;
  - the source spectrum is shaped at all frequencies by $|H(\omega)|$.
- Note that the spectrum of the noise was idealized as flat. In reality, these sources themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
- Another class of source is generated within the vocal tract, but it is less understood than the noisy and impulsive sources at oral tract constrictions.
  - This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  - A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  - Fricatives: sounds generated from the friction of moving air against an oral tract constriction. Example: "thin".
  - Plosives: created with an impulsive source within the oral tract. Example: "top".
  - Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
- However, these sound types do not relate exclusively to the unvoiced source. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
  - fricatives: "zebra" (voiced) vs. "sheba" (unvoiced);
  - plosives: "bin" (voiced) vs. "pin" (unvoiced).
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
- A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  - Example: for the word "to", a single Fourier transform of the entire acoustic signal cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
- Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.
- The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
- Example: the Hamming window
$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \qquad 0 \le n - \tau \le N_w - 1$$
- The window does not necessarily move one sample at a time; typically it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
- The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau]\, x[n]$$
represents the windowed speech segments as a function of the window center at time $\tau$.
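A minimal, unoptimized sketch of the windowing and transform steps follows, using only the Python standard library. The function names, the hop size, and the test tone are assumptions for illustration. Each frame is transformed with time measured from the start of the window, which differs from the definition above only by a linear phase factor and leaves the magnitude unchanged.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (n_w - 1)), n = 0..n_w-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft(x, n_w, hop, n_fft):
    """Return frames[i][k]: DFT bin k (0..n_fft/2) of the segment starting at tau = i * hop."""
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        segment = [w[n] * x[tau + n] for n in range(n_w)]
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_w))
            for k in range(n_fft // 2 + 1)
        ]
        frames.append(spectrum)
    return frames

# Test tone: a sinusoid with exactly 8 cycles per 64 samples.
x = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft(x, n_w=64, hop=32, n_fft=64)
peak_bins = [max(range(len(fr)), key=lambda k: abs(fr[k])) for fr in frames]
print(len(frames), peak_bins)
```

The hop of 32 samples plays the role of the frame interval: the window advances by half its length rather than one sample at a time.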
Spectrographic Analysis of Speech
- The spectrogram is displayed graphically as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
- $S(\omega, \tau)$ is a two-dimensional representation of the "energy density" of the signal.
- For each window position $\tau$ one could plot $S(\omega, \tau)$. A more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
- For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$, with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
- From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
- The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
- Narrowband spectrogram:
  - Uses a "long" window with a duration of typically at least two pitch periods.
  - Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the spectrogram expression for voiced speech gives the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
- The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  - Shortening the window widens its Fourier transform (recall the uncertainty principle).
  - The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
- Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
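The origin of the vertical striations can be seen in a toy computation of the energy under the sliding window. The sketch assumes a sampling rate of 8 kHz, a 100 Hz pitch, a rectangular window in place of the Hamming window, and a decaying pulse per pitch period as a stand-in for voiced speech; all of these are illustrative choices rather than values from the slides (only the 20-ms and 4-ms durations come from the Figure 3.15 description).

```python
def short_time_energy(x, n_w, hop):
    """Energy under a rectangular window of n_w samples starting at each tau."""
    return [sum(v * v for v in x[tau:tau + n_w]) for tau in range(0, len(x) - n_w + 1, hop)]

fs = 8000                     # assumed sampling rate in Hz
pitch_period = 80             # P = 80 samples, i.e. a 100 Hz pitch at fs = 8000
# Stand-in for voiced speech: one decaying pulse per pitch period.
x = [0.9 ** (n % pitch_period) for n in range(10 * pitch_period)]

for label, duration_ms in (("narrowband", 20), ("wideband", 4)):
    n_w = fs * duration_ms // 1000
    energy = short_time_energy(x, n_w, hop=8)
    print(f"{label}: {n_w} samples = {n_w / pitch_period:.1f} pitch periods, "
          f"max/min energy = {max(energy) / min(energy):.3g}")
```

The 20-ms window spans two pitch periods and sees nearly constant energy, while the 4-ms window spans less than one period and its energy rises and falls once per period, which is what appears as vertical striations.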
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram; frequency 0-8000 Hz versus time 0-3 s]
MATLAB
[Figure: waveform (normalized magnitude versus time, 0-3 s) above its spectrogram (frequency 0-8000 Hz versus time 0-3 s) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr), with an h label at each end of the utterance]
Categorization of Speech Sounds
- A sound source can be created with either the vocal folds or a constriction in the vocal tract.
- Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
- Phoneme: a fundamental distinctive unit of a language.
  - To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  - Different languages contain different phoneme sets.
- Syllables contain one or more phonemes.
- Words are formed from one or more syllables.
- Phrases are concatenations of words.
- Studying speech sounds using the first two classification perspectives (source and vocal tract shape) is referred to as articulatory phonetics.
- Studying speech sounds using the last two perspectives (waveform and spectrogram) is referred to as acoustic phonetics.
Elements of a Language
- One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
- This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
- Articulatory features (corresponding to the first two classification perspectives) include:
  - Vocal fold state: vibrating or open.
  - Tongue position and height: front, central, or back along the palate.
  - Constriction: partial or complete.
  - Velum state: nasal sound or not a nasal sound.
Elements of a Language
- In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
- The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
- The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
- Motor theory of perception: uses articulatory features derived from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
- Source: quasi-periodic.
  - Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
- In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences between speakers are one cause of allophonic variation:
  - the place and degree of constriction of the tongue hump, and
  - the cross-section and length of the vocal tract;
  - therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram:
  - Dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  - The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
- There are two broad classes of fricatives: voiced and unvoiced.
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
- System: the location of the constriction formed by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
- Spectrogram: noise-like, with energy concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$.
The noise source component is the turbulent airflow velocity at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product excites the front oral cavity, which has impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
Example 3.4
- We assume in this simplified model that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
- See Exercise 3.10 for special characteristics of $x[n]$.
- Issues that have been ignored:
  - $u[n]$ is modified by the oral cavity;
  - $x_q[n]$ can be influenced by the back cavity;
  - sources of nonlinear effects (distributed sources due to traveling vortices).
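The additive model of Example 3.4 can be sketched in a few lines. The glottal pulse shape, the two short impulse responses, and the Gaussian noise generator below are hypothetical placeholders chosen only to make the structure visible; they are not realistic vocal tract responses and do not come from the slides.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def voiced_fricative(g, pitch_period, n_samples, h, h_f, seed=0):
    """Return (u, x_g, x_q, x) for the simplified two-source model."""
    rng = random.Random(seed)
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    u = convolve(g, p)[:n_samples]                         # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]    # white noise q[n]
    x_g = convolve(h, u)[:n_samples]                       # h[n] * u[n]
    modulated = [qn * un for qn, un in zip(q, u)]          # q[n] u[n]
    x_q = convolve(h_f, modulated)[:n_samples]             # h_f[n] * (q[n] u[n])
    x = [a + b for a, b in zip(x_g, x_q)]
    return u, x_g, x_q, x

u, x_g, x_q, x = voiced_fricative(
    g=[0.0, 0.5, 1.0, 0.5], pitch_period=10, n_samples=40,
    h=[1.0, 0.5], h_f=[1.0, -0.9],
)
print([round(v, 2) for v in x[:12]])
```

Because the noise is multiplied by $u[n]$, the noise component vanishes wherever the glottal flow is zero, so the frication is pulsed at the pitch rate.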
Elements of a Language: Fricatives
- Spectrogram:
  - Unvoiced fricatives are characterized by a "noisy" spectrum.
  - Voiced fricatives often show both noise and harmonics.
- Waveform:
  - An unvoiced fricative contains only noise.
  - A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
- Whisper forms a class of its own under the general category of consonants.
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24: Fricatives
Figure 3.23
Elements of a Language: Plosives
- Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
- System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
- Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
- With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  - While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  - After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and the delay between the burst and the voicing of the vowel onset is much shorter.
- Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive
- A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
- Assume that the vocal tract is linear but time-varying, because its shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained by the convolution operator and must be computed using the time-varying impulse response concept introduced in Chapter 2.
- In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$. Here $h[n, m]$ and $h_f[n, m]$ are the responses at time $n$ to a unit sample applied $m$ samples earlier, at time $n - m$.
- Assuming that the two outputs can be combined linearly, the output can be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
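The generalized convolution can be sketched directly from its definition. The example assumes causal inputs and responses (so the sum runs over $0 \le m \le n$) and uses a hypothetical one-pole response $h[n, m] = a_n^m$ whose pole glides over time; the pole values and the crude periodic source are placeholders, not a model of any real vocal tract.

```python
def time_varying_output(u, h):
    """x[n] = sum over m of h(n, m) * u[n - m], for causal u[0..N-1] and m >= 0."""
    return [sum(h(n, m) * u[n - m] for m in range(n + 1)) for n in range(len(u))]

def gliding_pole(start, stop, n_samples):
    """Hypothetical response h(n, m) = a[n] ** m with a[n] moving linearly from start to stop."""
    def h(n, m):
        a = start + (stop - start) * n / (n_samples - 1)
        return a ** m
    return h

n_samples = 8
burst = [1.0] + [0.0] * (n_samples - 1)                       # idealized burst, delta[n]
u = [1.0 if n % 4 == 0 else 0.0 for n in range(n_samples)]    # crude periodic source, P = 4
h = gliding_pole(0.5, 0.9, n_samples)        # time-varying vocal tract (placeholder)
h_f = gliding_pole(0.2, 0.2, n_samples)      # front cavity, held fixed here for simplicity
x = [a + b for a, b in zip(time_varying_output(u, h), time_varying_output(burst, h_f))]
print([round(v, 3) for v in x])
```

When the pole is held constant, `time_varying_output` reduces to ordinary convolution, which is a quick sanity check on the generalization.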
Elements of a Language: Transitional Speech Sounds
- Diphthongs:
  - Vowel-like in nature, with vibrating vocal folds.
  - Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  - Characterized by movement from one vowel target to another.
  - Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
Elements of a Language: Transitional Speech Sounds
- Semi-vowels: two categories of vowel-like sounds:
  - glides (/w/ as in "we" and /y/ as in "you"), and
  - liquids (/r/ as in "read" and /l/ as in "let").
- Glides, compared to diphthongs, have:
  - greater constriction of the oral tract during the transition, and
  - greater speed of the oral tract movement.
Figure 3.28: Liquids and Glides
Elements of a Language: Transitional Speech Sounds
- Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
- Compared to fricatives, affricates have a fricative portion that is preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
- Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).
Coarticulation
- Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  - Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  - Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
- Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  - Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe"; "sweep" vs. "seep".
  - Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
- The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  - intonation (change in pitch),
  - amplitude/energy (loudness),
  - timing (articulation rate or rhythm).
- These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29: Prosody
Figure 3.30: Global Coarticulation
April 22 2023 Veton Keumlpuska 23
Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch.
  - Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse.
  - Result of coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).
Examples of atypical voice types (figure)
Vocal Tract
- Comprises the oral cavity from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
- The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
- The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
- The purpose of the vocal tract is to:
  - Spectrally "color" the source, and
  - Generate new sources for sound production.
Spectral Shaping
- Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
- The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
- Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
- The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
- Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  - The frequency of the formant is $\omega_0$.
  - The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
- Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
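The link between pole position and formant can be made concrete with a small Python sketch that evaluates one conjugate pole pair of the all-pole model. The 8000 Hz sampling rate, the 500 Hz formant, the two pole radii, and the function names are illustrative assumptions, not values from the slides. The bandwidth expression is the common narrow-bandwidth approximation B ≈ -ln(r0)·fs/π, not an exact result.

```python
import cmath
import math


def formant_from_pole(r0, omega0, fs):
    """Map a pole z0 = r0*exp(j*omega0) to (frequency in Hz, approx. bandwidth in Hz)."""
    freq_hz = omega0 * fs / (2.0 * math.pi)
    # Narrow-bandwidth approximation: the closer r0 is to 1, the sharper the resonance.
    bandwidth_hz = -math.log(r0) * fs / math.pi
    return freq_hz, bandwidth_hz


def pole_pair_magnitude(r0, omega0, omega):
    """|H(e^{j*omega})| for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)), c = r0*exp(j*omega0)."""
    c = cmath.rect(r0, omega0)
    z_inv = cmath.exp(-1j * omega)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))


fs = 8000.0                              # assumed sampling rate (Hz)
omega0 = 2.0 * math.pi * 500.0 / fs      # assumed formant at 500 Hz
results = {}
for r0 in (0.90, 0.98):                  # assumed pole radii
    freq_hz, bw_hz = formant_from_pole(r0, omega0, fs)
    peak = pole_pair_magnitude(r0, omega0, omega0)
    results[r0] = (freq_hz, bw_hz, peak)
    print(f"r0={r0:.2f}: formant {freq_hz:.0f} Hz, bandwidth ~{bw_hz:.0f} Hz, |H| at formant {peak:.1f}")
```

Moving the pole toward the unit circle leaves the formant frequency unchanged but narrows the bandwidth and raises the spectral peak.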
Spectral Shaping
- Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
- In general, the formant frequencies decrease as the vocal tract length increases:
  - Male speakers tend to have lower formants than female speakers.
  - Female speakers tend to have lower formants than children.
- Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
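The inverse relation between tract length and formant frequency can be illustrated with the classical uniform lossless tube, closed at the glottis and open at the lips, whose resonances are F_k = (2k - 1)c/(4L). The tube model, the speed of sound c = 340 m/s, and the 14 cm comparison length are assumptions introduced here for illustration; only the 17 cm adult male length comes from the slides.

```python
def uniform_tube_formants(length_m, count=3, speed_of_sound=340.0):
    """Resonances (Hz) of a uniform lossless tube closed at one end: (2k-1)*c/(4L)."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m) for k in range(1, count + 1)]


long_tract = uniform_tube_formants(0.17)    # 17 cm, adult male average from the slides
short_tract = uniform_tube_formants(0.14)   # hypothetical shorter tract
print([round(f) for f in long_tract])
print([round(f) for f in short_tract])
```

Under these assumptions the 17 cm tube resonates near 500, 1500, and 2500 Hz, and every resonance of the shorter tube lies higher.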
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$u[n] = g[n] * p[n]$
where $*$ denotes convolution, $g[n]$ is the airflow over one glottal cycle, and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by
$x[n] = h[n] * (g[n] * p[n])$
A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment
$x[n, \tau] = w[n, \tau] \{ h[n] * (g[n] * p[n]) \}$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.
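Example 3.2 (cont.)
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
- Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).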
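The harmonic sampling in Example 3.2 can be checked numerically with a minimal Python sketch. It makes several simplifying assumptions that are not in the slides: the pulse shape g[n], the resonant response h[n], the sizes N and P, circular rather than linear convolution, and no tapering window (the analysis spans a whole number of pitch periods, so each window transform collapses onto a single DFT bin).

```python
import cmath
import math


def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(i - m) % n] for m in range(n)) for i in range(n)]


def dft_magnitude(x):
    """Magnitude of the N-point DFT, computed directly from the definition."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n)]


N, P = 240, 20                                        # assumed length and pitch period (samples)
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]    # unit sample train with spacing P
g = [0.2, 0.6, 1.0, 0.4] + [0.0] * (N - 4)            # hypothetical glottal pulse g[n]
h = [0.9 ** n * math.cos(0.6 * n) for n in range(N)]  # hypothetical resonant h[n]

u = circular_convolve(g, p)                           # u[n] = g[n] * p[n]
x = circular_convolve(h, u)                           # x[n] = h[n] * u[n]
mag = dft_magnitude(x)

spacing = N // P                                      # harmonics fall on every (N/P)-th bin
on_harmonics = [mag[k] for k in range(N) if k % spacing == 0]
off_harmonics = [mag[k] for k in range(N) if k % spacing != 0]
print(f"largest harmonic bin: {max(on_harmonics):.3f}")
print(f"largest off-harmonic bin: {max(off_harmonics):.2e}")
```

The energy sits only at multiples of the fundamental, with heights set by the envelope |H(ω)G(ω)| at those frequencies; between harmonics the spectrum is zero up to rounding error.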
Example 3.2 (cont.): Figure 3.11
Example 3.2 (cont.)
- The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  - The nature of the glottal flow waveform over a cycle, e.g. a gradual or abrupt closing, and
  - The manner in which formant tails add.
- Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
Spectral Shaping
- The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
  - A formant corresponds to a vocal tract pole (resonant frequency).
  - Harmonics arise from the periodicity of the glottal source (pitch).
- In developing signal processing algorithms that require formants, the scarcity of spectral information between harmonics can be a detriment to formant estimation.
- On the other hand, the spectral sampling at the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
- A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
- To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
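The effect described in Example 3.3 can be sketched in Python by sampling a single resonance at the first harmonic. The single pole-pair model, the 16 kHz sampling rate, the pole radius, the 1000 Hz fundamental, and the two F1 values are all hypothetical choices for illustration.

```python
import cmath
import math


def resonance_gain(formant_hz, freq_hz, fs=16000.0, r0=0.97):
    """|H| of one conjugate pole pair (radius r0, angle set by formant_hz) at freq_hz."""
    c = cmath.rect(r0, 2.0 * math.pi * formant_hz / fs)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))


f0 = 1000.0                                   # hypothetical sung fundamental (Hz)
gain_low_f1 = resonance_gain(500.0, f0)       # F1 well below the first harmonic
gain_matched_f1 = resonance_gain(1000.0, f0)  # F1 raised to match the first harmonic
boost_db = 20.0 * math.log10(gain_matched_f1 / gain_low_f1)
print(f"first-harmonic boost from matching F1 to f0: {boost_db:.1f} dB")
```

In this toy model the first harmonic is sampled on the skirt of the resonance when F1 is low and on its peak when F1 matches the fundamental, which is the source of the louder sound.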
Figure 3.12
Nasal Sounds
Spectral Shaping
- The nasal and oral components of the vocal tract are coupled by the velum.
- When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
- The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "nose" (figure)
Spectral Shaping: "mouse" (figure)
Spectral Shaping
- Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
- The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
- Two dominant effects characterize nasalization:
  - Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage.
  - Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
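The effect of an anti-resonance can be sketched in Python by comparing an all-pole response with a pole-zero response at the zero frequency. The sampling rate, the single pole pair at 300 Hz, the zero pair at 1200 Hz, and their radii are hypothetical values chosen only to show the dip; they are not measurements of a real nasal tract.

```python
import cmath
import math


def second_order_factor(radius, freq_hz, omega, fs):
    """Evaluate (1 - c z^-1)(1 - c* z^-1) at z = exp(j*omega), c = radius*exp(j*2*pi*freq/fs)."""
    c = cmath.rect(radius, 2.0 * math.pi * freq_hz / fs)
    z_inv = cmath.exp(-1j * omega)
    return (1 - c * z_inv) * (1 - c.conjugate() * z_inv)


def magnitude_db(freq_hz, fs, pole, zero=None):
    """20*log10|H| at freq_hz for one pole pair and an optional zero pair (radius, Hz)."""
    omega = 2.0 * math.pi * freq_hz / fs
    h = 1.0 / second_order_factor(pole[0], pole[1], omega, fs)
    if zero is not None:
        h *= second_order_factor(zero[0], zero[1], omega, fs)
    return 20.0 * math.log10(abs(h))


fs = 8000.0                 # assumed sampling rate (Hz)
pole = (0.97, 300.0)        # hypothetical low nasal resonance
zero = (0.95, 1200.0)       # hypothetical anti-resonance
all_pole_db = magnitude_db(1200.0, fs, pole)
pole_zero_db = magnitude_db(1200.0, fs, pole, zero)
print(f"at 1200 Hz: all-pole {all_pole_db:.1f} dB, with zero pair {pole_zero_db:.1f} dB")
```

The zero pair pulls the response down around 1200 Hz, which is the kind of spectral dip an anti-resonance introduces.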
Plosives
Source Generation
- The previous section discussed the effect of vocal tract shape on sound production.
- Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  - Build-up of pressure behind the palate, and
  - Abrupt release of pressure.
Source Generation: plosives, "drop" (figure annotated with VOT and aspiration)
Fricatives
Source Generation
- Another sound source is created when the tongue is very close to the palate (without completely blocking the airflow), generating turbulence and thus a noise source (e.g. fricatives).
- As with the periodic glottal source, spectral shaping also occurs for either type of input (impulse or noise source):
  - There is no harmonic structure with these types of inputs.
  - The source spectrum is shaped at all frequencies by $|H(\omega)|$.
- Note that the noise source was idealized as having a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: fricatives, "NASA" (figure)
Source Generation
- There is another class of source generated within the vocal tract that is less understood than the noisy and impulsive sources at oral tract constrictions.
  - This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  - A vortex can be thought of as a tiny rotational airflow in the oral tract.
  - There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
  - Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  - Plosives: created with an impulsive source within the oral tract. Example: "top".
  - Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
- However, these source types are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  - Fricatives: "zebra" vs. "sheba"
  - Plosives: "bin" vs. "pin"
Categorization of Sound by Source (figure)
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
- The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  - Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
- Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.
- The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  - Example: Hamming window, $w[n, \tau] = 0.54 - 0.46 \cos[2\pi (n - \tau)/(N_w - 1)]$ for $0 \le n - \tau \le N_w - 1$.
- The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
- The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segment as a function of the window center at time $\tau$.
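The window and the short-time Fourier transform can be sketched directly from their definitions in Python. The test tone, the window length of 100 samples, the frame interval of 50 samples, and the function names are assumptions made for illustration. In this sketch the window support runs from τ to τ + Nw - 1, following the Hamming formula.

```python
import cmath
import math


def hamming(nw):
    """Hamming window of length nw: 0.54 - 0.46*cos(2*pi*m/(nw - 1)), m = 0..nw-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (nw - 1)) for m in range(nw)]


def stft_frame(x, tau, nw, omega):
    """X(omega, tau) = sum_n w[n - tau] x[n] exp(-j*omega*n), window support tau..tau+nw-1."""
    w = hamming(nw)
    stop = min(tau + nw, len(x))
    return sum(w[n - tau] * x[n] * cmath.exp(-1j * omega * n) for n in range(tau, stop))


omega_tone = 2.0 * math.pi * 0.1                      # hypothetical test tone (rad/sample)
x = [math.cos(omega_tone * n) for n in range(400)]
frame_interval = 50                                   # assumed hop between window positions
for tau in range(0, 300, frame_interval):
    at_tone = abs(stft_frame(x, tau, 100, omega_tone))
    off_tone = abs(stft_frame(x, tau, 100, 3.0 * omega_tone))
    print(tau, round(at_tone, 2), round(off_tone, 4))
```

At every window position the magnitude at the tone frequency is far larger than the magnitude away from it; the tapered window keeps the leakage into distant frequencies small.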
Spectrographic Analysis of Speech
- The spectrogram is graphically displayed as $S(\omega, \tau) = |X(\omega, \tau)|^2$.
- $S(\omega, \tau)$ is a two-dimensional representation of the "energy density" of the signal.
- For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.
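A spectrogram is simply the squared STFT magnitude collected over successive window positions. The following Python sketch builds one by direct summation. The two-tone test signal, the 64-sample window, and the 32-sample frame interval are hypothetical. The DFT phase is referenced to the start of each frame, which does not change the magnitude.

```python
import cmath
import math


def hamming(nw):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (nw - 1)) for m in range(nw)]


def spectrogram(x, nw, hop):
    """Return S[frame][k] = |X(omega_k, tau)|^2 for omega_k = 2*pi*k/nw, k = 0..nw/2."""
    w = hamming(nw)
    frames = []
    for tau in range(0, len(x) - nw + 1, hop):
        segment = [w[m] * x[tau + m] for m in range(nw)]
        frames.append([
            abs(sum(segment[m] * cmath.exp(-2j * math.pi * k * m / nw) for m in range(nw))) ** 2
            for k in range(nw // 2 + 1)
        ])
    return frames


nw, hop = 64, 32                                     # assumed window length and frame interval
# Hypothetical test signal: a tone at bin 8 followed by a tone at bin 20.
x = [math.cos(2.0 * math.pi * 8 * n / nw) for n in range(256)]
x += [math.cos(2.0 * math.pi * 20 * n / nw) for n in range(256)]
S = spectrogram(x, nw, hop)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in S]
print(peak_bins)
```

The peak bin per frame tracks the change in frequency content over time, which a single Fourier transform of the whole signal could not localize.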
Spectrographic Analysis of Speech (figure)
Wideband spectrogram (figure)
Narrowband spectrogram (figure)
Spectrographic Analysis of Speech
- For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \} = w[n, \tau] ( p[n] * \hat{h}[n] )$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
- From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
- The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband spectrogram
- Uses a "long" window with a duration of typically at least two pitch periods.
- Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide gives the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \left| \hat{H}(\omega_k) \right|^2 \left| W(\omega - \omega_k, \tau) \right|^2$$
Narrowband spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. a plosive closely followed by a voiced sound is poorly represented).
Wideband spectrogram
- The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Wideband spectrogram (cont.)
- For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta \left| \hat{H}(\omega) \right|^2 E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,
$$E[\tau] = \sum_{n=-\infty}^{\infty} \left| x[n, \tau] \right|^2$$
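The energy under a sliding window that is shorter than a pitch period can be computed with a short Python sketch. The decaying test waveform, the pitch period of 80 samples, and the 32-sample Hamming window are hypothetical values chosen to make the fluctuation visible.

```python
import math


def hamming(nw):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (nw - 1)) for m in range(nw)]


def short_time_energy(x, tau, nw):
    """E[tau] = sum_n |w[n - tau] x[n]|^2 over the window support tau..tau+nw-1."""
    w = hamming(nw)
    return sum((w[m] * x[tau + m]) ** 2 for m in range(nw))


P, nw = 80, 32                                   # assumed pitch period and a window shorter than P
x = [0.9 ** (n % P) for n in range(3 * P)]       # hypothetical decaying response repeated every P
energy = [short_time_energy(x, tau, nw) for tau in range(0, 2 * P + 1, 8)]
print([round(e, 4) for e in energy])
```

The energy rises and falls once per pitch period as the window slides, which is the fluctuation that scales the envelope in the approximation for the wideband spectrogram.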
Wideband spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
- Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
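The two window durations can be related to the slides' definitions with a small Python calculation. The 10 kHz sampling rate and the 100 Hz pitch are assumptions; the thresholds (at least two pitch periods for narrowband, less than one for wideband) are the ones stated on the slides, and the "intermediate" label is introduced here only to cover the gap between them.

```python
def window_summary(duration_s, pitch_hz, fs):
    """Window length in samples, pitch periods spanned, and the slides' spectrogram type."""
    samples = round(duration_s * fs)
    periods = duration_s * pitch_hz
    if periods >= 2.0 - 1e-9:
        kind = "narrowband"      # at least two pitch periods
    elif periods < 1.0:
        kind = "wideband"        # less than one pitch period
    else:
        kind = "intermediate"    # between the two definitions
    return samples, periods, kind


fs = 10000.0         # assumed sampling rate (Hz)
pitch_hz = 100.0     # assumed pitch, i.e. a 10-ms pitch period
for duration_s in (0.020, 0.004):
    print(duration_s, window_summary(duration_s, pitch_hz, fs))
```

For a higher-pitched voice the same 20-ms window spans more pitch periods, so the classification depends on the speaker as well as on the window.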
Figure 3.15
Figure 3.16
MATLAB figures: spectrogram (frequency 0 to 8000 Hz versus time 0 to 3 s) and waveform (normalized magnitude versus time) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.
Categorization of Speech Sounds
- A sound source can be created with either the vocal folds or a constriction in the vocal tract.
- Speech sounds can be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
     - The place of the tongue hump along the oral tract, and
     - The degree of constriction of the hump.
     - The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
- Phoneme: a fundamental distinctive unit of a language.
  - To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  - Different languages contain different phoneme sets.
- Syllables contain one or more phonemes.
- Words are formed from one or more syllables.
- Phrases are concatenations of words.
- If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
- If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
- One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
- This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
- Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open
  - Tongue position and height: front, central, or back along the palate
  - Constriction: partial or complete
  - Velum state: nasal sound or not a nasal sound
Elements of a Language
- In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
- The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are normally not allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
- The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
- Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
- Source: quasi-periodic.
  - Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
- In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variation:
  - The place and degree of constriction of the tongue hump, and
  - The cross-section and length of the vocal tract.
  - Therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram:
  - Is dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  - Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
- There are two broad classes of fricatives: voiced and unvoiced.
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
- System: the location of the constriction made by the tongue or lips determines which sound is produced: back, center, or front of the oral tract, as well as the teeth or lips.
- Spectrogram: noise-like, with energy concentrated in the higher frequencies.
Example 3.4
- A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
- In a simplified model, the contribution of the periodic source to the output at the lips is
$x_g[n] = h[n] * (g[n] * p[n])$
where $h[n]$ is the impulse response of the linear time-invariant vocal tract excited by the periodic signal $u[n]$.
- The noise source component, the turbulent airflow velocity at the constriction, is denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:
$x_q[n] = h_f[n] * (q[n]\, u[n])$
- The simplified model assumes that the results of the two airflow sources add:
$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$
- See Exercise 3.10 for special characteristics of $x[n]$.
- Issues that have been ignored:
  - $u[n]$ is modified by the oral cavity.
  - $x_q[n]$ can be influenced by the back cavity.
  - Sources of nonlinear effects (distributed sources due to traveling vortices).
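The simplified voiced-fricative model of Example 3.4 can be sketched in Python as follows. The pulse shape, the two impulse responses, the pitch period, and the Gaussian noise generator are hypothetical stand-ins chosen for illustration, and the function names are not from the slides.

```python
import random


def convolve(a, b):
    """Linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, period, n_samples, h, hf, q):
    """x[n] = h*u + hf*(q.u) with u = g*p; returns (x, u) truncated to n_samples."""
    p = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    u = convolve(g, p)[:n_samples]
    modulated_noise = [q[n] * u[n] for n in range(n_samples)]
    xg = convolve(h, u)[:n_samples]
    xq = convolve(hf, modulated_noise)[:n_samples]
    return [a + b for a, b in zip(xg, xq)], u


random.seed(0)
n_samples, period = 200, 40                     # assumed duration and pitch period
g = [0.3, 0.8, 1.0, 0.5]                        # hypothetical glottal pulse
h = [0.9 ** n for n in range(30)]               # hypothetical vocal tract response
hf = [1.0, -0.5]                                # hypothetical front-cavity response
q = [random.gauss(0.0, 1.0) for _ in range(n_samples)]   # white noise source
x, u = voiced_fricative(g, period, n_samples, h, hf, q)
x_periodic_only, _ = voiced_fricative(g, period, n_samples, h, hf, [0.0] * n_samples)
print(len(x), max(abs(a - b) for a, b in zip(x, x_periodic_only)))
```

Because the noise is multiplied by u[n], it is present only while the glottal flow is nonzero, so the noise component is pitch-synchronous in this model.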
Elements of a Language: Fricatives
- Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
- Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
- Whisper forms a class of its own under the general category of consonants.
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24: Fricatives
Figure 3.23
Elements of a Language: Plosives
- Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
- As with fricatives, plosives can be voiced or unvoiced.
- System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
- Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
- With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  - During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  - After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  - There is a much shorter delay between the burst and the voicing of the vowel onset.
- Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, waveform (figure)
Elements of a Language: Plosives, spectrogram (figure)
Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive
- A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is
$u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
- Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to a following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
- In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
- $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
- The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
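The generalized convolution of Example 3.5 can be sketched in Python by treating the time-varying impulse responses as functions of (n, m). Everything concrete here is a hypothetical simplification: the source is reduced to a bare impulse train (g[n] taken as a unit sample), the responses are simple exponentials, and the sums are truncated to causal lags 0..max_lag.

```python
def time_varying_output(u, h, hf, n_samples, max_lag):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) delta[n - m], for 0 <= m <= max_lag."""
    x = []
    for n in range(n_samples):
        periodic = sum(h(n, m) * u[n - m] for m in range(min(n, max_lag) + 1))
        # delta[n - m] is nonzero only for m = n, so the burst term reduces to hf(n, n).
        burst = hf(n, n) if n <= max_lag else 0.0
        x.append(periodic + burst)
    return x


n_samples, period, max_lag = 120, 30, 40        # assumed sizes
u = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]   # simplified periodic source


def h(n, m):
    """Hypothetical tract response whose decay slows as n moves from burst to vowel."""
    a = 0.5 + 0.4 * min(n, 100) / 100.0
    return a ** m


def hf(n, m):
    """Hypothetical short-lived front-cavity response to the burst."""
    return 0.6 ** m


x = time_varying_output(u, h, hf, n_samples, max_lag)
print([round(v, 3) for v in x[:5]])
```

When h(n, m) does not depend on n, the first sum reduces to an ordinary convolution, which is a convenient sanity check for any implementation of the time-varying form.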
Elements of a Language: Transitional Speech Sounds
Diphthongs
- Vowel-like nature, with vibrating vocal folds.
- Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
- Characterized by movement from one vowel target to another.
- Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels
- Two categories of vowel-like sounds:
  - Glides (/w/ as in "we" and /y/ as in "you"), and
  - Liquids (/r/ as in "read" and /l/ as in "let").
- Glides, compared to diphthongs, have:
  - Greater constriction of the oral tract during the transition, and
  - Greater speed of the oral tract movement.
Figure 3.28: Liquids and glides
Elements of a Language: Transitional Speech Sounds
Affricates
- Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
- The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
- Examples: /tS/ as in "chew" (unvoiced); /J/ as in "just" (voiced).
April 22 2023 Veton Keumlpuska 24
Anatomy and Physiology of Speech Production
April 22 2023 Veton Keumlpuska 25
Examples of atypical voice types
April 22 2023 Veton Keumlpuska 26
Vocal Tract Comprised of the oral cavity
From larynx To the lips including the nasal passage ndash coupled to the oral tract by way of the
velum Oral tract takes on many different lengths and cross-
sections This is accomplished by moving the articulators Tongue Teeth Lips Jaw
Average length for a adult male is 17 cm and cross sectional area of up to 20 cm2
Purpose of vocal tract is to Spectrally ldquocolorrdquo the source and Generate new sources for sound production
April 22 2023 Veton Keumlpuska 27
Spectral Shaping Under a certain conditions the relation
between a glottal airflow velocity input and vocal tract airflow velocity output can be approximated by a linear filter with resonances
Resonance frequencies of the vocal tract are called formant frequencies or simply formants
Formants (resonance frequencies) change with different vocal tract configurations as depicted in Figure 310
April 22 2023 Veton Keumlpuska 28
Figure 310
April 22 2023 Veton Keumlpuska 29
Spectral Shaping The peaks of the spectrum of the vocal tract response
correspond approximately to its formants For a time-invariant all-pole linear system model of vocal tract
with a pole at z0=r0ej0 that corresponds approximately to a vocal tract formant Frequency of the formant is 0 Bandwidth is dependent on the distance from the unit circle (r0) Because the vocal tract is assumed stable (with poles inside the
unit circle) its transfer function can be expressed either in product or partial fraction expansion form
i
i
N
k kk
k
N
kkk
zczcAzH
zczc
AzH
111
1
11
)1)(1()(
)1)(1()(
April 22 2023 Veton Keumlpuska 30
Spectral Shaping Formants of the vocal tract are numbered from the
low to high formants according to their location F1 F2 etc
In general the formant frequencies degrease as the vocal tract length increases Male speakers tend to have lower formants than a
female Female speakers have lower formants than children
Under a vocal-tractrsquos Linearity and time-invariance assumption and When the sound source occurs at the glottis Then
The speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and vocal tract impulse response
Vowels
April 22 2023 Veton Keumlpuska 31
April 22 2023 Veton Keumlpuska 32
Example 3.2
Consider a periodic glottal flow source of the form
$u[n] = g[n] * p[n]$
where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by
$x[n] = h[n] * (g[n] * p[n])$
A window $w[n, \tau]$, centered at time $\tau$, is applied to the vocal tract output to obtain the speech segment
$x[n, \tau] = w[n, \tau] \{ h[n] * (g[n] * p[n]) \}$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.
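Example 3.2 (cont.)
$$X(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$
$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k) W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consisted only of the glottal contribution).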
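The harmonic sampling in Example 3.2 can be checked numerically. The sketch below is an illustration under stated assumptions, not the slides' own code: the glottal pulse is a toy Hann-shaped pulse, the vocal tract is a single two-pole resonator, and the pitch period `P`, the pole position and the window length are arbitrary choices. It builds $x[n] = h[n] * (g[n] * p[n])$, applies a Hamming window, and compares the spectrum at the harmonics with the spectrum halfway between them.

```python
import numpy as np

def windowed_harmonic_spectrum(P=80, n_periods=40, cycles_in_window=4):
    """Magnitude spectrum of a Hamming-windowed segment of x[n] = h[n] * (g[n] * p[n])."""
    n_total = P * n_periods
    p = np.zeros(n_total)
    p[::P] = 1.0                               # unit sample train with spacing P
    g = np.hanning(25)                         # toy glottal pulse (assumed shape)
    r, theta = 0.97, 0.3 * np.pi               # assumed single-formant pole position
    n = np.arange(400)
    h = r ** n * np.sin((n + 1) * theta) / np.sin(theta)   # two-pole resonator impulse response
    u = np.convolve(g, p)[:n_total]            # periodic glottal flow u[n]
    x = np.convolve(h, u)[:n_total]            # vocal tract output x[n]
    n_w = cycles_in_window * P                 # window spans several pitch periods
    start = n_total // 2                       # well past the start-up transient
    segment = x[start:start + n_w] * np.hamming(n_w)
    n_fft = 16 * P                             # harmonic k then falls on bin 16 * k
    return np.abs(np.fft.rfft(segment, n_fft)), n_fft

if __name__ == "__main__":
    mag, n_fft = windowed_harmonic_spectrum()
    step = n_fft // 80
    for k in range(1, 5):
        print(k, mag[k * step], mag[k * step + step // 2])
```

With these settings the energy concentrates at multiples of $2\pi/P$, and the height of each harmonic peak follows the envelope $|H(\omega)G(\omega)|$ of the assumed pulse and resonator.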
Example 3.2 (cont.): Figure 3.11
Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  The manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ sparsely. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the sparse sampling of the spectral envelope by the harmonics can be detrimental to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
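The effect in Example 3.3 can be illustrated with a rough numerical sketch. Everything numeric here is an assumption chosen for illustration: a single two-pole formant, a sung fundamental of 700 Hz, formant values of 400 Hz and 700 Hz, an 80 Hz bandwidth, a 16 kHz sampling rate, and the common approximation $r = e^{-\pi B / f_s}$ for the pole radius. The function name is hypothetical.

```python
import numpy as np

def formant_gain_db(f_hz, formant_hz, bandwidth_hz=80.0, fs=16000.0):
    """Gain (dB) at f_hz of a two-pole resonator centered near formant_hz."""
    r = np.exp(-np.pi * bandwidth_hz / fs)     # assumed radius-from-bandwidth approximation
    theta = 2.0 * np.pi * formant_hz / fs      # pole angle
    z_inv = np.exp(-1j * 2.0 * np.pi * f_hz / fs)
    c = r * np.exp(1j * theta)
    h = 1.0 / ((1.0 - c * z_inv) * (1.0 - np.conj(c) * z_inv))
    return 20.0 * np.log10(np.abs(h))

if __name__ == "__main__":
    f0 = 700.0                                  # assumed sung fundamental
    low_f1 = formant_gain_db(f0, formant_hz=400.0)
    matched_f1 = formant_gain_db(f0, formant_hz=700.0)
    print(f"gain at f0 with F1=400 Hz: {low_f1:.1f} dB")
    print(f"gain at f0 with F1=700 Hz: {matched_f1:.1f} dB")
```

Under these assumptions, raising F1 to coincide with the first harmonic boosts that harmonic by well over 6 dB relative to the unmatched case, which is the tuning strategy the example describes.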
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered (introducing an opening into the nasal passage) and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large-volume nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  Build-up of pressure behind the palate, and
  Abrupt release of the pressure.
Source Generation: Plosives, "Drop" (figure labels: VOT, Aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow), generating turbulence and thus a noise source (e.g., fricatives).
As with a periodic glottal source, spectral shaping also occurs for either of these input types (i.e., an impulse or a noise source):
  There is no harmonic structure with these types of input.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the spectrum of the noise source was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries, such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives - sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives - created with an impulsive source within the oral tract. Example: "top".
  Whispers - a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
However, these source types are not exclusive to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples:
  "zebra" vs. "sheba" - fricatives
  "bin" vs. "pin" - plosives
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example, the Hamming window: $w[n, \tau] = 0.54 - 0.46 \cos[2\pi(n - \tau)/(N_w - 1)]$ for $0 \le n \le N_w - 1$.
The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] e^{-j\omega n}$$
where
$x[n, \tau] = w[n, \tau] x[n]$
represents the windowed speech segments as a function of the window center at time $\tau$.
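A minimal sketch of the windowing and short-time transform described above, assuming NumPy. The function names, the hop size and the FFT length are illustrative choices, not part of the slides. Each frame's transform is computed with its time origin at the start of the window, so it differs from $X(\omega, \tau)$ as defined above only by a linear phase term; the magnitude is unaffected.

```python
import numpy as np

def hamming_window(n_w):
    """Hamming window 0.54 - 0.46 cos(2 pi n / (Nw - 1)) for 0 <= n <= Nw - 1."""
    n = np.arange(n_w)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (n_w - 1))

def stft(x, n_w, hop, n_fft):
    """Short-time Fourier transform: one FFT per window position (rows are frames)."""
    w = hamming_window(n_w)
    starts = range(0, len(x) - n_w + 1, hop)   # window positions, spaced by the frame interval
    frames = np.stack([x[s:s + n_w] * w for s in starts])
    return np.fft.rfft(frames, n=n_fft, axis=1)

if __name__ == "__main__":
    n = np.arange(256)
    x = np.cos(2.0 * np.pi * 8 * n / 64)       # test tone on FFT bin 8 when n_fft = 64
    X = stft(x, n_w=64, hop=32, n_fft=64)
    print(X.shape, np.argmax(np.abs(X), axis=1))
```

The hop size plays the role of the frame interval: a hop of one sample gives a transform at every time index, while larger hops trade time sampling density for less computation.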
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position $\tau$, one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband - gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband - gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$, driven by a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \}$
$x[n, \tau] = w[n, \tau] ( p[n] * \hat{h}[n] )$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k) W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2 |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech
Wideband Spectrogram
  The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta |\hat{H}(\omega)|^2 E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
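The choice between the 20-ms and 4-ms windows can be reasoned about with simple arithmetic. The sketch below uses a rule of thumb, not a result stated in the slides: the Hamming window's main lobe is taken to be about $8\pi/N_w$ radians wide (that is, $4 f_s / N_w$ Hz), and harmonics are treated as resolved when half of that width does not exceed the pitch frequency, which corresponds to a window of at least two pitch periods. The sampling rate and pitch value are assumptions.

```python
def hamming_mainlobe_hz(window_ms, fs):
    """Approximate full main-lobe width (Hz) of a Hamming window of the given duration."""
    n_w = int(round(window_ms * 1e-3 * fs))    # window length in samples
    return 4.0 * fs / n_w                      # 8*pi/Nw rad/sample expressed in Hz

def resolves_harmonics(window_ms, f0_hz, fs):
    """Rule of thumb: harmonics are resolved if the main-lobe half-width is at most f0."""
    return hamming_mainlobe_hz(window_ms, fs) / 2.0 <= f0_hz

if __name__ == "__main__":
    fs, f0 = 16000.0, 125.0                    # assumed sampling rate and pitch
    for window_ms in (20.0, 4.0):
        width = hamming_mainlobe_hz(window_ms, fs)
        print(f"{window_ms:4.0f} ms window: main lobe ~{width:.0f} Hz, "
              f"resolves {f0:.0f} Hz harmonics: {resolves_harmonics(window_ms, f0, fs)}")
```

Under these assumptions the 20-ms window has a main lobe of roughly 200 Hz and separates harmonics spaced 125 Hz apart, while the 4-ms window has a main lobe of roughly 1000 Hz and smears them into the spectral envelope.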
Figure 3.15
Figure 3.16
MATLAB: spectrogram (Time [s], 0 to 3, vs. Frequency [Hz], 0 to 8000)
MATLAB: waveform (normalized magnitude) and spectrogram of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   The place of the tongue hump along the oral tract, and
   The degree of constriction of the hump.
   The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme - a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two perspectives (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of:
  Vowels
  Consonants
  Diphthongs
  Affricates
  Semi-vowels
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking
  The time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception - uses articulatory features from the speech waveform, and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, vary with the speaker; therefore the vocal tract formants also vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Is dominated by the low resonance of the large-volume nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System:
  The location of the constriction made by the tongue or lips determines which sound is produced: at the back, center, or front of the oral tract, as well as at the teeth or lips.
Spectrogram:
  Noise-like.
  Energy is concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$u[n] = g[n] * p[n]$
  $g[n]$ is the glottal flow over one cycle.
  $p[n]$ is an impulse train with pitch period $P$.
Simplified model of the voiced fricative output at the lips due to the periodic source:
$x_g[n] = h[n] * (g[n] * p[n])$
  $h[n]$ is the impulse response of the linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is modeled by the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the modulated noise in turn excites the front oral cavity, which has impulse response $h_f[n]$:
$x_q[n] = h_f[n] * (q[n] u[n])$
Example 3.4 (cont.)
In the simplified model, we assume that the results of the two airflow sources add:
$x[n] = x_g[n] + x_q[n]$
$x[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
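The additive model of Example 3.4 can be sketched directly from its two terms. The code below is an illustration under stated assumptions: the glottal pulse, the two impulse responses and the pitch period are toy values, the noise $q[n]$ is Gaussian white noise from a seeded generator, and the function name and `noise_gain` parameter are hypothetical.

```python
import numpy as np

def voiced_fricative(g, P, h, h_f, n_samples, noise_gain=1.0, seed=0):
    """Return (x, x_g, x_q) for x[n] = h[n]*u[n] + h_f[n]*(q[n] u[n]), with u[n] = g[n]*p[n]."""
    rng = np.random.default_rng(seed)
    p = np.zeros(n_samples)
    p[::P] = 1.0                                  # impulse train with pitch period P
    u = np.convolve(g, p)[:n_samples]             # periodic glottal flow
    q = noise_gain * rng.standard_normal(n_samples)   # white noise at the constriction
    x_g = np.convolve(h, u)[:n_samples]           # periodic component through the vocal tract
    x_q = np.convolve(h_f, q * u)[:n_samples]     # glottal-modulated noise through the front cavity
    return x_g + x_q, x_g, x_q

if __name__ == "__main__":
    g = np.hanning(25)                            # toy glottal pulse (assumed)
    h = 0.95 ** np.arange(100)                    # toy vocal tract response (assumed)
    h_f = (-0.7) ** np.arange(50)                 # toy high-pass front-cavity response (assumed)
    x, x_g, x_q = voiced_fricative(g, P=80, h=h, h_f=h_f, n_samples=1600)
    print(x.shape, float(np.sum(x_g ** 2)), float(np.sum(x_q ** 2)))
```

Because the noise is multiplied by $u[n]$ before filtering, the noise component is nonzero only where the glottal flow is nonzero, so it pulses at the pitch rate in this sketch.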
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of the air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to the propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives - Waveform
Elements of a Language: Plosives - Spectrogram
Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$u[n] = g[n] * p[n]$
  $g[n]$ is the glottal flow over one cycle.
  $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to the following steady vowel.
  This implies that the vocal tract output cannot be obtained by the convolution operator.
  The vocal tract output must therefore be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response denoted by $h_f[n, m]$.
  $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m] u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \delta[n - m]$$
We have assumed that the two outputs can be linearly combined.
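The generalized convolution in Example 3.5 can be written as a direct double loop. This is a minimal sketch, assuming causal inputs and a finite memory of `M` samples, with the time-varying response stored as a 2-D array `h[n, m]`; the array layout and function name are illustrative choices rather than anything specified in the slides.

```python
import numpy as np

def time_varying_output(h, u):
    """Compute x[n] = sum_m h[n, m] * u[n - m] for a response array h of shape (N, M)."""
    n_out, memory = h.shape
    x = np.zeros(n_out)
    for n in range(n_out):
        for m in range(min(memory, n + 1)):    # causal input: u[n - m] is zero for n - m < 0
            x[n] += h[n, m] * u[n - m]
    return x

if __name__ == "__main__":
    N, M = 200, 30
    u = np.zeros(N)
    u[::40] = 1.0                              # toy periodic excitation (assumed)
    decay = np.linspace(0.5, 0.9, N)           # pole radius drifting over time (assumed)
    h = decay[:, None] ** np.arange(M)[None, :]
    delta = np.zeros(N)
    delta[0] = 1.0                             # idealized burst at n = 0
    h_f = 0.6 ** np.arange(M)[None, :] * np.ones((N, 1))
    x = time_varying_output(h, u) + time_varying_output(h_f, delta)
    print(x[:5])
```

When every row of `h` is the same, the loop reduces to ordinary convolution, which gives a simple check of the implementation; for the impulse input the output is just `h_f[n, n]` while `n` is within the memory length.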
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
Two categories of vowel-like sounds:
  Glides (/w/ as in "we" and /y/ as in "you"), and
  Liquids (/r/ as in "read" and /l/ as in "let").
Compared to diphthongs, glides have:
  Greater constriction of the oral tract during the transition, and
  Greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that an affricate has a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
  /tS/ as in "chew" - unvoiced
  /J/ as in "just" - voiced
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  Local - the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global - the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different:
  Meaning
  Stress
  Emotion
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Spectral Shaping Under a certain conditions the relation
between a glottal airflow velocity input and vocal tract airflow velocity output can be approximated by a linear filter with resonances
Resonance frequencies of the vocal tract are called formant frequencies or simply formants
Formants (resonance frequencies) change with different vocal tract configurations as depicted in Figure 310
April 22 2023 Veton Keumlpuska 28
Figure 310
April 22 2023 Veton Keumlpuska 29
Spectral Shaping The peaks of the spectrum of the vocal tract response
correspond approximately to its formants For a time-invariant all-pole linear system model of vocal tract
with a pole at z0=r0ej0 that corresponds approximately to a vocal tract formant Frequency of the formant is 0 Bandwidth is dependent on the distance from the unit circle (r0) Because the vocal tract is assumed stable (with poles inside the
unit circle) its transfer function can be expressed either in product or partial fraction expansion form
i
i
N
k kk
k
N
kkk
zczcAzH
zczc
AzH
111
1
11
)1)(1()(
)1)(1()(
April 22 2023 Veton Keumlpuska 30
Spectral Shaping Formants of the vocal tract are numbered from the
low to high formants according to their location F1 F2 etc
In general the formant frequencies degrease as the vocal tract length increases Male speakers tend to have lower formants than a
female Female speakers have lower formants than children
Under a vocal-tractrsquos Linearity and time-invariance assumption and When the sound source occurs at the glottis Then
The speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and vocal tract impulse response
Vowels
April 22 2023 Veton Keumlpuska 31
April 22 2023 Veton Keumlpuska 32
Example 32 Consider a periodic glottal flow source of the form
u[n]=g[n]p[n]
Where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n] the vocal tract output is given by
x[n]=h[n](g[n]p[n])
A window center at time w[n] is applied to the vocal tract output to obtain the speech segment
x[n]=w[n]h[n](g[n]p[n])
Using Multiplication and Convolution Theorems Fourier transform of the speech segment representing frequency domain representation is obtained
April 22 2023 Veton Keumlpuska 33
Example 32
Where W() is the Fourier transform of w[n] and k=(2P)k and (2P) is fundamental frequency or pitch Figure 311 (next slide) illustrates that the spectral shaping of the
windowed transform at the harmonics 1 2 hellip N is determined by the spectral envelope |H()G()| - consisting of Glottal and Vocal tract contributions
(unlike example 31 consisting only of glottal contribution)
kkkk
kk
WGHP
X
GHWP
X
)()()(1)(
)()()()(1)(
April 22 2023 Veton Keumlpuska 34
Example 32
April 22 2023 Veton Keumlpuska 35
Example 32 The general upward or downward slope of the spectral
envelope also called spectral tilt is influenced by The nature of the glottal flow waveform over a cycle
eg a gradual or abrupt closing and by The manner in which formant tails add
Note also from the figure 311 that the formant locations are not always clear from the short-time Fourier transform magnitude |X()| because of sparse sampling of the spectral envelope |H()G()| by the source harmonics This is especially the case for high pitched speech
April 22 2023 Veton Keumlpuska 36
Spectral Shaping Previous example is important because
It illustrates the difference between Formant (resonance frequency of vocal tract) and Harmonic frequency
A formant corresponds to the vocal tract pole (resonant frequency)
Harmonics arise due to the periodicity of glottal source (pitch)
In developing signal processing algorithms that require formants the scarcity of spectral information can perhaps be detriment to formant estimation
On the other hand the spectral sampling harmonics can be exploited to enhance perception of sound (as in singing voice)
April 22 2023 Veton Keumlpuska 37
Example 33 A soprano singer often signs a tone whose first
harmonic (fundamental frequency) (1) much higher than the first formant frequency (F1) of the vowel being sung As shown in the next figure (Figure 312) when the nulls of the vocal tract spectrum are sampled at the harmonics the resulting sound is weak especially in the face of competing instruments
To enhance the sound the singer creates a vocal tract configuration with a widened jaw which increases the first formant frequency (Exercise 34) and can match the frequency of the first harmonic thus generating a louder sound
April 22 2023 Veton Keumlpuska 38
Figure 312
Nasal Sounds
April 22 2023 Veton Keumlpuska 40
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large-volume nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
In the previous section, the effect of vocal tract shape on sound production was discussed.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  Build-up of pressure behind the palate, and
  Abrupt release of pressure.
Source Generation: Plosives, "Drop" (figure annotations: VOT, i.e. voice onset time, and aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (without closing the tract completely), generating turbulence and thus a noise source (e.g. fricatives).
As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e. impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise source was idealized as having a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these subcategories are not exclusive to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
  Fricatives: "zebra" (voiced) vs. "sheba" (unvoiced)
  Plosives: "bin" (voiced) vs. "pin" (unvoiced)
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example (Hamming window): $w[n, \tau] = 0.54 - 0.46\cos[2\pi(n - \tau)/(N_w - 1)]$ for $0 \le n - \tau \le N_w - 1$, and zero otherwise.
The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window position $\tau$.
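The sliding-window analysis described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the course's MATLAB code: the function names, the 8 kHz sampling rate, the 64-sample window, the hop of 32 samples, and the 1 kHz test tone are all assumptions made for the example.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w (tapered toward both ends)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft(x, n_w, hop, n_fft):
    """Return one spectrum (bins 0..n_fft/2) per window position tau = 0, hop, 2*hop, ..."""
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        segment = [w[n] * x[tau + n] for n in range(n_w)]  # windowed segment
        # Time is measured from the window start here, which changes only the
        # phase of each bin relative to an absolute-time definition, not |X|.
        frames.append([
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_w))
            for k in range(n_fft // 2 + 1)
        ])
    return frames

fs = 8000
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]  # 1 kHz test tone
frames = stft(x, n_w=64, hop=32, n_fft=64)
peak_bins = [max(range(len(f)), key=lambda k: abs(f[k])) for f in frames]
```

The squared magnitude of each entry in `frames` gives one column of a spectrogram. With these settings the bin spacing is 125 Hz, so the test tone should peak in bin 8 of every frame.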
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$.
  A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g. a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g. a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \tilde{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response have been combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau)\right|^2$$
where $\tilde{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $2\pi/P$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation in the previous slide leads to the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms of neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 in the next slide compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
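The contrast between the two window lengths can be reproduced on an idealized source. The sketch below is an illustration under stated assumptions, not the analysis behind Figure 3.15: the signal is a bare impulse train (no vocal tract filtering), the sampling rate is 8 kHz, and the pitch period is 5 ms, so that the 20-ms window spans four pitch periods and the 4-ms window less than one. All function names are hypothetical.

```python
import cmath
import math

FS = 8000     # sampling rate in Hz (assumed)
PERIOD = 40   # pitch period in samples, i.e. a 200 Hz fundamental (assumed)

# Idealized voiced source: one unit impulse per pitch period.
x = [1.0 if n % PERIOD == 0 else 0.0 for n in range(800)]

def hamming(n_w):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def frame_magnitude(start, n_w, freq_hz):
    """|X| of the Hamming-windowed frame x[start : start + n_w] at one frequency."""
    w = hamming(n_w)
    omega = 2 * math.pi * freq_hz / FS
    return abs(sum(w[n] * x[start + n] * cmath.exp(-1j * omega * n) for n in range(n_w)))

def harmonic_dip(start, n_w):
    """1 - |X(between harmonics)| / |X(on a harmonic)|; near 1 when harmonics are resolved."""
    on_harmonic = frame_magnitude(start, n_w, 1000.0)  # 5th harmonic of 200 Hz
    between = frame_magnitude(start, n_w, 1100.0)      # halfway to the 6th harmonic
    return 1.0 - between / on_harmonic

def frame_energy(start, n_w):
    """Energy of the windowed frame, the E[tau] of the wideband approximation."""
    w = hamming(n_w)
    return sum((w[n] * x[start + n]) ** 2 for n in range(n_w))

narrow_dip = harmonic_dip(start=100, n_w=160)  # 20 ms window: four pitch periods
wide_dip = harmonic_dip(start=184, n_w=32)     # 4 ms window: less than one pitch period
wide_energy = [frame_energy(start, 32) for start in (184, 201)]  # pulse inside / no pulse
```

The long window produces a deep dip between harmonics (horizontal striations), while the short window sees at most one pulse, giving a flat spectrum whose energy switches on and off as the window slides (vertical striations).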
Figure 3.15
Figure 3.16
[Figure (MATLAB): spectrogram; frequency axis 0 to 8000 Hz, time axis 0 to 3 s]
[Figure (MATLAB): normalized-magnitude waveform and spectrogram (0 to 8000 Hz, 0 to 3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: h; she (sh iy); had (hv ae dcl jh); your (ax-h); dark (dcl d aa r kcl k); suit (s ux tcl); in (ax-h n); greasy (gcl g r iy s ix); wash (w aa sh); water (w ao dx axr); all (ao l); year (y ih axr); h]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   The place of the tongue hump along the oral tract, and
   The degree of constriction of the hump.
   The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 in the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; in Mandarin Chinese, however, some sounds are interpreted based on pitch (tonal languages).
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences between speakers are one cause of allophonic variation.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  The vocal folds are relaxed and not vibrating for unvoiced fricatives.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System:
  The location of the constriction, made by the tongue or lips, determines which sound is produced: back, center, or front of the oral tract, as well as at the teeth or lips.
Spectrogram:
  Noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
  g[n] is the glottal flow over one cycle
  p[n] is an impulse train with pitch period P
A simplified model of the periodic component of the voiced fricative output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
  h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n]
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity, with impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
Example 3.4 (cont.)
In this simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
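The simplified voiced-fricative model of Example 3.4 can be simulated directly. The sketch below is only an illustration: the glottal pulse shape, the two impulse responses, the pitch period, and the helper names are hypothetical toy values, and Gaussian noise from a seeded generator stands in for the white-noise source.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out

def voiced_fricative(g, period, n_periods, h, h_f, seed=0):
    """Return (u, modulated_noise, x) for x = h * u + h_f * (q u)."""
    p = [1.0 if n % period == 0 else 0.0 for n in range(period * n_periods)]
    u = convolve(g, p)[:len(p)]                  # periodic glottal flow: g convolved with p
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in u]         # white-noise turbulence source
    modulated = [q_n * u_n for q_n, u_n in zip(q, u)]  # noise gated by the glottal flow
    x_g = convolve(h, u)                         # periodic component at the lips
    x_q = convolve(h_f, modulated)               # noise component from the front cavity
    length = max(len(x_g), len(x_q))
    x_g += [0.0] * (length - len(x_g))
    x_q += [0.0] * (length - len(x_q))
    return u, modulated, [a + b for a, b in zip(x_g, x_q)]

# Hypothetical toy values, chosen only to make the sketch runnable.
g = [0.0, 0.5, 1.0, 0.5]            # glottal pulse; the folds are closed for the rest of the period
h = [0.9 ** n for n in range(20)]   # decaying vocal tract response
h_f = [1.0, -0.9]                   # crude high-pass front-cavity response
u, modulated, x = voiced_fricative(g, period=10, n_periods=5, h=h, h_f=h_f)
```

Because the noise is multiplied by u[n], it vanishes whenever the glottal flow is zero, so the noise component is pulsed at the pitch rate in this model.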
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (waveform)
Elements of a Language: Plosives (spectrogram)
Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
  g[n] is the glottal flow over one cycle
  p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to a following steady vowel.
  This implies that the vocal tract output cannot be obtained with the convolution operator.
  The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract, with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.
  $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
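The generalized convolution of Example 3.5 differs from ordinary convolution only in that the impulse response is looked up afresh at every output time n. A minimal sketch follows, assuming causal responses truncated to a finite number of taps; the two one-pole responses, the impulse-train stand-in for the glottal flow, and all function names are hypothetical choices made for illustration.

```python
def time_varying_output(h, source, n_taps):
    """x[n] = sum over m of h(n, m) * source[n - m], for m = 0 .. n_taps - 1."""
    return [
        sum(h(n, m) * source[n - m] for m in range(n_taps) if n - m >= 0)
        for n in range(len(source))
    ]

def plosive_output(h, h_f, u, n_taps):
    """Sum of the glottal-flow and burst contributions; the burst is a unit sample at n = 0."""
    burst = [1.0] + [0.0] * (len(u) - 1)
    voiced = time_varying_output(h, u, n_taps)
    released = time_varying_output(h_f, burst, n_taps)
    return [a + b for a, b in zip(voiced, released)]

# Hypothetical one-pole responses: the vocal tract pole drifts with time n,
# while the front-cavity response is kept fixed for simplicity.
def h(n, m):
    return (0.5 + 0.01 * n) ** m

def h_f(n, m):
    return (-0.5) ** m

u = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]  # impulse-train stand-in for the glottal flow
x = plosive_output(h, h_f, u, n_taps=16)
```

When the response does not depend on n, `time_varying_output` reduces to ordinary (truncated) convolution, which is a convenient sanity check on the indexing.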
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (/w/ as in "we" and /y/ as in "you"), and
  Liquids (/r/ as in "read" and /l/ as in "let").
Glides, compared to diphthongs, have
  Greater constriction of the oral tract during the transition, and
  Greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced); /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Vocal Tract
Comprises the oral cavity from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
  The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
The purpose of the vocal tract is to
  Spectrally "color" the source, and
  Generate new sources for sound production.
Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
  The frequency of the formant is $\omega_0$.
  The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^* z^{-1})}$$
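The link between pole location and formant can be made concrete with the product form. The sketch below is an illustration only: the sampling rate, the three formant/bandwidth pairs, and the helper names are assumptions, and the radius-to-bandwidth mapping is a commonly used approximation rather than something stated in the slides.

```python
import cmath
import math

FS = 8000.0  # sampling rate in Hz (assumed)

def pole_from_formant(freq_hz, bandwidth_hz):
    """Pole c = r * exp(j*w0); r = exp(-pi*B/FS) is a common bandwidth approximation."""
    return math.exp(-math.pi * bandwidth_hz / FS) * cmath.exp(2j * math.pi * freq_hz / FS)

def formant_from_pole(c):
    """Invert pole_from_formant: return (frequency in Hz, bandwidth in Hz)."""
    return cmath.phase(c) * FS / (2 * math.pi), -math.log(abs(c)) * FS / math.pi

def magnitude(poles, freq_hz, gain=1.0):
    """|H| in product form: gain divided by the product of the pole-pair factors."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / FS)
    denom = 1.0
    for c in poles:
        denom *= abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))
    return gain / denom

# Hypothetical formants (Hz) and bandwidths (Hz) for a vowel-like response.
formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]
poles = [pole_from_formant(f, b) for f, b in formants]

freqs = [10.0 * i for i in range(1, 400)]  # 10 Hz grid below the Nyquist frequency
mags = [magnitude(poles, f) for f in freqs]
peaks = [freqs[i] for i in range(1, len(freqs) - 1)
         if mags[i] > mags[i - 1] and mags[i] > mags[i + 1]]
```

The spectral peaks found on the grid land close to, but not necessarily exactly at, the pole angles, which is the sense in which the slide says peaks correspond "approximately" to formants.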
Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases.
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.
Under the assumption of vocal tract linearity and time-invariance, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window $w[n, \tau]$ positioned at time $\tau$ is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.
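Example 3.2 (cont.)
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega)G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$
$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $\circledast$ denotes convolution over frequency, $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = \frac{2\pi}{P}k$, and $2\pi/P$ is the fundamental frequency (pitch).
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which had only a glottal contribution).
Example 3.2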
Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  The nature of the glottal flow waveform over a cycle (e.g. a gradual or abrupt closing), and
  The manner in which the formant tails add.
Note also from the figure 311 that the formant locations are not always clear from the short-time Fourier transform magnitude |X()| because of sparse sampling of the spectral envelope |H()G()| by the source harmonics This is especially the case for high pitched speech
April 22 2023 Veton Keumlpuska 36
Spectral Shaping Previous example is important because
It illustrates the difference between Formant (resonance frequency of vocal tract) and Harmonic frequency
A formant corresponds to the vocal tract pole (resonant frequency)
Harmonics arise due to the periodicity of glottal source (pitch)
In developing signal processing algorithms that require formants the scarcity of spectral information can perhaps be detriment to formant estimation
On the other hand the spectral sampling harmonics can be exploited to enhance perception of sound (as in singing voice)
April 22 2023 Veton Keumlpuska 37
Example 33 A soprano singer often signs a tone whose first
harmonic (fundamental frequency) (1) much higher than the first formant frequency (F1) of the vowel being sung As shown in the next figure (Figure 312) when the nulls of the vocal tract spectrum are sampled at the harmonics the resulting sound is weak especially in the face of competing instruments
To enhance the sound the singer creates a vocal tract configuration with a widened jaw which increases the first formant frequency (Exercise 34) and can match the frequency of the first harmonic thus generating a louder sound
April 22 2023 Veton Keumlpuska 38
Figure 312
Nasal Sounds
April 22 2023 Veton Keumlpuska 40
Spectral Shaping Nasal and oral components of the vocal tract are coupled
by the velum When the vocal tract velum is lowered ndash introducing
an opening into the nasal passage and Oral tract is shut off by the tongue or lipsSound propagates through the nasal passage and out
through the nose
The resulting sounds have a spectrum that is dominated by low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds Examples ldquonoserdquo and ldquomouserdquo
April 22 2023 Veton Keumlpuska 41
Spectral Shaping Nose
April 22 2023 Veton Keumlpuska 42
Spectral Shaping Mouse
April 22 2023 Veton Keumlpuska 43
Spectral Shaping Because the nasal cavity (unlike the oral tract) is
essentially constant characteristics of nasal sounds may be particularly useful in speaker identification
Velum can be lowered even when the vocal tract is open When this coupling occurs the resulting sound is said to be
nasalized (eg nasalized vowel) There are two dominant effects that characterize
nasalization Broadening of the formant bandwidth of oral tract because
of loss of energy through nasal passage Introduction of anti-resonances (ie zeros in the vocal tract
transfer function) due to the absorption of energy at the resonances of the nasal passage
Plosives
April 22 2023 Veton Keumlpuska 45
Source Generation In previous section the effect of vocal tract
shape in the sound production was discussed
In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure
April 22 2023 Veton Keumlpuska 46
Source Generation Plosives ldquoDroprdquo
VOT
Aspiration
Fricatives
April 22 2023 Veton Keumlpuska 48
Source Generation Another sound source is created when the tongue is
very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)
As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of
inputs The source spectrum is shaped at all frequencies by |H()|
Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape
April 22 2023 Veton Keumlpuska 49
Source Generation Fricatives ldquoNASArdquo
April 22 2023 Veton Keumlpuska 50
Source Generation There is another class of the source type that is
generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices
with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract
Vortex can be thought off as a tiny rotational airflow in the oral tract
There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds
April 22 2023 Veton Keumlpuska 51
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these subcategories do not relate exclusively to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the same subcategories also exist for voiced sounds. Examples:
  Fricatives: "zebra" (voiced) vs. "sheba" (unvoiced).
  Plosives: "bin" (voiced) vs. "pin" (unvoiced).
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to spectral characteristics that fluctuate strongly over time.
Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window w[n, τ] is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: Hamming window
$$w[n,\tau] = 0.54 - 0.46\cos\!\left[\frac{2\pi(n-\tau)}{N_w-1}\right], \qquad 0 \le n-\tau \le N_w-1$$
The window typically does not move one sample at a time but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\,e^{-j\omega n}$$
where
$$x[n,\tau] = w[n,\tau]\,x[n]$$
represents the windowed speech segments as a function of the window center at time τ.
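The windowing and transform steps can be sketched in Python with NumPy. This is a minimal illustration rather than code from the slides: the function names `hamming_window` and `stft`, the 8 kHz sampling rate, the 1 kHz test tone, and the frame parameters are all assumptions chosen for the example.

```python
import numpy as np

def hamming_window(n_w):
    """Hamming window of length n_w: 0.54 - 0.46*cos(2*pi*n/(n_w - 1))."""
    n = np.arange(n_w)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (n_w - 1))

def stft(x, n_w, hop, n_fft):
    """Short-time Fourier transform on a grid of window positions.

    Row i holds the DFT of the windowed segment that starts at tau = i*hop.
    The DFT uses a time index local to the segment, so each row differs
    from X(omega, tau) only by a linear phase; the magnitude is unaffected.
    """
    w = hamming_window(n_w)
    starts = range(0, len(x) - n_w + 1, hop)
    segments = np.array([w * x[s:s + n_w] for s in starts])
    return np.fft.rfft(segments, n=n_fft, axis=1)

fs = 8000                                  # assumed sampling rate in Hz
t = np.arange(fs) / fs                     # one second of samples
x = np.sin(2 * np.pi * 1000 * t)           # assumed 1 kHz test tone

X = stft(x, n_w=160, hop=80, n_fft=256)    # 20 ms window, 10 ms frame interval
S = np.abs(X) ** 2                         # squared magnitude per frame
peak_hz = np.argmax(S[0]) * fs / 256       # frequency of the strongest bin
print(X.shape, peak_hz)
```

Here `hop` plays the role of the frame interval: a smaller value gives a denser sampling of the window position τ at the cost of more transforms.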
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega,\tau) = |X(\omega,\tau)|^2$$
S(ω, τ) is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position τ one could plot S(ω, τ).
A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech the waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n]:
$$p[n] = \sum_{k=-\infty}^{\infty}\delta[n-kP]$$
$$x[n,\tau] = w[n,\tau]\,\{(p[n]*g[n])*h[n]\} = w[n,\tau]\,(p[n]*\hat{h}[n])$$
where the glottal waveform over one cycle and the vocal tract impulse response have been combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty}\hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2$$
where
$$\hat{H}(\omega) = H(\omega)\,G(\omega)$$
and where ω_k = (2π/P)k and 2π/P is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n, τ].
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
  the main lobes of the shifted window Fourier transforms do not overlap, and
  the corresponding transform side lobes are negligible,
the following approximation follows from the equation on the previous slide (Exercise 3.8):
$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty}|\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k,\tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms of neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window essentially "sees" pieces of the periodically occurring sequence ĥ[n].
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega,\tau) \approx \beta\,|\hat{H}(\omega)|^2\,E[\tau]$$
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty}|x[n,\tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through regions of fluctuating energy in the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
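The contrast between the two window lengths can be reproduced on a synthetic voiced signal. The sketch below is illustrative only: the 8 kHz sampling rate, the 200 Hz pitch, and the decaying 1 kHz resonance standing in for ĥ[n] are assumptions, and the helper names are hypothetical. It uses the 20-ms and 4-ms Hamming windows mentioned for Figure 3.15.

```python
import numpy as np

def segments(x, n_w, hop):
    """Hamming-windowed segments for window positions 0, hop, 2*hop, ..."""
    w = np.hamming(n_w)
    starts = range(0, len(x) - n_w + 1, hop)
    return np.array([w * x[s:s + n_w] for s in starts])

def spectrogram(x, n_w, hop, n_fft):
    """Squared STFT magnitude, one row per window position."""
    return np.abs(np.fft.rfft(segments(x, n_w, hop), n=n_fft, axis=1)) ** 2

def frame_energy(x, n_w, hop):
    """Energy of the waveform under the sliding window."""
    return np.sum(segments(x, n_w, hop) ** 2, axis=1)

fs = 8000        # assumed sampling rate in Hz
P = 40           # assumed pitch period: 5 ms, i.e. 200 Hz
n_fft = 800      # 10 Hz per DFT bin

# Periodic voiced signal: impulse train convolved with a hypothetical
# decaying 1 kHz resonance that stands in for h_hat[n] = g[n] * h[n].
p = np.zeros(4000)
p[::P] = 1.0
m = np.arange(P)
h_hat = np.exp(-m / 10.0) * np.cos(2 * np.pi * 1000 * m / fs)
x = np.convolve(p, h_hat)[:len(p)]

S_nb = spectrogram(x, n_w=160, hop=8, n_fft=n_fft)   # 20 ms: narrowband
S_wb = spectrogram(x, n_w=32, hop=8, n_fft=n_fft)    # 4 ms: wideband
E_nb = frame_energy(x, n_w=160, hop=8)
E_wb = frame_energy(x, n_w=32, hop=8)

# Narrowband: compare a harmonic (1000 Hz) with the gap between harmonics (1100 Hz).
nb_harmonic_to_gap = S_nb[10, 100] / S_nb[10, 110]
# Wideband: the same comparison in the highest-energy frame.
i_max = int(np.argmax(E_wb))
wb_harmonic_to_gap = S_wb[i_max, 100] / S_wb[i_max, 110]
# Frame-to-frame energy swing: large for the short window, small for the long one.
wb_energy_swing = E_wb.max() / E_wb.min()
nb_energy_swing = E_nb.max() / E_nb.min()
print(nb_harmonic_to_gap, wb_harmonic_to_gap, wb_energy_swing, nb_energy_swing)
```

With the long window the harmonic stands far above the gap (horizontal striations), while the frame energy hardly changes. With the short window the spectrum is smooth across harmonics, but the energy rises and falls once per pitch period (vertical striations).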
Figure 3.15
Figure 3.16
MATLAB
[Figure: waveform (normalized magnitude vs. time [s]) and spectrogram (frequency [Hz], 0 to 8000 Hz, vs. time [s], 0 to 3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels.]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open.
  Tongue position and height: front, central, or back along the palate.
  Constriction: partial or complete.
  Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features yield 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
Pitch: not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
Articulatory differences between speakers are one cause of allophonic variations:
  the place and degree of constriction of the tongue hump, and
  the cross-section and length of the vocal tract.
The vocal tract formants therefore vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and are therefore anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives the vocal folds are relaxed and not vibrating.
  For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System: the location of the constriction made by the tongue (at the back, center, or front of the oral tract), or by the teeth or lips, determines which sound is produced.
Spectrogram: noise-like; energy is concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the periodic signal u[n] excites a linear time-invariant vocal tract with impulse response h[n], so the output at the lips due to the periodic source is
$$x_g[n] = h[n] * (g[n] * p[n])$$
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
Example 3.4 (cont.)
In the simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  x_q[n] can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
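The two-source model of Example 3.4 can be turned into a short synthesis sketch. Everything numerical here is an assumption made for illustration: the sampling rate, the pitch period, the Hann-shaped glottal pulse, and the single-resonance impulse responses standing in for h[n] and h_f[n]. The helper `resonance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000      # assumed sampling rate in Hz
P = 80         # assumed pitch period in samples (100 Hz)
N = 1600       # 0.2 s of signal

def resonance(freq_hz, decay, length):
    """Hypothetical impulse response: an exponentially decaying cosine."""
    m = np.arange(length)
    return np.exp(-m / decay) * np.cos(2 * np.pi * freq_hz * m / fs)

# Periodic glottal flow u[n] = g[n] * p[n] (convolution).
p = np.zeros(N)
p[::P] = 1.0
g = np.hanning(40)                 # assumed one-cycle glottal pulse
u = np.convolve(p, g)[:N]

h = resonance(500, 30, 200)        # stand-in for the vocal tract h[n]
h_f = resonance(3000, 10, 100)     # stand-in for the front cavity h_f[n]
q = rng.standard_normal(N)         # white-noise turbulence source q[n]

excitation = q * u                          # noise modulated by the glottal flow
x_g = np.convolve(h, u)[:N]                 # periodic component
x_q = np.convolve(h_f, excitation)[:N]      # noise component
x = x_g + x_q                               # simplified voiced fricative
```

Because the noise is multiplied by u[n], it is present only while the glottis is open, so the noise bursts follow the pitch period even though q[n] itself is not periodic.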
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum.
  Voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, because its shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m].
h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
We have assumed that the two outputs can be combined linearly.
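The generalized convolution of Example 3.5 can be written directly as a double loop. The sketch below assumes a causal input and a finite response length M. The gliding 500-to-1000 Hz resonance, the fixed 3 kHz front-cavity response, and the function name `time_varying_filter` are hypothetical choices for illustration.

```python
import numpy as np

def time_varying_filter(h_nm, u):
    """Compute y[n] = sum_m h[n, m] * u[n - m] for a causal input u.

    h_nm[n, m] is the response at time n to a unit sample applied at n - m.
    """
    n_len, m_len = h_nm.shape
    y = np.zeros(n_len)
    for n in range(n_len):
        for m in range(min(m_len, n + 1)):
            y[n] += h_nm[n, m] * u[n - m]
    return y

fs = 8000                       # assumed sampling rate in Hz
N, M, P = 800, 60, 80           # signal length, response length, pitch period
n = np.arange(N)[:, None]       # time index n as a column
m = np.arange(M)[None, :]       # lag index m as a row

# Hypothetical vocal tract: one resonance gliding from 500 Hz to 1000 Hz.
f_n = 500.0 + 500.0 * n / (N - 1)
h_nm = np.exp(-m / 15.0) * np.cos(2 * np.pi * f_n * m / fs)
# Hypothetical front cavity: fixed 3 kHz resonance in the same [n, m] layout.
h_f_nm = np.exp(-m / 5.0) * np.cos(2 * np.pi * 3000.0 * m / fs) * np.ones((N, 1))

p = np.zeros(N)
p[::P] = 1.0
u = np.convolve(p, np.hanning(20))[:N]   # periodic glottal flow u[n]
burst = np.zeros(N)
burst[0] = 1.0                           # burst idealized as an impulse at n = 0

x = time_varying_filter(h_nm, u) + time_varying_filter(h_f_nm, burst)
```

When every row of the response array is the same, the loop reduces to ordinary convolution, which gives a convenient sanity check for the implementation.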
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels
Two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").
Glides, compared with diphthongs, have a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared with fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
Consider a time-invariant all-pole linear system model of the vocal tract with a pole at z_0 = r_0 e^{jω_0} that corresponds approximately to a vocal tract formant:
  The frequency of the formant is ω_0.
  The bandwidth depends on the distance of the pole from the unit circle (r_0).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i}(1-c_k z^{-1})(1-c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i}\frac{A_k}{(1-c_k z^{-1})(1-c_k^{*} z^{-1})}$$
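The link between a pole and its formant can be checked numerically for a single complex-conjugate pole pair. In this sketch the sampling rate and the pole location (1 kHz, radius 0.95) are assumptions. The bandwidth estimate -ln(r_0)·fs/π is a common narrow-bandwidth approximation and is not taken from the slides.

```python
import numpy as np

fs = 8000.0                    # assumed sampling rate in Hz
f0, r0 = 1000.0, 0.95          # assumed pole angle (as a frequency in Hz) and radius
c = r0 * np.exp(2j * np.pi * f0 / fs)

def H(omega):
    """Two-pole section 1 / ((1 - c z^-1)(1 - c* z^-1)) on the unit circle."""
    z_inv = np.exp(-1j * omega)
    return 1.0 / ((1 - c * z_inv) * (1 - np.conj(c) * z_inv))

omega = np.linspace(0.0, np.pi, 8001)      # frequency grid, 0.5 Hz spacing
mag = np.abs(H(omega))

# The formant frequency appears as the peak of the magnitude response.
peak_hz = omega[np.argmax(mag)] * fs / (2 * np.pi)

# Measured 3 dB bandwidth versus the approximation based on the pole radius.
passband = omega[mag >= mag.max() / np.sqrt(2)]
bw_measured_hz = (passband[-1] - passband[0]) * fs / (2 * np.pi)
bw_estimate_hz = -np.log(r0) * fs / np.pi
print(peak_hz, bw_measured_hz, bw_estimate_hz)
```

Moving r_0 toward 1 narrows the peak, and moving it toward 0 broadens it, which is the sense in which the bandwidth depends on the distance of the pole from the unit circle.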
Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is
$$x[n] = h[n] * (g[n] * p[n])$$
A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment
$$x[n,\tau] = w[n,\tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment, i.e. its frequency-domain representation, is obtained.
Example 3.2 (cont.)
$$X(\omega,\tau) = \frac{1}{P}\,W(\omega,\tau) \circledast \left[H(\omega)\,G(\omega)\sum_{k=-\infty}^{\infty}\delta(\omega-\omega_k)\right]$$
$$X(\omega,\tau) = \frac{1}{P}\sum_{k=-\infty}^{\infty}H(\omega_k)\,G(\omega_k)\,W(\omega-\omega_k,\tau)$$
where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency (pitch).
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, which consists of glottal and vocal tract contributions (unlike Example 3.1, which involved only the glottal contribution).
Example 3.2
Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  the nature of the glottal flow waveform over a cycle (e.g. a gradual or abrupt closing), and
  the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)| because of the sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.
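The sparse-sampling effect can be illustrated by evaluating an assumed spectral envelope only at the harmonic frequencies. The two-formant envelope (500 Hz and 1500 Hz, pole radius 0.97), the sampling rate, and the two pitch values are hypothetical. The helpers `envelope` and `harmonic_samples` are illustrative names.

```python
import numpy as np

fs = 8000.0    # assumed sampling rate in Hz

def envelope(freq_hz, formants_hz=(500.0, 1500.0), r=0.97):
    """Magnitude of an assumed all-pole envelope with the given formants."""
    z_inv = np.exp(-2j * np.pi * np.asarray(freq_hz, dtype=float) / fs)
    H = np.ones_like(z_inv)
    for f in formants_hz:
        c = r * np.exp(2j * np.pi * f / fs)
        H = H / ((1 - c * z_inv) * (1 - np.conj(c) * z_inv))
    return np.abs(H)

def harmonic_samples(pitch_hz, f_max=4000.0):
    """Harmonic frequencies k*pitch up to f_max and the envelope sampled there."""
    k = np.arange(1, int(f_max // pitch_hz) + 1)
    f_k = k * pitch_hz
    return f_k, envelope(f_k)

f_low, a_low = harmonic_samples(100.0)     # low pitch: dense harmonic sampling
f_high, a_high = harmonic_samples(350.0)   # high pitch: sparse harmonic sampling

strongest_low = f_low[np.argmax(a_low)]    # harmonic with the largest amplitude
strongest_high = f_high[np.argmax(a_high)]
envelope_peak = envelope(500.0)            # envelope value at the first formant
print(strongest_low, strongest_high, a_high.max() / envelope_peak)
```

At the lower pitch a harmonic falls on the first formant, so the peak of the envelope is visible in the harmonic amplitudes. At the higher pitch no harmonic lands near either formant and the strongest harmonic is well below the true envelope peak.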
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the sparsity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
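A rough numerical feel for this formant tuning can be had from a single two-pole resonance. The numbers below are assumptions, not measurements: a 900 Hz sung fundamental, a first formant at 500 Hz before tuning and at 900 Hz after, and a pole radius of 0.97 at an 8 kHz sampling rate.

```python
import numpy as np

fs = 8000.0    # assumed sampling rate in Hz

def formant_gain(f_hz, formant_hz, r=0.97):
    """Magnitude at f_hz of one two-pole resonance placed at formant_hz."""
    z_inv = np.exp(-2j * np.pi * f_hz / fs)
    c = r * np.exp(2j * np.pi * formant_hz / fs)
    return abs(1.0 / ((1 - c * z_inv) * (1 - np.conj(c) * z_inv)))

f0 = 900.0                                # assumed sung fundamental in Hz
untuned = formant_gain(f0, 500.0)         # F1 well below the fundamental
tuned = formant_gain(f0, 900.0)           # F1 raised to match the fundamental
gain_db = 20 * np.log10(tuned / untuned)  # level change at the first harmonic
print(gain_db)
```

The gain at the fundamental rises substantially once the resonance sits on the first harmonic, which is the mechanism the example describes.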
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
In the previous section the effect of vocal tract shape on sound production was discussed.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  Build-up of pressure behind the palate, and
  Abrupt release of pressure.
Source Generation: Plosives, "Drop" (figure annotations: VOT, aspiration)
Fricatives
Source Generation Another sound source is created when the tongue is
very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)
As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of
inputs The source spectrum is shaped at all frequencies by |H()|
Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape
April 22 2023 Veton Keumlpuska 49
Source Generation Fricatives ldquoNASArdquo
April 22 2023 Veton Keumlpuska 50
Source Generation There is another class of the source type that is
generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices
with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract
Vortex can be thought off as a tiny rotational airflow in the oral tract
There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds
April 22 2023 Veton Keumlpuska 51
Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal
source Unvoiced Speech sounds not generated with periodic
glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the
moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral
tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing
the vocal folds but without oscillations Example ldquoherdquo
However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example
ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives
April 22 2023 Veton Keumlpuska 52
Categorization of Sound By Source
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 54
Spectrographic Analysis of Speech Speech waveform consists of a sequence of
different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic
signal of the word ldquotordquo cannot capture this time-varying frequency content
In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability
April 22 2023 Veton Keumlpuska 55
Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding
(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to
avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1
Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal
wherex[n]= w[n]x[n]
represents the windowed speech segments as function of the window center at time
n
njenxX ][)(
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech The spectrogram is graphically displayed as
S() = |X()|2
S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal
For each window position one could plot S() A better and more compact representation of time-frequency
display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page
This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms
Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies
Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the analysis window w[n, τ].
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms do not overlap and that the corresponding transform side lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window is shorter than a pitch period, it essentially "sees" pieces of the periodically occurring sequence ĥ[n].
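The contrast between the two window lengths can be checked numerically. The sketch below is illustrative only: it assumes a pure impulse train as the signal (i.e. a flat envelope), a pitch period of 80 samples, and a window whose first sample sits at τ; none of these values come from the slides.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n_w - 1)) for m in range(n_w)]


def stft_magnitude(x, tau, n_w, omega):
    """|X(omega, tau)| using a Hamming window whose first sample is at tau."""
    w = hamming(n_w)
    return abs(sum(w[m] * x[tau + m] * cmath.exp(-1j * omega * (tau + m))
                   for m in range(n_w)))


P = 80                                                    # pitch period in samples (assumed)
x = [1.0 if n % P == 0 else 0.0 for n in range(10 * P)]   # impulse train p[n]

omega_harmonic = 2 * math.pi * 3 / P      # third harmonic
omega_between = 2 * math.pi * 3.5 / P     # halfway to the fourth harmonic

# Narrowband: the window spans four pitch periods, so harmonics are resolved.
long_on = stft_magnitude(x, 0, 4 * P + 1, omega_harmonic)
long_off = stft_magnitude(x, 0, 4 * P + 1, omega_between)

# Wideband: the window is shorter than one pitch period and sees a single impulse.
short_on = stft_magnitude(x, 60, 41, omega_harmonic)
short_off = stft_magnitude(x, 60, 41, omega_between)

print(long_on, long_off)     # large on the harmonic, small between harmonics
print(short_on, short_off)   # equal: no harmonic structure remains
```

With the long window the magnitude on a harmonic is much larger than between harmonics, while the short window gives the same magnitude at both frequencies, which is the smearing of the harmonic lines described above.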
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
The wideband spectrogram shows the formants of the vocal tract in frequency, and also gives vertical striations in time every pitch period rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
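The origin of the vertical striations can be illustrated by tracking the energy under the window as it slides. This is a rough sketch under stated assumptions: the "voiced" waveform is a decaying pulse restarted every pitch period, and the pitch period and window lengths are arbitrary illustrative values.

```python
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n_w - 1)) for m in range(n_w)]


def window_energy(x, tau, n_w):
    """Energy of the windowed segment, with the window's first sample at tau."""
    w = hamming(n_w)
    return sum((w[m] * x[tau + m]) ** 2 for m in range(n_w))


P = 80                                        # pitch period in samples (assumed)
x = [0.9 ** (n % P) for n in range(10 * P)]   # crude voiced waveform (assumed)


def energy_ripple(n_w):
    """Largest over smallest window energy as the window slides over two periods."""
    energies = [window_energy(x, tau, n_w) for tau in range(P, 3 * P)]
    return max(energies) / min(energies)


wideband_ripple = energy_ripple(P // 4)         # window shorter than a pitch period
narrowband_ripple = energy_ripple(4 * P + 1)    # window spanning four pitch periods

print(wideband_ripple, narrowband_ripple)
```

The short window produces a large swing in energy once per pitch period, which appears as vertical striations; the long window averages over several periods, so its energy is nearly constant.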
Figure 3.15
Figure 3.16
MATLAB
[Figure slides: spectrogram (time axis 0 to 3 s, frequency axis 0 to 8000 Hz) and normalized-magnitude waveform of the utterance "She had your dark suit in greasy wash water all year", labeled with its words and phones: h# | she: sh iy | had: hv ae dcl jh | your: ax-h | dark: dcl d aa r kcl k | suit: s ux tcl | in: ax-h n | greasy: gcl g r iy s ix | wash: w aa sh | water: w ao dx axr | all: ao l | year: y ih axr | h#]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.
If the first two descriptors of the categorization (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification of English speech sounds is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features, obtained from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
Pitch: not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
Articulatory differences among speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives the vocal folds are relaxed and not vibrating.
  For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction, made by the tongue at the back, center, or front of the oral tract, or at the teeth or lips, determines which sound is produced.
Spectrogram: noise-like, with energy concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the output at the lips due to the periodic source is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component is modeled as the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
Example 3.4 (cont.)
We assume in this simplified model that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  x_q[n] can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
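The simplified voiced-fricative model of Example 3.4 can be sketched in a few lines. The pulse shape g, the impulse responses h and h_f, the pitch period, and the random seed below are placeholders chosen for illustration, not values from the example.

```python
import random


def convolve(a, b):
    """Linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


random.seed(0)
P, periods = 40, 5                                     # pitch period and duration (assumed)
g = [0.2, 0.6, 1.0, 0.6, 0.2]                          # glottal pulse over one cycle (placeholder)
p = [1.0 if n % P == 0 else 0.0 for n in range(P * periods)]
u = convolve(g, p)[: len(p)]                           # u[n] = g[n] * p[n]

h = [0.8 ** n for n in range(30)]                      # vocal tract response (placeholder)
h_f = [(-0.5) ** n for n in range(10)]                 # front-cavity response (placeholder)
q = [random.gauss(0.0, 1.0) for _ in u]                # white noise at the constriction

noise_source = [qn * un for qn, un in zip(q, u)]       # q[n] u[n]: noise modulated by the flow
x_g = convolve(h, u)                                   # periodic component
x_q = convolve(h_f, noise_source)                      # noise component
x = [x_g[n] + x_q[n] for n in range(len(u))]           # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by u[n], the noise source is nonzero only while glottal flow is present, so the frication noise in this sketch is pulsed at the pitch rate.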
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  During the period when the oral tract is closed we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by h_f[n, m].
h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$$
We have assumed that the two outputs can be linearly combined.
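A direct evaluation of the generalized convolution in Example 3.5 might look like the sketch below. The particular forms of h and h_f (a drifting one-pole decay and a fixed alternating decay), the pitch period, and the truncation of the sum to `memory` terms are assumptions made only to keep the example small; the glottal pulse g[n] is taken as a unit impulse.

```python
def time_varying_output(u, burst, h, h_f, memory):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) burst[n - m], with m < memory."""
    x = []
    for n in range(len(u)):
        total = 0.0
        for m in range(min(memory, n + 1)):
            total += h(n, m) * u[n - m] + h_f(n, m) * burst[n - m]
        x.append(total)
    return x


N, P = 120, 30                                         # duration and pitch period (assumed)
u = [1.0 if n % P == 0 else 0.0 for n in range(N)]     # periodic source, g[n] taken as an impulse
burst = [1.0] + [0.0] * (N - 1)                        # idealized burst at n = 0


def h(n, m):
    """Vocal tract response at time n to a sample applied m samples earlier (placeholder)."""
    a = 0.5 + 0.4 * min(n / 60.0, 1.0)                 # decay drifts as the tract moves to the vowel
    return a ** m


def h_f(n, m):
    """Front-cavity response, kept time-invariant here for simplicity (placeholder)."""
    return (-0.6) ** m


x = time_varying_output(u, burst, h, h_f, memory=40)
```

When h(n, m) does not depend on n, the first sum reduces to an ordinary convolution, which is a convenient sanity check for an implementation of this kind.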
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Have a vowel-like nature with vibrating vocal folds.
  Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Are characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels
  Two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").
  Glides have a greater constriction of the oral tract during the transition, and a greater speed of oral tract movement, compared to diphthongs.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced).
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules governing changes in speech that extend over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Figure 3.10
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
Consider a time-invariant all-pole linear system model of the vocal tract with a pole at z0 = r0 e^{jω0} that corresponds approximately to a vocal tract formant:
  The frequency of the formant is ω0.
  The bandwidth depends on the distance of the pole from the unit circle (r0).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
$$H(z) = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
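The link between pole position and formant shape can be explored numerically with a single conjugate pole pair. The sketch below assumes the commonly used mapping r0 = exp(-πB/fs) between pole radius and bandwidth B in Hz, a sampling rate of 8000 Hz, and a 500-Hz formant; these are illustrative choices, not values given in the slides.

```python
import cmath
import math


def resonator_gain(freq_hz, formant_hz, bandwidth_hz, fs):
    """|H| at freq_hz for one conjugate pole pair at z0 = r0 * exp(+-j * omega0)."""
    r0 = math.exp(-math.pi * bandwidth_hz / fs)      # pole radius (assumed mapping)
    omega0 = 2 * math.pi * formant_hz / fs           # pole angle
    z0 = r0 * cmath.exp(1j * omega0)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return 1.0 / abs((1 - z0 * z_inv) * (1 - z0.conjugate() * z_inv))


def peak_and_width(formant_hz, bandwidth_hz, fs=8000):
    """Peak frequency and half-power width of |H| in Hz, searched on a 1-Hz grid."""
    gains = [resonator_gain(f, formant_hz, bandwidth_hz, fs) for f in range(fs // 2)]
    peak = max(range(len(gains)), key=gains.__getitem__)
    half_power = gains[peak] / math.sqrt(2)
    lo = hi = peak
    while lo > 0 and gains[lo - 1] >= half_power:
        lo -= 1
    while hi < len(gains) - 1 and gains[hi + 1] >= half_power:
        hi += 1
    return peak, hi - lo


narrow = peak_and_width(500, 80)     # pole close to the unit circle
wide = peak_and_width(500, 200)      # pole farther from the unit circle
print(narrow, wide)
```

Both resonators peak near the pole angle, and moving the pole away from the unit circle broadens the resonance, consistent with the statement that bandwidth depends on r0.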
Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
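The dependence of formant frequencies on tract length can be illustrated with the textbook idealization of a uniform tube closed at the glottis and open at the lips, whose resonances are odd multiples of c/(4L). This model and the tract lengths below are assumptions brought in for illustration; the slides state only the qualitative trend.

```python
def tube_formants(length_m, count=3, speed_of_sound=343.0):
    """Resonances of a uniform tube closed at one end: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * speed_of_sound / (4 * length_m) for k in range(1, count + 1)]


# Illustrative tract lengths in meters (assumed, not measured values).
longer_tract = tube_formants(0.17)
shorter_tract = tube_formants(0.14)
shortest_tract = tube_formants(0.10)

print(longer_tract)
print(shorter_tract)
print(shortest_tract)
```

Every formant moves up as the tube gets shorter, which matches the ordering of male, female, and child formants described above.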
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment, i.e. its frequency-domain representation, is obtained:
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$
$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency, or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consisted only of the glottal contribution).
Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  the nature of the glottal flow waveform over a cycle, e.g. a gradual or abrupt closing, and
  the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because the source harmonics sample the spectral envelope |H(ω)G(ω)| only sparsely. This is especially the case for high-pitched speech.
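The sampling of the spectral envelope by the harmonics can be verified on a toy case. The sketch assumes a one-pole combined response ĥ[n] = a^n (so the envelope is known in closed form) and a pitch period of 20 samples; both are illustrative choices.

```python
import cmath
import math

a = 0.9     # decay of the combined response h_hat[n] = a**n (assumed one-pole model)
P = 20      # pitch period in samples (assumed)


def envelope(omega):
    """H_hat(omega) for h_hat[n] = a**n, n >= 0."""
    return 1.0 / (1.0 - a * cmath.exp(-1j * omega))


# One period of the steady-state periodic output: overlapping copies of h_hat spaced P apart.
copies = 60
period = [sum(a ** (n + k * P) for k in range(copies)) for n in range(P)]

# A DFT over exactly one period yields the spectrum at the harmonics omega_k = 2*pi*k/P only.
harmonic_values = [
    sum(period[n] * cmath.exp(-2j * math.pi * k * n / P) for n in range(P))
    for k in range(P)
]
errors = [abs(harmonic_values[k] - envelope(2 * math.pi * k / P)) for k in range(P)]
print(max(errors))
```

The periodic signal carries the envelope only at multiples of 2π/P. With a shorter pitch period (higher pitch) there are fewer such samples between 0 and π, which is why formant peaks can fall between harmonics and become hard to locate.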
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
A soprano often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
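The effect of moving F1 onto the first harmonic can be estimated with a single-resonance model of the vocal tract. The sung pitch (880 Hz), the two F1 values, the 80-Hz bandwidth, the sampling rate, and the radius-bandwidth mapping r0 = exp(-πB/fs) are all assumptions made for this sketch.

```python
import cmath
import math


def resonator_gain(freq_hz, formant_hz, bandwidth_hz=80.0, fs=16000):
    """|H| at freq_hz for one conjugate pole pair modeling the first formant."""
    r0 = math.exp(-math.pi * bandwidth_hz / fs)      # pole radius (assumed mapping)
    z0 = r0 * cmath.exp(2j * math.pi * formant_hz / fs)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return 1.0 / abs((1 - z0 * z_inv) * (1 - z0.conjugate() * z_inv))


pitch_hz = 880.0                                             # sung fundamental (assumed)
gain_untuned = resonator_gain(pitch_hz, formant_hz=500.0)    # F1 well below the pitch
gain_tuned = resonator_gain(pitch_hz, formant_hz=880.0)      # F1 raised to the pitch
gain_db = 20 * math.log10(gain_tuned / gain_untuned)
print(gain_db)
```

In this toy model, raising F1 to coincide with the first harmonic increases the gain applied to that harmonic by well over 6 dB, in line with the louder sound described in the example.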
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive), which involves
  a buildup of pressure behind the palate, and
  an abrupt release of pressure.
Source Generation: Plosives, "Drop"
[Figure annotations: VOT, Aspiration]
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g. fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e. impulse or noise source). There is no harmonic structure with these types of inputs: the source spectrum is shaped at all frequencies by |H(ω)|.
Note that the noise spectrum was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract, but it is less understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
However, these source types are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. For example, a single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window
$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n \le N_w - 1$$
The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau]\, x[n]$$
represents the windowed speech segments as a function of the window center at time τ.
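The windowing and transform steps above can be sketched directly from the definitions. This is a minimal illustration rather than an efficient implementation: it evaluates the transform at a single frequency by direct summation, takes τ as the first sample of the window (an assumption of this sketch), and uses a made-up sinusoidal test signal.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 * cos(2*pi*m / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n_w - 1)) for m in range(n_w)]


def stft_frame(x, tau, n_w, omega):
    """X(omega, tau) for one window position (tau = first sample of the window)."""
    w = hamming(n_w)
    total = 0j
    for m in range(n_w):
        n = tau + m
        if 0 <= n < len(x):
            total += w[m] * x[n] * cmath.exp(-1j * omega * n)
    return total


def spectrogram_value(x, tau, n_w, omega):
    """Squared magnitude of the short-time transform at (omega, tau)."""
    return abs(stft_frame(x, tau, n_w, omega)) ** 2


# Test signal (assumed for illustration): a sinusoid at omega0 radians per sample.
omega0 = 2 * math.pi * 0.1
x = [math.cos(omega0 * n) for n in range(400)]

# Slide a 101-sample window in steps of 50 samples (the frame interval).
frames = [spectrogram_value(x, tau, 101, omega0) for tau in range(0, 300, 50)]
off_peak = spectrogram_value(x, 0, 101, 3 * omega0)
print(frames, off_peak)
```

At every window position the value at the sinusoid's own frequency is far larger than the value at an unrelated frequency, and the frame interval of 50 samples sets how densely the time axis is sampled.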
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal
For each window position one could plot S() A better and more compact representation of time-frequency
display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page
This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms
Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies
Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To distinguish the concept of a phoneme from the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
Studying speech sounds through the first two of the four classification perspectives (source and vocal tract shape) is referred to as articulatory phonetics.
Studying them through the last two (waveform and spectrogram) is referred to as acoustic phonetics.
Elements of a Language
One broad classification of English speech sounds is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two classification perspectives) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages yield smaller or larger numbers:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features derived from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language - Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  Articulatory differences among speakers are one cause of allophonic variations:
    the place and degree of constriction of the tongue hump, and
    the cross-section and length of the vocal tract.
  Therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language - Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract.
  Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language - Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System:
  The location of the constriction made by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram:
  Noise-like, with energy concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  $u[n] = g[n] \ast p[n]$
where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $\ast$ denotes convolution.
In a simplified model, the periodic component of the voiced fricative output at the lips is
  $x_g[n] = h[n] \ast (g[n] \ast p[n])$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ and assumed to be white noise. The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:
  $x_q[n] = h_f[n] \ast (q[n] \, u[n])$
Example 3.4 (continued)
In the simplified model we assume that the results of the two airflow sources add:
  $x[n] = x_g[n] + x_q[n] = h[n] \ast u[n] + h_f[n] \ast (q[n] \, u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
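The voiced fricative model of Example 3.4 can be sketched numerically. The following is a minimal illustration only: the sampling rate, pitch period, glottal pulse shape, and the two decaying-cosine impulse responses are all assumed values chosen for the sketch, not quantities given on the slides.

```python
import math
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def resonator(freq_hz, decay, fs, length):
    """Decaying cosine used as a stand-in impulse response (assumption)."""
    return [(decay ** n) * math.cos(2 * math.pi * freq_hz * n / fs)
            for n in range(length)]

fs = 8000        # sampling rate in Hz (assumed)
P = 80           # pitch period in samples, i.e. 100 Hz pitch (assumed)
n_periods = 5

# g[n]: one glottal cycle, open for the first half of the period, closed after.
g = [0.5 * (1 - math.cos(2 * math.pi * n / (P // 2))) if n < P // 2 else 0.0
     for n in range(P)]
# p[n]: impulse train with spacing P.
p = [1.0 if n % P == 0 else 0.0 for n in range(n_periods * P)]
# u[n] = g[n] * p[n], truncated to the length of p[n].
u = convolve(g, p)[:len(p)]

rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in u]            # white noise q[n]
modulated = [qn * un for qn, un in zip(q, u)]   # q[n] u[n]

h = resonator(500.0, 0.97, fs, 200)     # whole vocal tract h[n] (illustrative)
hf = resonator(3000.0, 0.90, fs, 100)   # front cavity h_f[n] (illustrative)

x_g = convolve(h, u)            # periodic component
x_q = convolve(hf, modulated)   # noise component, gated by the glottal flow
n_out = min(len(x_g), len(x_q))
x = [x_g[n] + x_q[n] for n in range(n_out)]
```

Because the noise is multiplied by u[n] before filtering, it is present only while the glottis is open, which is why the noise in a voiced fricative appears in bursts at the pitch rate.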
Elements of a Language - Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language - Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language - Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language - Plosives: Waveform
Elements of a Language - Plosives: Spectrogram
Elements of a Language - Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$.
The glottal flow velocity model for the periodic source component is
  $u[n] = g[n] \ast p[n]$
where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $\ast$ denotes convolution.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during its transition from the burst to the following steady vowel.
  The vocal tract output therefore cannot be obtained with the convolution operator.
  It must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ represent the responses at time $n$ to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
  $x[n] = \sum_{m} h[n, m] \, u[n - m] + \sum_{m} h_f[n, m] \, \delta[n - m]$
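The generalized convolution of Example 3.5 can be written directly as a double loop. This is a minimal sketch: the function name time_varying_filter and the two example responses h_fixed and h_moving are hypothetical and chosen only to show the mechanics, and the response is assumed causal.

```python
def time_varying_filter(h, u):
    """x[n] = sum_m h(n, m) * u[n - m], with u[n] taken as zero outside its support.

    h is a function (n, m) -> float giving the response at time n to a unit
    sample applied m samples earlier; it is assumed causal (zero for m < 0).
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):      # only u[0..n] can contribute
            acc += h(n, m) * u[n - m]
        x.append(acc)
    return x

def h_fixed(n, m):
    # Time-invariant special case: depends on the lag m only.
    return 0.5 ** m

def h_moving(n, m):
    # Hypothetical time-varying case: the decay rate drifts with time n.
    a = 0.5 + 0.4 * min(n, 10) / 10.0
    return a ** m

impulse = [1.0] + [0.0] * 7         # delta[n], the idealized burst at n = 0
burst_fixed = time_varying_filter(h_fixed, impulse)
burst_moving = time_varying_filter(h_moving, impulse)
ordinary = time_varying_filter(h_fixed, [1.0, 2.0, 3.0])
```

When the response depends only on the lag m, as in h_fixed, the sum reduces to ordinary convolution; with h_moving the output at each time n is shaped by the vocal tract as it is at that instant.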
Elements of a Language - Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  They do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  They are characterized by movement from one vowel target to another.
  Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
Elements of a Language - Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (w as in "we" and y as in "you")
  Liquids (r as in "read" and l as in "let")
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.
Figure 3.28 - Liquids and Glides
Elements of a Language - Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have
  a fricative portion preceded by a complete constriction of the oral cavity,
  formed at the same place as for the plosive.
Examples:
  tS as in "chew" (unvoiced)
  J as in "just" (voiced)
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time.
    Examples: "horse" vs. "horseshoe", "sweep" vs. "seep"
  Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody - The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  The frequency of the formant is $\omega_0$.
  The bandwidth depends on the distance of the pole from the unit circle (set by $r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:
  $H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$
  $H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$
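The link between pole position and formant can be checked numerically for a single complex-conjugate pole pair. The sketch below is illustrative only: the pole radii, the pole angle, and the helper names magnitude_response and peak_and_bandwidth are assumptions, and the bandwidth is measured as the -3 dB width on a frequency grid.

```python
import cmath
import math

def magnitude_response(r0, w0, w):
    """|H(e^{jw})| for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)), c = r0 e^{j w0}."""
    c = r0 * cmath.exp(1j * w0)
    z_inv = cmath.exp(-1j * w)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

def peak_and_bandwidth(r0, w0, n_grid=4096):
    """Locate the peak on a grid over [0, pi] and measure its -3 dB width (radians)."""
    grid = [math.pi * i / n_grid for i in range(n_grid + 1)]
    mags = [magnitude_response(r0, w0, w) for w in grid]
    i_peak = max(range(len(mags)), key=mags.__getitem__)
    half_power = mags[i_peak] / math.sqrt(2)
    lo = i_peak
    while lo > 0 and mags[lo] >= half_power:
        lo -= 1
    hi = i_peak
    while hi < n_grid and mags[hi] >= half_power:
        hi += 1
    return grid[i_peak], grid[hi] - grid[lo]

w0 = math.pi / 4
peak_near, bw_near = peak_and_bandwidth(0.98, w0)   # pole close to the unit circle
peak_far, bw_far = peak_and_bandwidth(0.90, w0)     # pole farther inside
```

In both cases the spectral peak sits close to the pole angle, and moving the pole away from the unit circle widens the resonance, in line with the statements on the slide.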
Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.
Under the assumptions that the vocal tract is linear and time-invariant and that the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
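The dependence of formants on vocal tract length can be illustrated with a common idealization that is not stated on the slides: a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at $F_k = (2k - 1) c / (4L)$. The function name tube_formants, the speed of sound, and the two tract lengths below are assumed values for the sketch.

```python
def tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end and open at the other."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, n_formants + 1)]

longer_tract = tube_formants(0.175)   # roughly an adult male tract length (assumed)
shorter_tract = tube_formants(0.14)   # a shorter tract (assumed)
```

Under this idealization every formant scales inversely with length, so the shorter tract has uniformly higher formants. Real vocal tracts are not uniform tubes, so this shows only the trend.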
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
  $u[n] = g[n] \ast p[n]$
where $g[n]$ is the airflow over one glottal cycle, $p[n]$ is the unit sample train with spacing $P$, and $\ast$ denotes convolution.
When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is
  $x[n] = h[n] \ast (g[n] \ast p[n])$
A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment
  $x[n, \tau] = w[n, \tau] \, \{ h[n] \ast (g[n] \ast p[n]) \}$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.
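Example 3.2 (continued)
  $X(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k} \delta(\omega - \omega_k) \right]$
  $X(\omega, \tau) = \frac{1}{P} \sum_{k} H(\omega_k) G(\omega_k) W(\omega - \omega_k, \tau)$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\circledast$ denotes convolution in frequency, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega) G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only the glottal contribution).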
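The harmonic sampling described in Example 3.2 can be reproduced with a small numerical experiment. This is a sketch under assumed values: the pitch period, the three-sample glottal pulse, and the single-resonance impulse response (placed on the third harmonic) are illustrative choices, and a direct DFT is used so the block needs only the standard library.

```python
import cmath
import math

P = 32                      # pitch period in samples (assumed)
N = 8 * P                   # analysis window length: 8 pitch periods

g = [0.25, 0.5, 0.25]       # illustrative glottal pulse g[n]
# Illustrative vocal tract h[n]: one resonance placed on the 3rd harmonic.
h = [(0.9 ** n) * math.cos(2 * math.pi * 3 * n / P) for n in range(128)]
u_len = 13 * P
p = [1.0 if n % P == 0 else 0.0 for n in range(u_len)]

def convolve(a, b, out_len):
    """First out_len samples of the linear convolution of a and b."""
    out = [0.0] * out_len
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j >= out_len:
                break
            out[i + j] += ai * bj
    return out

u = convolve(g, p, u_len)   # u[n] = g[n] * p[n]
x = convolve(h, u, u_len)   # x[n] = h[n] * u[n]

start = 5 * P               # skip the start-up transient
window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
segment = [window[n] * x[start + n] for n in range(N)]

def dft_magnitude(seq, k):
    """Magnitude of the k-th DFT coefficient of seq."""
    n_pts = len(seq)
    return abs(sum(s * cmath.exp(-2j * math.pi * k * n / n_pts)
                   for n, s in enumerate(seq)))

bins_per_harmonic = N // P  # harmonics fall on every 8th DFT bin
spectrum = [dft_magnitude(segment, k) for k in range(N // 2 + 1)]
harmonic_peaks = [spectrum[k * bins_per_harmonic] for k in range(1, 7)]
```

The spectrum shows peaks at multiples of the pitch frequency, and the heights of those peaks follow the envelope |H G|: the largest of harmonic_peaks is the third, where the resonance was placed.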
Example 3.2: Figure 3.11
Example 3.2 (continued)
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  the manner in which the formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega) G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the sparsity of spectral information can be detrimental to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).
Example 3.3
A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
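The effect described in Example 3.3 can be estimated with a single-resonance stand-in for the vocal tract. The numbers are assumptions for illustration only: a 900 Hz sung fundamental, a first formant at 400 Hz before tuning and 900 Hz after, a 16 kHz sampling rate, and a pole radius of 0.97. The helper name resonator_gain is hypothetical.

```python
import cmath
import math

def resonator_gain(formant_hz, eval_hz, fs=16000.0, r=0.97):
    """Gain at eval_hz of a two-pole resonator whose pole angle is set by formant_hz."""
    c = r * cmath.exp(2j * math.pi * formant_hz / fs)
    z_inv = cmath.exp(-2j * math.pi * eval_hz / fs)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

f0 = 900.0                                   # sung fundamental (assumed)
gain_untuned = resonator_gain(400.0, f0)     # F1 well below the fundamental
gain_tuned = resonator_gain(900.0, f0)       # F1 raised to match the fundamental
gain_db = 20 * math.log10(gain_tuned / gain_untuned)
```

With these values the first harmonic gains well over 6 dB when the formant is moved onto it. A real vocal tract has several formants and losses, so the figure is indicative only.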
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds.
  Examples: "nose" and "mouse"
Spectral Shaping - "Nose"
Spectral Shaping - "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
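The second effect, an anti-resonance introduced by a zero in the transfer function, can be seen in a toy pole-zero model. The pole and zero locations below are arbitrary illustrative values, and the function names pair, oral_only, and nasalized are hypothetical.

```python
import cmath
import math

def pair(radius, freq, w):
    """|(1 - c e^{-jw})(1 - c* e^{-jw})| for c = radius * e^{j freq}."""
    c = radius * cmath.exp(1j * freq)
    z_inv = cmath.exp(-1j * w)
    return abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

def oral_only(w):
    """All-pole model: one resonance at 0.3*pi (illustrative)."""
    return 1.0 / pair(0.95, 0.3 * math.pi, w)

def nasalized(w):
    """Same resonance plus a zero pair at 0.5*pi standing in for nasal coupling."""
    return pair(0.95, 0.5 * math.pi, w) / pair(0.95, 0.3 * math.pi, w)

w_zero = 0.5 * math.pi
dip_db = 20 * math.log10(nasalized(w_zero) / oral_only(w_zero))
```

At the zero frequency the response drops by roughly 20 dB relative to the all-pole model, and it is lower there than at nearby frequencies on either side, which is the spectral signature of an anti-resonance.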
Plosives
Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  build-up of pressure behind the palate, followed by
  abrupt release of pressure.
Source Generation - Plosives: "Drop" (figure annotations: VOT, Aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate, but the airflow is not completely impeded; the resulting turbulence acts as a noise source (e.g., fricatives).
As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise source was idealized as having a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.
Source Generation - Fricatives: "NASA"
Source Generation
There is another class of source that is generated within the vocal tract but is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin"
  Plosives: created with an impulsive source within the oral tract. Example: "top"
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"
However, these sound sources are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds.
  Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives)
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: the Hamming window
    $w[n, \tau] = 0.54 - 0.46 \cos\left[ \frac{2\pi (n - \tau)}{N_w - 1} \right]$ for $0 \le n - \tau \le N_w - 1$
The window typically does not move one sample at a time; it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
  $X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] \, e^{-j\omega n}$
where
  $x[n, \tau] = w[n, \tau] \, x[n]$
represents the windowed speech segments as a function of the window location $\tau$.
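The windowing and transform steps on this slide can be sketched as a small short-time analysis routine. This is a minimal illustration using only the standard library: the function names hamming, stft_frame, and spectrogram, the test tone, and the window, hop, and DFT sizes are all assumed for the example, and the window is taken to start at sample tau.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of n_w samples: 0.54 - 0.46 cos(2 pi n / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft_frame(x, tau, window, n_freq):
    """X(w_k, tau) at w_k = 2 pi k / n_freq for a window starting at sample tau."""
    seg = [window[n] * x[tau + n] for n in range(len(window))]
    return [sum(s * cmath.exp(-2j * math.pi * k * (tau + n) / n_freq)
                for n, s in enumerate(seg))
            for k in range(n_freq)]

def spectrogram(x, n_w, hop, n_freq):
    """Squared STFT magnitude, one row of n_freq values per frame."""
    window = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):   # advance by the frame interval
        frames.append([abs(v) ** 2 for v in stft_frame(x, tau, window, n_freq)])
    return frames

# Test tone placed exactly on DFT bin 8 of a 64-point analysis.
tone = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = spectrogram(tone, n_w=64, hop=32, n_freq=64)
```

Each row of frames is one time slice of the spectrogram; for the steady tone every slice peaks at the same frequency bin, while the hop size sets how often a slice is produced.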
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
  $S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a two-dimensional representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:
  $x[n, \tau] = w[n, \tau] \, \{ (p[n] \ast g[n]) \ast h[n] \}$
  $x[n, \tau] = w[n, \tau] \, ( p[n] \ast \tilde{h}[n] )$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] \ast h[n]$, and $\ast$ denotes convolution. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
  $S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k} \tilde{H}(\omega_k) \, W(\omega - \omega_k, \tau) \right|^2$
where $\tilde{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side lobes are negligible,
  the equation on the previous slide leads to the following approximation (Exercise 3.8):
  $S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\tilde{H}(\omega_k)|^2 \, |W(\omega - \omega_k, \tau)|^2$
Spectrographic Analysis of Speech - Narrowband Spectrogram (cont.)
  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
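Whether harmonics are resolved depends on the window length relative to the pitch period, which can be checked on a synthetic voiced signal. The sketch below is illustrative: the pitch period, the decaying-cosine stand-in for the combined glottal and vocal tract response, and the two window lengths are assumed values, and harmonic_contrast is a hypothetical helper that compares the spectrum on a harmonic with the spectrum midway to the next one.

```python
import cmath
import math

P = 40                                   # pitch period in samples (assumed)
w_formant = 2 * math.pi * 3 / P          # resonance placed on the 3rd harmonic

def h_tilde(n):
    """Illustrative combined glottal/vocal-tract response: a decaying cosine."""
    return (0.9 ** n) * math.cos(w_formant * n) if n >= 0 else 0.0

# Voiced signal x[n] = sum_k h_tilde[n - kP], long enough to reach steady state.
x = [sum(h_tilde(n - k * P) for k in range(n // P + 1)) for n in range(12 * P)]

def windowed_dtft_mag(start, n_w, omega):
    """|X(omega, tau)| for a Hamming window of n_w samples starting at 'start'."""
    total = 0j
    for n in range(n_w):
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1))
        total += w * x[start + n] * cmath.exp(-1j * omega * n)
    return abs(total)

def harmonic_contrast(start, n_w):
    """Ratio of |X| on the 3rd harmonic to |X| midway between harmonics 3 and 4."""
    on_harmonic = windowed_dtft_mag(start, n_w, 3.0 * 2 * math.pi / P)
    between = windowed_dtft_mag(start, n_w, 3.5 * 2 * math.pi / P)
    return on_harmonic / between

contrast_long = harmonic_contrast(5 * P, 4 * P)         # 4 pitch periods: narrowband
contrast_short = harmonic_contrast(5 * P + 5, P // 2)   # half a pitch period: wideband
```

With the long window the spectrum falls sharply between harmonics, so the harmonic lines are resolved; with the short window the two values are nearly equal, and only the smooth envelope remains.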
Spectrographic Analysis of Speech - Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The wider window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.
Spectrographic Analysis of Speech - Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
  $S(\omega, \tau) \approx \beta \, |\tilde{H}(\omega)|^2 \, E[\tau]$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
  $E[\tau] = \sum_{n} |x[n, \tau]|^2$
Spectrographic Analysis of Speech - Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
    The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
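The origin of the vertical striations can be seen by tracking the energy under the sliding window as it moves through a synthetic voiced signal. As before, this is only a sketch: the pitch period, the decaying-cosine response, and the two window lengths are assumed values, and short_time_energy is a hypothetical helper name.

```python
import math

P = 40                                   # pitch period in samples (assumed)
w_formant = 2 * math.pi * 3 / P

def h_tilde(n):
    """Illustrative combined glottal/vocal-tract response: a decaying cosine."""
    return (0.9 ** n) * math.cos(w_formant * n) if n >= 0 else 0.0

# Voiced signal x[n] = sum_k h_tilde[n - kP], long enough to reach steady state.
x = [sum(h_tilde(n - k * P) for k in range(n // P + 1)) for n in range(12 * P)]

def short_time_energy(tau, n_w):
    """E[tau] = sum_n (w[n] x[tau + n])^2 for a Hamming window starting at tau."""
    total = 0.0
    for n in range(n_w):
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1))
        total += (w * x[tau + n]) ** 2
    return total

taus = range(5 * P, 7 * P)               # two pitch periods in steady state
energy_short = [short_time_energy(t, P // 4) for t in taus]   # wideband-style window
energy_long = [short_time_energy(t, 4 * P) for t in taus]     # narrowband-style window
fluct_short = max(energy_short) / min(energy_short)
fluct_long = max(energy_long) / min(energy_long)
```

The energy under the short window rises and falls once per pitch period, which is what appears as vertical striations in a wideband spectrogram; under the long window the energy is almost constant, so no such striations appear.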
Figure 3.15
Figure 3.16
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives the vocal folds are relaxed and not vibrating.
  For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction determines which sound is produced. The constriction can be made by the tongue at the back, center, or front of the oral tract, as well as at the teeth or lips.
Spectrogram: noise-like, with energy concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as
  $u[n] = g[n] * p[n]$
where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
  $x_g[n] = h[n] * (g[n] * p[n])$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is the turbulent airflow velocity at the constriction, denoted by $q[n]$ and assumed to be white noise. The glottal flow $u[n]$ modulates the noise $q[n]$, and the modulated noise in turn excites the front oral cavity, which has impulse response $h_f[n]$:
  $x_q[n] = h_f[n] * (q[n]\, u[n])$
In this simplified model we assume that the outputs due to the two airflow sources add:
  $x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  There are sources of non-linear effects (distributed sources due to traveling vortices).
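The simplified voiced-fricative model of Example 3.4 can be sketched numerically. The following is a minimal illustration, not the course's own code: the sampling rate, the pitch period, the raised-cosine glottal pulse, and the two-pole `resonator` helper standing in for $h[n]$ and $h_f[n]$ are all assumptions chosen only to make the example runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000                      # assumed sampling rate in Hz
P = 80                         # pitch period in samples (100 Hz pitch at 8 kHz)
N = 800                        # number of samples to synthesize

# p[n]: impulse train with pitch period P
p = np.zeros(N)
p[::P] = 1.0

# g[n]: crude one-cycle glottal pulse (raised cosine); an assumption
g = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(40) / 40))

# u[n] = g[n] * p[n] (convolution), truncated to N samples
u = np.convolve(p, g)[:N]

def resonator(freq_hz, bandwidth_hz, length=200):
    """Impulse response of a two-pole resonator (hypothetical stand-in filter)."""
    r = np.exp(-np.pi * bandwidth_hz / fs)
    theta = 2.0 * np.pi * freq_hz / fs
    n = np.arange(length)
    return (r ** n) * np.sin((n + 1) * theta) / np.sin(theta)

h = resonator(500.0, 100.0)    # whole vocal tract, excited by u[n]
hf = resonator(3000.0, 400.0)  # front cavity, excited by the modulated noise

q = rng.standard_normal(N)     # q[n]: white noise at the constriction

x_g = np.convolve(h, u)[:N]        # x_g[n] = h[n] * u[n]
x_q = np.convolve(hf, q * u)[:N]   # x_q[n] = h_f[n] * (q[n] u[n])
x = x_g + x_q                      # the two source outputs add
```

Because the noise is multiplied by $u[n]$, it is present only while the glottis is open, which is why the noise in a voiced fricative appears in bursts at the pitch rate.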
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum.
  Voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps:
  While the oral tract is closed we hear a low-frequency vibration, caused by vocal fold vibrations propagating through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  The delay between the burst and the onset of voicing of the vowel is much shorter.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives - Waveform
Elements of a Language: Plosives - Spectrogram
Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$.
The glottal flow velocity model for the periodic source component is
  $u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ are the responses at time $n$ to a unit sample applied $m$ samples earlier, at time $n - m$.
Assuming that the two outputs combine linearly, the output can be written with a generalization of the convolution operator:
  $x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$
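The generalized convolution of Example 3.5 can be sketched directly from its definition. This is only an illustration under stated assumptions: the function name `time_varying_filter`, the causal finite-length responses, and the particular gliding resonance used for $h[n, m]$ are hypothetical and not taken from the slides.

```python
import numpy as np

def time_varying_filter(h, u):
    """Compute x[n] = sum_m h[n, m] u[n - m] for a causal time-varying system.

    h has shape (N, M): row n holds the impulse response at output time n,
    indexed by the lag m = 0..M-1. u has length N.
    """
    N, M = h.shape
    x = np.zeros(N)
    for n in range(N):
        for m in range(min(M, n + 1)):   # u[n - m] is taken as zero for n - m < 0
            x[n] += h[n, m] * u[n - m]
    return x

N, M, P = 400, 100, 80
n_idx = np.arange(N)[:, None]
m_idx = np.arange(M)[None, :]

# Hypothetical responses: a decaying resonance whose frequency glides with n
theta = 2.0 * np.pi * (0.05 + 0.05 * n_idx / N)        # radians per sample
h = (0.97 ** m_idx) * np.cos(theta * m_idx)            # vocal tract h[n, m]
# Front cavity h_f[n, m], held fixed over n here for simplicity
hf = (0.90 ** m_idx) * np.cos(0.8 * np.pi * m_idx) * np.ones((N, 1))

p = np.zeros(N)
p[::P] = 1.0
g = np.hanning(30)                       # assumed one-cycle glottal pulse
u = np.convolve(p, g)[:N]                # periodic source u[n] = g[n] * p[n]

burst = np.zeros(N)
burst[0] = 1.0                           # idealized burst: delta[n]

x = time_varying_filter(h, u) + time_varying_filter(hf, burst)
```

When every row of `h` is the same, the function reduces to ordinary convolution, and the burst term is simply $h_f[n, n]$, the response at time $n$ to the impulse applied at time 0.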
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  They do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  They are characterized by movement from one vowel target to another.
  Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
  Two categories of vowel-like sounds:
    glides (w as in "we" and y as in "you"), and
    liquids (r as in "read" and l as in "let").
  Compared to diphthongs, glides have
    greater constriction of the oral tract during the transition, and
    greater speed of oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs "horseshoe", "sweep" vs "seep").
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules governing changes in speech that extend over more than one phoneme:
  intonation (change in pitch),
  amplitude/energy (loudness),
  timing (articulation rate or rhythm).
These rules are followed to convey differences in meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location in frequency: F1, F2, etc.
In general, formant frequencies decrease as the vocal tract length increases:
  male speakers tend to have lower formants than female speakers, and
  female speakers tend to have lower formants than children.
Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
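The inverse relation between vocal tract length and formant frequency can be illustrated with the classical uniform lossless tube, closed at the glottis and open at the lips, whose resonances are the odd quarter-wavelength frequencies $F_k = (2k - 1)c / (4L)$. This tube model is not part of the slides; it is a standard idealization, and the lengths and sound speed below are illustrative assumptions rather than measured values.

```python
def tube_formants(length_m, n_formants=3, c=350.0):
    """Resonances (Hz) of a uniform lossless tube closed at one end.

    length_m: tube (vocal tract) length in meters
    c:        assumed speed of sound in m/s
    """
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n_formants + 1)]

# Illustrative lengths (assumptions): longer tract -> lower formants
for label, length in [("long tract, 17.5 cm", 0.175),
                      ("medium tract, 14.5 cm", 0.145),
                      ("short tract, 10.0 cm", 0.100)]:
    print(label, [round(f) for f in tube_formants(length)])
```

Every formant scales as $1/L$ in this idealization, so shortening the tube raises all formants by the same factor.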
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
  $u[n] = g[n] * p[n]$
where $*$ denotes convolution, $g[n]$ is the airflow over one glottal cycle, and $p[n]$ is a unit sample train with spacing $P$. When $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is
  $x[n] = h[n] * (g[n] * p[n])$
A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment
  $x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment is
  $X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$
  $X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$
where $\circledast$ denotes convolution over frequency, $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency (pitch).
Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)\, G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).
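The result of Example 3.2, that the windowed transform at the harmonics is set by the spectral envelope, can be checked numerically. The sketch below rests on several assumptions that are not in the slides: a synthetic glottal pulse and a single two-pole resonance stand in for $g[n]$ and $h[n]$, the window spans exactly six pitch periods, and the periodic form of the Hamming window (denominator $N_w$ rather than $N_w - 1$) is used so that leakage between harmonics vanishes at the harmonic bins.

```python
import numpy as np

fs = 8000            # assumed sampling rate (Hz)
P = 80               # pitch period in samples (100 Hz pitch)
N = 1600             # length of the synthetic signal

# Hypothetical glottal pulse g[n] and two-pole vocal tract resonance h[n]
n_g = np.arange(40)
g = n_g * 0.8 ** n_g
r, theta = np.exp(-np.pi * 100.0 / fs), 2.0 * np.pi * 500.0 / fs
m = np.arange(400)
h = (r ** m) * np.sin((m + 1) * theta) / np.sin(theta)

p = np.zeros(N)
p[::P] = 1.0                                  # p[n]: unit sample train, spacing P
hg = np.convolve(h, g)                        # h[n] * g[n]
x = np.convolve(hg, p)[:N]                    # x[n] = h[n] * (g[n] * p[n])

# Window covering exactly 6 pitch periods, placed after the start-up transient
Nw, start = 6 * P, 800
w = 0.54 - 0.46 * np.cos(2.0 * np.pi * np.arange(Nw) / Nw)   # periodic Hamming
X = np.fft.fft(w * x[start:start + Nw])       # DFT bin 6k falls on harmonic omega_k

k = np.arange(1, 11)                          # first ten harmonics
omega_k = 2.0 * np.pi * k / P
# Spectral envelope H(omega_k) G(omega_k), evaluated as the DTFT of h[n] * g[n]
envelope = np.exp(-1j * np.outer(omega_k, np.arange(len(hg)))) @ hg
predicted = np.abs(envelope) * w.sum() / P    # (1/P) |H G| W(0), with W(0) = sum of w
measured = np.abs(X[6 * k])                   # |X(omega_k, tau)| at the harmonics
```

Under these assumptions `measured` and `predicted` agree, which shows the harmonics acting as samples of $|H(\omega) G(\omega)|$ spaced $2\pi/P$ apart: the higher the pitch, the sparser the sampling.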
Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by
  the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)\, G(\omega)|$ sparsely. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  a formant corresponds to a vocal tract pole (resonant frequency);
  harmonics arise from the periodicity of the glottal source (pitch).
In signal processing algorithms that require formants, the sparse spectral sampling by the harmonics can be detrimental to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which raises the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, generating a louder sound.
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, opening the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  broadening of the formant bandwidths of the oral tract, because energy is lost through the nasal passage, and
  introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  build-up of pressure behind the palate, and
  abrupt release of the pressure.
Source Generation: Plosives - "Drop" (figure annotations: VOT, Aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate but does not completely impede the airflow; the resulting turbulence acts as a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source):
  there is no harmonic structure with these types of input;
  the source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise spectrum was idealized as flat. In reality these sources themselves have a non-flat spectral shape.
Source Generation: Fricatives - "NASA"
Source Generation
There is another class of source generated within the vocal tract that is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated by the friction of moving air against an oral tract constriction. Example: "thin".
  Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
However, these sound classes are not exclusively unvoiced: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above also exist for voiced sounds. Examples:
  "zebra" vs "sheba" (fricatives)
  "bin" vs "pin" (plosives)
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to spectral characteristics that fluctuate strongly over time. Example: the word "to".
A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example, the Hamming window:
  $w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n \le N_w - 1$
The window typically does not move one sample at a time; it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
  $X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$
where
  $x[n, \tau] = w[n, \tau]\, x[n]$
represents the windowed speech segments as a function of the window center at time $\tau$.
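The sliding-window transform described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `stft`, the choice to index frames by their start sample, and the test tone are assumptions. `np.hamming` implements the $0.54 - 0.46\cos[2\pi n/(N_w - 1)]$ taper.

```python
import numpy as np

def stft(x, win_len, hop, nfft):
    """Short-time Fourier transform: one DFT per window position.

    win_len: window length N_w in samples
    hop:     frame interval in samples (how far the window moves each step)
    nfft:    DFT length (frames are zero-padded to this length)
    """
    w = np.hamming(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.stack([w * x[s:s + win_len] for s in starts])
    return np.fft.rfft(frames, n=nfft, axis=1)   # shape: (num_frames, nfft // 2 + 1)

# Usage: a 1 kHz tone sampled at 8 kHz, 20 ms window, 10 ms frame interval
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2.0 * np.pi * 1000.0 * t)
X = stft(x, win_len=160, hop=80, nfft=512)
peak_bins = np.argmax(np.abs(X), axis=1)         # strongest bin in every frame
peak_hz = peak_bins * fs / 512
```

The frame interval `hop` trades computation against temporal detail: a smaller value gives more frames per second without changing the frequency resolution, which is set by the window length.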
Spectrographic Analysis of Speech
The spectrogram is displayed graphically as
  $S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a two-dimensional representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$ against frequency. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrogram:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
  $x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$
  $x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
  $S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$
where $\hat{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
Uses a "long" window, with a duration of typically at least two pitch periods.
Under the conditions that
  the main lobes of the shifted window Fourier transforms are non-overlapping, and
  the corresponding transform side-lobes are negligible,
the spectrogram expression on the previous slide reduces to the approximation (Exercise 3.8)
  $S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window, with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms of neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window is shorter than a pitch period, it "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
  $S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,
  $E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
The wideband spectrogram shows the formants of the vocal tract in frequency.
It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
Figure 3.15
Figure 3.16
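The contrast between the two window lengths can be reproduced on a synthetic voiced sound. The sketch below is illustrative only: the 200 Hz pitch, the single 600 Hz resonance standing in for the vocal tract, and the helper name `spectrogram` are assumptions, chosen so that the 20-ms window spans four pitch periods and the 4-ms window spans less than one.

```python
import numpy as np

fs, P, N = 8000, 40, 1600          # assumed: 8 kHz sampling, pitch period of 40 samples

# Synthetic voiced sound: impulse train through a hypothetical 600 Hz resonance
r, theta = np.exp(-np.pi * 300.0 / fs), 2.0 * np.pi * 600.0 / fs
m = np.arange(200)
h = (r ** m) * np.sin((m + 1) * theta) / np.sin(theta)
p = np.zeros(N)
p[::P] = 1.0
x = np.convolve(h, p)[:N]

def spectrogram(x, win_len, hop, nfft):
    """S = |X|^2 computed with a Hamming window of win_len samples."""
    w = np.hamming(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.stack([w * x[s:s + win_len] for s in starts])
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2

nfft = 1600                                                  # 5 Hz per DFT bin
S_narrow = spectrogram(x, win_len=160, hop=40, nfft=nfft)    # 20 ms window
S_wide = spectrogram(x, win_len=32, hop=4, nfft=nfft)        # 4 ms window

# Narrowband: energy peaks at harmonics (multiples of 200 Hz), valleys between them
frame = S_narrow[10]
peak_to_valley = frame[600 // 5] / frame[500 // 5]

# Wideband: energy per frame rises and falls once per pitch period
E = S_wide.sum(axis=1)
```

In the narrowband case the ratio between a harmonic and the midpoint between harmonics is large, which is what produces horizontal striations. In the wideband case `E` repeats every `P / hop` frames, which is what produces vertical striations.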
MATLAB
[Four figure slides. Each shows a plot titled SPECTROGRAM (Frequency [Hz], 0 to 8000, versus Time [s], 0 to 3). Two of them add the waveform (Normalized Magnitude versus Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
     the place of the tongue hump along the oral tract, and
     the degree of constriction of the hump.
     The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone for a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two factors (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (the waveform and the spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification of English speech sounds is in terms of
  vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open.
  Tongue position and height: front, central, or back along the palate.
  Constriction: partial or complete.
  Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English the combinations of features give 40 phonemes. Other languages yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features derived from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo
Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes
April 22 2023 Veton Keumlpuska 100
Prosody The Melody of Speech Prosody of a language is defined by the
rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)
These rules are followed to convey different Meaning Stress and Emotion
April 22 2023 Veton Keumlpuska 101
Figure 329 - Prosody
April 22 2023 Veton Keumlpuska 102
Figure 330 ndash Global Coarticulation
Vowels
Example 3.2
Consider a periodic glottal flow source of the form
$$u[n] = g[n] * p[n]$$
where g[n] is the airflow over one glottal cycle, p[n] is the unit sample train with spacing P, and * denotes convolution. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by
$$x[n] = h[n] * (g[n] * p[n])$$
A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment
$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$
Using the multiplication and convolution theorems, the Fourier transform of the speech segment is obtained as
$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k} \delta(\omega - \omega_k) \Big]$$
$$X(\omega, \tau) = \frac{1}{P} \sum_{k} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$
where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only the glottal contribution).
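The harmonic-sampling result of Example 3.2 can be checked numerically. The sketch below is only an illustration under stated assumptions: the combined response `H_HAT` (standing in for g[n] * h[n]), the pitch period `P`, and the window length `NW` are made-up toy values, the window position is fixed at τ = 0, and the sum runs over the P harmonics that fall in one 2π period. It compares the directly computed transform of the windowed periodic waveform with (1/P) Σ_k H(ω_k)G(ω_k) W(ω − ω_k).

```python
import cmath
import math


def dtft(seq, omega):
    """DTFT of a finite sequence seq[0..N-1] at radian frequency omega."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(seq))


def hamming(nw):
    """Hamming window of length nw (window position fixed at tau = 0)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


def periodic_extension(h_hat, period, length):
    """y[n] = sum over all integers m of h_hat[n - m*period], n = 0..length-1."""
    out = [0.0] * length
    for n in range(length):
        for i, v in enumerate(h_hat):
            if (n - i) % period == 0:  # sample n lies on a shifted copy of h_hat[i]
                out[n] += v
    return out


# Toy values (assumptions, not taken from the slides).
P = 20                                          # pitch period in samples
H_HAT = [1.0, 0.8, 0.5, 0.2, -0.1, -0.2, -0.1]  # stands in for g[n] * h[n]
NW = 64                                         # window length in samples
W = hamming(NW)
X_SEG = [w * y for w, y in zip(W, periodic_extension(H_HAT, P, NW))]


def harmonic_sum(omega):
    """(1/P) * sum_k H_hat(omega_k) * W(omega - omega_k) over one 2*pi period."""
    total = 0j
    for k in range(P):
        omega_k = 2 * math.pi * k / P
        total += dtft(H_HAT, omega_k) * dtft(W, omega - omega_k)
    return total / P


for omega in (0.0, 0.3, 1.0, 2.5):
    err = abs(dtft(X_SEG, omega) - harmonic_sum(omega))
    print(f"omega = {omega:.1f}  |direct - harmonic sum| = {err:.2e}")
```

The two sides agree to rounding error because a steady-state periodic waveform is fully described by its harmonics, so the window transform is simply replicated at each harmonic and weighted by the envelope value there.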
Example 3.2
Figure 3.11
Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ only sparsely. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonant frequency of the vocal tract) and a harmonic frequency.
  A formant corresponds to a vocal tract pole (resonance).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the sparse sampling of the spectral envelope by the harmonics can be detrimental to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
Figure 3.12
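The effect described in Example 3.3 can be sketched with a single two-pole resonator standing in for the first formant. Everything numeric here is an assumption chosen for illustration (sampling rate, formant bandwidth, a 700 Hz sung fundamental, and first-formant values of 350 Hz and 700 Hz); a real vocal tract has several formants and a non-flat source.

```python
import cmath
import math


def resonator_gain(freq_hz, formant_hz, bandwidth_hz=80.0, fs_hz=16000.0):
    """Magnitude response of a two-pole digital resonator, evaluated at freq_hz.

    The pole radius is set from the bandwidth and the pole angle from the
    formant frequency. The gain is left unnormalized.
    """
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)
    theta = 2 * math.pi * formant_hz / fs_hz
    z_inv = cmath.exp(-1j * 2 * math.pi * freq_hz / fs_hz)
    denom = 1 - 2 * r * math.cos(theta) * z_inv + r * r * z_inv * z_inv
    return 1.0 / abs(denom)


F0_HZ = 700.0                                        # assumed sung fundamental
untuned = resonator_gain(F0_HZ, formant_hz=350.0)    # F1 well below the first harmonic
tuned = resonator_gain(F0_HZ, formant_hz=700.0)      # F1 raised to meet the first harmonic

print(f"gain at first harmonic, F1 = 350 Hz: {untuned:.1f}")
print(f"gain at first harmonic, F1 = 700 Hz: {tuned:.1f}")
print(f"level increase: {20 * math.log10(tuned / untuned):.1f} dB")
```

In this toy setting the first harmonic is amplified much more strongly once the formant is moved onto it, which is the mechanism the example attributes to the widened jaw.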
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: “nose” and “mouse”.
Spectral Shaping: “Nose”
Spectral Shaping: “Mouse”
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
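As a rough illustration of what an anti-resonance does to a spectrum, the sketch below evaluates a toy transfer function with one pole pair (a resonance) and one zero pair (an anti-resonance). The frequencies, bandwidth, and sampling rate are assumptions picked for the example and are not measurements of any nasalized vowel.

```python
import cmath
import math


def second_order_section(freq_hz, bandwidth_hz, fs_hz):
    """Coefficients [1, a1, a2] of 1 + a1*z^-1 + a2*z^-2 with roots at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)
    theta = 2 * math.pi * freq_hz / fs_hz
    return [1.0, -2 * r * math.cos(theta), r * r]


def evaluate(coeffs, freq_hz, fs_hz):
    """Evaluate a polynomial in z^-1 on the unit circle at freq_hz."""
    z_inv = cmath.exp(-1j * 2 * math.pi * freq_hz / fs_hz)
    return sum(c * z_inv ** i for i, c in enumerate(coeffs))


def magnitude_db(freq_hz, pole_hz, zero_hz, fs_hz=8000.0, bandwidth_hz=100.0):
    """Magnitude (dB) of H(z) = zero section / pole section at freq_hz."""
    num = evaluate(second_order_section(zero_hz, bandwidth_hz, fs_hz), freq_hz, fs_hz)
    den = evaluate(second_order_section(pole_hz, bandwidth_hz, fs_hz), freq_hz, fs_hz)
    return 20 * math.log10(abs(num) / abs(den))


# Assumed values: resonance at 500 Hz, anti-resonance at 1000 Hz.
peak_db = magnitude_db(500.0, pole_hz=500.0, zero_hz=1000.0)
dip_db = magnitude_db(1000.0, pole_hz=500.0, zero_hz=1000.0)
print(f"level at the resonance:      {peak_db:.1f} dB")
print(f"level at the anti-resonance: {dip_db:.1f} dB")
```

The zero pulls the response down around its own frequency, which is how energy absorbed by a side branch shows up as a spectral dip.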
Plosives
Source Generation
In the previous section, the effect of vocal tract shape on sound production was discussed.
In Figure 3.10(b), a complete closure of the tract (the tongue pressing against the palate) is depicted. This closure is required when making an impulsive sound (a plosive):
  Build-up of pressure behind the palate, and
  Abrupt release of pressure.
Source Generation: Plosives, “Drop”
Figure annotations: VOT (voice onset time), aspiration
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow); the constriction generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source). There is no harmonic structure with these types of input; the source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, “NASA”
Source Generation
There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”.
  Plosives: created with an impulsive source within the oral tract. Example: “top”.
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”.
However, these subcategories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
  Fricatives: “zebra” vs. “sheba”.
  Plosives: “bin” vs. “pin”.
Categorization of Sound by Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word “to”.
  A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.
  In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example (Hamming window): $w[n, \tau] = 0.54 - 0.46\cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$, and zero otherwise.
  The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau]\, x[n]$$
represents the windowed speech segments as a function of the window center at time $\tau$.
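The short-time Fourier transform and the spectrogram built from it can be sketched in a few lines. This is a minimal illustration, not an efficient implementation: the function names, the uniform frequency grid on [0, π], the frame interval `hop`, and the test sinusoid are all assumptions, and τ is taken to be the first sample under the window, following the support of the Hamming formula above.

```python
import cmath
import math


def hamming(nw):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(nw-1)), n = 0..nw-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


def stft(x, nw, hop, n_freq):
    """Return (taus, omegas, frames) with frames[i][j] = X(omegas[j], taus[i]).

    X(omega, tau) = sum_n w[n - tau] * x[n] * exp(-j*omega*n), where the window
    covers samples tau .. tau + nw - 1 and advances by `hop` samples per frame.
    """
    w = hamming(nw)
    omegas = [math.pi * j / (n_freq - 1) for j in range(n_freq)]
    taus = list(range(0, len(x) - nw + 1, hop))
    frames = []
    for tau in taus:
        frame = []
        for omega in omegas:
            acc = 0j
            for n in range(tau, tau + nw):
                acc += w[n - tau] * x[n] * cmath.exp(-1j * omega * n)
            frame.append(acc)
        frames.append(frame)
    return taus, omegas, frames


def spectrogram(frames):
    """S(omega, tau) = |X(omega, tau)|^2 for every frame and frequency."""
    return [[abs(v) ** 2 for v in frame] for frame in frames]


# Test signal (assumed): a sinusoid at omega = pi/4, i.e. grid index 8 of 33.
x = [math.cos(math.pi / 4 * n) for n in range(256)]
taus, omegas, frames = stft(x, nw=64, hop=32, n_freq=33)
S = spectrogram(frames)
peak_bins = [row.index(max(row)) for row in S]
print("window positions:", taus)
print("peak frequency index per frame:", peak_bins)
```

For a stationary sinusoid every frame peaks at the same frequency index; for speech, the peak locations and their energies change from frame to frame, which is the variability a single whole-signal transform cannot show.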
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
  $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the “energy density” of the signal.
  For each window position $\tau$, one could plot $S(\omega, \tau)$ as a function of $\omega$.
  A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a “long” window with a duration of typically at least two pitch periods.
  Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide yields the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
  The wideband spectrogram is defined by a “short” window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms centered at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  From a temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window at time $\tau$:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
Figure 3.15
Figure 3.16
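The origin of the vertical striations can be illustrated by tracking the short-time energy under a 4-ms and a 20-ms Hamming window as each slides across one pitch period. The signal below is a made-up stand-in for a steady voiced sound (a decaying pulse repeated every pitch period), and the 8 kHz sampling rate and 10-ms pitch period are assumptions.

```python
import math


def hamming(nw):
    """Hamming window of length nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


def short_time_energy(x, nw, taus):
    """E[tau] = sum_n |w[n - tau] * x[n]|^2 for each window start tau."""
    w = hamming(nw)
    return [sum((w[i] * x[tau + i]) ** 2 for i in range(nw)) for tau in taus]


FS_HZ = 8000          # assumed sampling rate
PITCH_PERIOD = 80     # assumed pitch period: 10 ms at 8 kHz

# Toy steady-state "voiced" waveform: a decaying pulse every pitch period.
x = [0.9 ** (n % PITCH_PERIOD) for n in range(6 * PITCH_PERIOD)]

# Slide the window start across exactly one pitch period.
taus = range(2 * PITCH_PERIOD, 3 * PITCH_PERIOD)


def energy_fluctuation(window_ms):
    """Ratio of largest to smallest E[tau] over one pitch period."""
    nw = FS_HZ * window_ms // 1000
    energy = short_time_energy(x, nw, taus)
    return max(energy) / min(energy)


wideband = energy_fluctuation(4)      # 4-ms window: shorter than a pitch period
narrowband = energy_fluctuation(20)   # 20-ms window: two pitch periods
print(f"max/min energy, 4-ms window:  {wideband:.3g}")
print(f"max/min energy, 20-ms window: {narrowband:.3g}")
```

The short window alternately sits on a pulse and in the gap between pulses, so its energy swings widely once per pitch period; the long window always contains about two pulses, so its energy varies far less.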
MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]
MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance “She had your dark suit in greasy wash water all year”, annotated with word and phone labels.]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors are used to study speech sounds, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of
  Vowels,
  Consonants,
  Diphthongs,
  Affricates, and
  Semi-vowels.
This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open.
  Tongue position and height: front, central, or back along the palate.
  Constriction: partial or complete.
  Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the “click” language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: “butter”, “but”, and “to”, where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ among speakers; therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction by the tongue or lips determines which sound is produced:
  the back, center, or front of the oral tract, as well as
  the teeth or lips.
Spectrogram: noise-like; energy is concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the periodic signal u[n] excites a linear time-invariant vocal tract with impulse response h[n], giving the output at the lips
$$x_g[n] = h[n] * (g[n] * p[n])$$
The noise source component is the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
We assume in this simplified model that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
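The simplified voiced-fricative model of Example 3.4 can be sketched directly from its equations. The glottal pulse `g`, the impulse responses `h` and `hf`, the pitch period, and the Gaussian white noise are all toy assumptions chosen to keep the example short; `hf` is set to a unit impulse so that the modulation of the noise by the glottal flow is easy to see.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def impulse_train(length, period):
    """p[n] = 1 at multiples of the pitch period, 0 elsewhere."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, hf, period, length, seed=0):
    """Return (u, xg, xq, x) with x[n] = h*u + hf*(q.u), truncated to `length`."""
    rng = random.Random(seed)
    u = convolve(g, impulse_train(length, period))[:length]   # u = g * p
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]          # white noise (assumed Gaussian)
    xg = convolve(h, u)[:length]                              # periodic component
    xq = convolve(hf, [qn * un for qn, un in zip(q, u)])[:length]  # modulated noise
    x = [a + b for a, b in zip(xg, xq)]
    return u, xg, xq, x


# Toy parameters (assumptions): short glottal pulse, two-tap vocal tract,
# and a front cavity that passes the modulated noise unchanged.
g = [0.0, 0.5, 1.0, 0.5]
u, xg, xq, x = voiced_fricative(g, h=[1.0, 0.5], hf=[1.0], period=10, length=40)

for n in range(12):
    print(f"n={n:2d}  u={u[n]:.2f}  xg={xg[n]:+.3f}  xq={xq[n]:+.3f}  x={x[n]:+.3f}")
```

Because q[n] is multiplied by u[n], the noise component is present only while the glottal flow is nonzero, so the noise bursts are synchronized with the pitch period.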
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a “noisy” spectrum, while
  voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have present a periodic source throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.
  $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
We have assumed that the two outputs can be linearly combined.
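The generalized convolution of Example 3.5 can be sketched as follows. The code assumes causal responses (h[n, m] = 0 for m < 0) and inputs that are zero before n = 0, so each sum runs over m = 0..n. The particular responses `moving_pole` and `front_cavity`, and the pulse train used as u[n], are hypothetical stand-ins, not models taken from the slides.

```python
def time_varying_filter(h, x):
    """y[n] = sum_m h(n, m) * x[n - m], for causal h and x[n] = 0 when n < 0.

    h(n, m) is the response at time n to a unit sample applied m samples
    earlier, at time n - m.
    """
    return [sum(h(n, m) * x[n - m] for m in range(n + 1)) for n in range(len(x))]


def plosive_output(h, hf, u):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) delta[n - m]."""
    burst = [1.0] + [0.0] * (len(u) - 1)       # idealized burst: impulse at n = 0
    voiced = time_varying_filter(h, u)
    front = time_varying_filter(hf, burst)
    return [a + b for a, b in zip(voiced, front)]


def moving_pole(n, m):
    """Hypothetical vocal tract: one-pole response whose pole drifts with time n."""
    pole = 0.5 + 0.4 * min(n, 50) / 50.0
    return pole ** m


def front_cavity(n, m):
    """Hypothetical front cavity: fixed, quickly decaying one-pole response."""
    return 0.6 ** m


# Toy periodic source (assumption): unit pulses every 10 samples.
u = [1.0 if n % 10 == 0 else 0.0 for n in range(60)]
x = plosive_output(moving_pole, front_cavity, u)
print([round(v, 3) for v in x[:12]])
```

When h(n, m) does not depend on n, `time_varying_filter` reduces to ordinary convolution, which is a convenient sanity check; the burst term reduces to hf(n, n), the response at time n to the impulse applied at time 0.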
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Have a vowel-like nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Are characterized by movement from one vowel target to another.
  Examples: /Y/ as in “hide”, /W/ as in “out”, /O/ as in “boy”, /JU/ as in “new”.
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds,
  glides (/w/ as in “we” and /y/ as in “you”), and
  liquids (/r/ as in “read” and /l/ as in “let”).
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have
  a fricative portion preceded by a complete constriction of the oral cavity,
  formed at the same place as for the plosive.
Examples: /tS/ as in “chew” (unvoiced) and /J/ as in “just” (voiced).
Coarticulation
The vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, and so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: “horse” vs. “horseshoe”, “sweep” vs. “seep”.
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch),
  Amplitude/energy (loudness), and
  Timing (articulation rate or rhythm).
These rules are followed to convey different
  meaning,
  stress, and
  emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
April 22 2023 Veton Keumlpuska 32
Example 32 Consider a periodic glottal flow source of the form
u[n]=g[n]p[n]
Where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n] the vocal tract output is given by
x[n]=h[n](g[n]p[n])
A window center at time w[n] is applied to the vocal tract output to obtain the speech segment
x[n]=w[n]h[n](g[n]p[n])
Using Multiplication and Convolution Theorems Fourier transform of the speech segment representing frequency domain representation is obtained
April 22 2023 Veton Keumlpuska 33
Example 32
Where W() is the Fourier transform of w[n] and k=(2P)k and (2P) is fundamental frequency or pitch Figure 311 (next slide) illustrates that the spectral shaping of the
windowed transform at the harmonics 1 2 hellip N is determined by the spectral envelope |H()G()| - consisting of Glottal and Vocal tract contributions
(unlike example 31 consisting only of glottal contribution)
kkkk
kk
WGHP
X
GHWP
X
)()()(1)(
)()()()(1)(
April 22 2023 Veton Keumlpuska 34
Example 32
April 22 2023 Veton Keumlpuska 35
Example 32 The general upward or downward slope of the spectral
envelope also called spectral tilt is influenced by The nature of the glottal flow waveform over a cycle
eg a gradual or abrupt closing and by The manner in which formant tails add
Note also from the figure 311 that the formant locations are not always clear from the short-time Fourier transform magnitude |X()| because of sparse sampling of the spectral envelope |H()G()| by the source harmonics This is especially the case for high pitched speech
April 22 2023 Veton Keumlpuska 36
Spectral Shaping Previous example is important because
It illustrates the difference between Formant (resonance frequency of vocal tract) and Harmonic frequency
A formant corresponds to the vocal tract pole (resonant frequency)
Harmonics arise due to the periodicity of glottal source (pitch)
In developing signal processing algorithms that require formants the scarcity of spectral information can perhaps be detriment to formant estimation
On the other hand the spectral sampling harmonics can be exploited to enhance perception of sound (as in singing voice)
Example 3.3
- A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency ($F_1$) of the vowel being sung. As shown in Figure 3.12 (next slide), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
- To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which raises the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, generating a louder sound.
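The formant-tuning idea in Example 3.3 can be sketched numerically. This is a minimal illustration, not the model used in the course: it treats the first formant as a single two-pole digital resonator and evaluates its gain at the sung fundamental. The resonator form, the 16 kHz sampling rate, the 900 Hz fundamental, the 500 Hz and 900 Hz formant settings, and the 80 Hz bandwidth are all assumed values chosen for illustration.

```python
import cmath
import math

def resonator_gain(f, F, bw, fs):
    """|H| at frequency f (Hz) for H(z) = 1 / (1 - 2 r cos(theta) z^-1 + r^2 z^-2).

    F is the resonance (formant) frequency and bw its bandwidth, both in Hz.
    """
    r = math.exp(-math.pi * bw / fs)             # pole radius from bandwidth
    theta = 2 * math.pi * F / fs                 # pole angle from formant frequency
    z1 = cmath.exp(-1j * 2 * math.pi * f / fs)   # z^-1 evaluated on the unit circle
    return 1.0 / abs(1 - 2 * r * math.cos(theta) * z1 + r * r * z1 * z1)

fs = 16000.0    # assumed sampling rate (Hz)
f0 = 900.0      # assumed soprano fundamental (Hz)
bw = 80.0       # assumed formant bandwidth (Hz)

untuned = resonator_gain(f0, 500.0, bw, fs)   # F1 well below the fundamental
tuned = resonator_gain(f0, f0, bw, fs)        # F1 raised to match the fundamental

gain_db = 20 * math.log10(tuned / untuned)
print(f"gain at the fundamental rises by {gain_db:.1f} dB when F1 is tuned to it")
```

With the formant below the fundamental, the first harmonic falls on the skirt of the resonance; moving the formant onto the harmonic places it at the resonance peak, which is the effect the slide attributes to the widened jaw.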
Figure 3.12
Nasal Sounds
Spectral Shaping
- The nasal and oral components of the vocal tract are coupled by the velum.
- When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
- The resulting sounds have a spectrum dominated by the low-frequency formants of the large-volume nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "nose"
Spectral Shaping: "mouse"
Spectral Shaping
- Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
- The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
- Two dominant effects characterize nasalization:
  - broadening of the formant bandwidths of the oral tract, because energy is lost through the nasal passage, and
  - introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
- The previous section discussed the effect of vocal tract shape on sound production.
- Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  - build-up of pressure behind the palate, and
  - abrupt release of the pressure.
Source Generation: Plosives, "drop" (figure annotations: VOT, i.e., voice onset time, and aspiration)
Fricatives
Source Generation
- Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).
- As with the periodic glottal source, spectral shaping also occurs for either of these inputs (i.e., an impulse or noise source):
  - There is no harmonic structure with these types of inputs.
  - The source spectrum is shaped at all frequencies by $|H(\omega)|$.
- Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
- Another class of source is generated within the vocal tract but is less well understood than the noisy and impulsive sources at oral tract constrictions.
  - This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
  - A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  - Fricatives: generated by the friction of moving air against an oral tract constriction. Example: "thin".
  - Plosives: created with an impulsive source within the oral tract. Example: "top".
  - Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
- However, these source types are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above also exist for voiced sounds. Examples:
  - Fricatives: "zebra" (voiced) vs. "sheba" (unvoiced).
  - Plosives: "bin" (voiced) vs. "pin" (unvoiced).
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
- A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
- Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
- Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.
- The window $w[n]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
- Example, the Hamming window:
  $$w[n,\tau] = 0.54 - 0.46\cos\!\left[\frac{2\pi(n-\tau)}{N_w-1}\right] \quad \text{for } 0 \le n \le N_w-1$$
- The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
- The short-time Fourier transform is
  $$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n}$$
  where $x[n,\tau] = w[n,\tau]\,x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
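The windowing and transform steps on this slide can be sketched with NumPy as follows. This is a minimal illustration under assumptions not taken from the slides: the 8 kHz sampling rate, the 1 kHz test tone, the 20 ms window, the 10 ms frame interval, and the function names `hamming` and `stft` are all chosen here for the example. Each frame's DFT uses a time origin at the start of the frame, so it differs from the definition of the STFT by a linear phase factor; the magnitude is unaffected.

```python
import numpy as np

def hamming(Nw):
    """Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (Nw - 1)), 0 <= n <= Nw - 1."""
    n = np.arange(Nw)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (Nw - 1))

def stft(x, Nw, hop, nfft):
    """Return X[frame, bin]: one zero-padded DFT per window position, every `hop` samples."""
    w = hamming(Nw)
    starts = range(0, len(x) - Nw + 1, hop)
    frames = np.array([w * x[s:s + Nw] for s in starts])   # windowed segments
    return np.fft.rfft(frames, n=nfft, axis=1)

fs = 8000                                # assumed sampling rate (Hz)
t = np.arange(fs) / fs                   # one second of samples
x = np.sin(2 * np.pi * 1000 * t)         # assumed test signal: 1 kHz tone

X = stft(x, Nw=160, hop=80, nfft=256)    # 20 ms window, 10 ms frame interval
peak_bin = int(np.argmax(np.abs(X[0])))  # strongest bin in the first frame
peak_hz = peak_bin * fs / 256            # convert bin index to Hz
print(X.shape, peak_hz)
```

The frame interval `hop` sets the frame rate (here 100 frames per second), independently of the window length `Nw`, which sets the time-frequency resolution.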
Spectrographic Analysis of Speech
- The spectrogram is graphically displayed as
  $$S(\omega,\tau) = |X(\omega,\tau)|^2$$
- $S(\omega,\tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
- For each window position $\tau$, one could plot $S(\omega,\tau)$. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- Figure 3.14 illustrates this display (as a caricature), along with two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
- For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$, whose input is the glottal flow given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$:
  $$x[n,\tau] = w[n,\tau]\,\{(p[n] \ast g[n]) \ast h[n]\}$$
  $$x[n,\tau] = w[n,\tau]\,(p[n] \ast \tilde{h}[n])$$
  where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] \ast h[n]$.
- From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
  $$S(\omega,\tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega-\omega_k) \right|^2$$
  where $\tilde{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
- The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n]$.
- Narrowband spectrogram:
  - Uses a "long" window, with a duration of typically at least two pitch periods.
  - Under the conditions that (1) the main lobes of the shifted window Fourier transforms do not overlap and (2) the corresponding transform side lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):
    $$S(\omega,\tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega-\omega_k)|^2$$
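The step behind the narrowband approximation can be written out explicitly. The shorthand $a_k = \tilde{H}(\omega_k)$ and $W_k = W(\omega-\omega_k)$ is introduced here only for compactness and does not appear in the slides. Expanding the squared magnitude of the sum gives a set of squared terms plus cross terms; the two stated conditions make every cross term negligible.

```latex
\begin{aligned}
S(\omega,\tau)
  &= \frac{1}{P^2}\Big|\sum_k a_k W_k\Big|^2
   = \frac{1}{P^2}\Big[\sum_k |a_k|^2 |W_k|^2
     + \sum_{k \neq l} a_k a_l^{*}\, W_k W_l^{*}\Big] \\
  &\approx \frac{1}{P^2}\sum_k |\tilde{H}(\omega_k)|^2\,|W(\omega-\omega_k)|^2
   \quad \text{when } W_k W_l^{*} \approx 0 \text{ for } k \neq l .
\end{aligned}
```

The condition $W_k W_l^{*} \approx 0$ says that at any frequency at most one shifted window transform is significant, which is what non-overlapping main lobes and negligible side lobes provide.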
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
- The wideband spectrogram is defined by a short window, with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- With the wider transform, the window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window is shorter than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
  $$S(\omega,\tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$
  where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
  $$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
- Figure 3.15 (next slide) compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
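The narrowband/wideband contrast can be checked on a synthetic voiced sound. The sketch below is an illustration under assumptions that are not from the slides: an 8 kHz sampling rate, a 200 Hz pitch (period of 40 samples), and a single damped 1 kHz resonance standing in for $\tilde{h}[n]$. It uses the 20 ms and 4 ms Hamming windows mentioned above and measures (a) how strongly a harmonic stands out from the midpoint between harmonics and (b) how much the frame energy fluctuates over time.

```python
import numpy as np

def spectrogram(x, Nw, hop, nfft):
    """S[frame, bin] = |DFT of the Hamming-windowed frame|^2."""
    w = np.hamming(Nw)
    starts = range(0, len(x) - Nw + 1, hop)
    frames = np.array([w * x[s:s + Nw] for s in starts])
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2

fs, P = 8000, 40                       # assumed: 8 kHz sampling, 200 Hz pitch
n = np.arange(P)
h_tilde = np.exp(-0.1 * n) * np.cos(2 * np.pi * 1000 * n / fs)  # assumed resonance

p = np.zeros(fs)
p[::P] = 1.0                           # impulse train p[n], one second long
x = np.convolve(p, h_tilde)[:fs]       # x[n] = p[n] * h_tilde[n]

nfft = 800                             # bin spacing fs / nfft = 10 Hz
S_nb = spectrogram(x, Nw=160, hop=8, nfft=nfft)   # narrowband: 20 ms window
S_wb = spectrogram(x, Nw=32, hop=8, nfft=nfft)    # wideband: 4 ms window

def harmonic_contrast(S):
    """Time-averaged power at a harmonic (1000 Hz) over power midway (1100 Hz)."""
    m = S.mean(axis=0)
    return m[100] / m[110]

def energy_ripple(S):
    """Max/min over time of the summed bin power, a proxy for the frame energy."""
    E = S.sum(axis=1)
    return E.max() / E.min()

nb_contrast, wb_contrast = harmonic_contrast(S_nb), harmonic_contrast(S_wb)
nb_ripple, wb_ripple = energy_ripple(S_nb), energy_ripple(S_wb)
print(nb_contrast, wb_contrast, nb_ripple, wb_ripple)
```

A large harmonic contrast corresponds to the horizontal striations of the narrowband display, and a large energy ripple corresponds to the vertical striations of the wideband display.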
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram; axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000)]
MATLAB
[Figure: waveform (normalized magnitude, -1.5 to 2) above its spectrogram (Frequency [Hz], 0 to 8000; Time [s], 0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]
Categorization of Speech Sounds
- A sound source can be created with either the vocal folds or a constriction in the vocal tract.
- Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
     - the place of the tongue hump along the oral tract, and
     - the degree of constriction of the hump.
     - The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
- Phoneme: a fundamental distinctive unit of a language.
  - To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
- Different languages contain different phoneme sets.
  - Syllables contain one or more phonemes.
  - Words are formed from one or more syllables.
  - Phrases are concatenations of words.
- If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
- If the last two descriptors (waveform and spectrogram) are used, the study is referred to as acoustic phonetics.
Elements of a Language
- One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
- Figure 3.17 (next slide) illustrates this classification.
Figure 3.17
Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
- Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open.
  - Tongue position and height: front, central, or back along the palate.
  - Constriction: partial or complete.
  - Velum state: nasal sound or not a nasal sound.
Elements of a Language
- In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
- The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
- The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
- Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
- Source: quasi-periodic.
  - Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
- In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  - Articulatory differences among speakers are one cause of allophonic variation.
  - The place and degree of constriction of the tongue hump and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram:
  - Dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  - The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
- There are two broad classes of fricatives: voiced and unvoiced.
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
- System: the location of the constriction made by the tongue (at the back, center, or front of the oral tract) or by the teeth or lips determines which sound is produced.
- Spectrogram: noise-like; energy is concentrated at higher frequencies.
Example 3.4
- A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  $$u[n] = g[n] \ast p[n]$$
  where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
- Simplified model of the voiced fricative output at the lips due to the periodic signal $u[n]$:
  $$x_g[n] = h[n] \ast (g[n] \ast p[n])$$
  where $h[n]$ is the impulse response of a linear time-invariant vocal tract.
- The noise source component is the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:
  $$x_q[n] = h_f[n] \ast (q[n]\,u[n])$$
Example 3.4 (cont.)
- In this simplified model, we assume that the results of the two airflow sources add:
  $$x[n] = x_g[n] + x_q[n] = h[n] \ast u[n] + h_f[n] \ast (q[n]\,u[n])$$
- See Exercise 3.10 for special characteristics of $x[n]$.
- Issues that have been ignored:
  - $u[n]$ is modified by the oral cavity.
  - $x_q[n]$ can be influenced by the back cavity.
  - Sources of nonlinear effects (distributed sources due to traveling vortices).
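The simplified voiced-fricative model of Example 3.4 can be sketched directly from its equations. Everything numerical here is an assumption made for illustration: the sampling rate, the pitch period, the Hann-shaped glottal pulse standing in for $g[n]$, and the damped-cosine impulse responses standing in for $h[n]$ and $h_f[n]$ (the helper `resonance` is hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
fs, P, N = 8000, 80, 1600        # assumed: 8 kHz sampling, 100 Hz pitch, 200 ms

# Periodic glottal flow u[n] = g[n] * p[n]
g = np.hanning(40)               # assumed one-cycle glottal pulse (non-negative)
p = np.zeros(N)
p[::P] = 1.0                     # impulse train with pitch period P
u = np.convolve(p, g)[:N]

def resonance(F, decay, length=200):
    """Hypothetical impulse response: a damped cosine at F Hz."""
    m = np.arange(length)
    return np.exp(-decay * m) * np.cos(2 * np.pi * F * m / fs)

h = resonance(500.0, 0.02)       # assumed response of the whole vocal tract
hf = resonance(3000.0, 0.1)      # assumed response of the front oral cavity

q = rng.standard_normal(N)       # white noise source at the constriction

x_g = np.convolve(h, u)[:N]          # x_g[n] = h[n] * u[n]
x_q = np.convolve(hf, q * u)[:N]     # x_q[n] = h_f[n] * (q[n] u[n])
x = x_g + x_q                        # the two outputs are assumed to add
```

Because the noise is multiplied by $u[n]$ before filtering, the noise excitation is switched off whenever the glottal flow is zero, so the noisy component of the output pulses at the pitch rate.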
Elements of a Language: Fricatives
- Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
- Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
- Whisper forms a class of its own under the general category of consonants.
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
- Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
- System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
- Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
- With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  - While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  - After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  - The delay between the burst and the onset of voicing of the vowel is much shorter.
- Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
- A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is
  $$u[n] = g[n] \ast p[n]$$
  where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
- Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
- In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n,m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response $h_f[n,m]$.
- $h[n,m]$ and $h_f[n,m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n-m$.
- The output can then be written using a generalization of the convolution operator:
  $$x[n] = \sum_{m} h[n,m]\,u[n-m] + \sum_{m} h_f[n,m]\,\delta[n-m]$$
- We have assumed that the two outputs can be linearly combined.
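The generalized convolution of Example 3.5 can be sketched as a direct sum over $m$ at each output time $n$. The specifics below are assumptions for illustration only: the sampling rate, the response length `M`, the Hann-shaped glottal pulse, and the hypothetical helper `gliding_resonance`, which builds a damped cosine whose frequency moves linearly over the utterance to mimic a changing vocal tract shape.

```python
import numpy as np

fs = 8000          # assumed sampling rate (Hz)
N = 800            # 100 ms of output
M = 64             # assumed length of each impulse response h[n, 0..M-1]

def time_varying_output(v, h_of):
    """x[n] = sum_m h[n, m] v[n - m]; h_of(n) returns the array h[n, 0..M-1]."""
    x = np.zeros(len(v))
    for n in range(len(v)):
        hn = h_of(n)
        m = np.arange(min(len(hn), n + 1))   # keep only terms with n - m >= 0
        x[n] = np.dot(hn[m], v[n - m])
    return x

def gliding_resonance(f_start, f_end, decay):
    """Hypothetical h[n, m]: a damped cosine whose frequency depends on time n."""
    def h_of(n):
        f = f_start + (f_end - f_start) * n / (N - 1)
        m = np.arange(M)
        return np.exp(-decay * m) * np.cos(2 * np.pi * f * m / fs)
    return h_of

h_of = gliding_resonance(1200.0, 700.0, 0.05)    # whole tract (assumed glide)
hf_of = gliding_resonance(3000.0, 2500.0, 0.2)   # front cavity (assumed glide)

# Sources: burst delta[n] at n = 0 and periodic glottal flow u[n] = g[n] * p[n]
burst = np.zeros(N)
burst[0] = 1.0
p = np.zeros(N)
p[::80] = 1.0                     # assumed pitch period of 80 samples
u = np.convolve(p, np.hanning(40))[:N]

x_voiced = time_varying_output(u, h_of)       # sum_m h[n, m] u[n - m]
x_burst = time_varying_output(burst, hf_of)   # sum_m h_f[n, m] delta[n - m]
x = x_voiced + x_burst                        # outputs assumed to combine linearly
```

Since $\delta[n-m]$ is nonzero only for $m = n$, the burst term reduces to $h_f[n,n]$; and when `h_of` returns the same array for every $n$, `time_varying_output` reduces to ordinary convolution.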
Elements of a Language: Transitional Speech Sounds
- Diphthongs:
  - Vowel-like in nature, with vibrating vocal folds.
  - Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
  - Characterized by movement from one vowel target to another.
  - Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
- Semi-vowels: two categories of vowel-like sounds:
  - Glides (/w/ as in "we" and /y/ as in "you"), and
  - Liquids (/r/ as in "read" and /l/ as in "let").
- Compared to diphthongs, glides have:
  - greater constriction of the oral tract during the transition, and
  - greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
- Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
- The difference from fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
- Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced).
Coarticulation
- Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
- Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
- Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
- Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  - Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  - Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
- The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  - intonation (change in pitch),
  - amplitude/energy (loudness), and
  - timing (articulation rate or rhythm).
- These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
April 22 2023 Veton Keumlpuska 33
Example 32
Where W() is the Fourier transform of w[n] and k=(2P)k and (2P) is fundamental frequency or pitch Figure 311 (next slide) illustrates that the spectral shaping of the
windowed transform at the harmonics 1 2 hellip N is determined by the spectral envelope |H()G()| - consisting of Glottal and Vocal tract contributions
(unlike example 31 consisting only of glottal contribution)
kkkk
kk
WGHP
X
GHWP
X
)()()(1)(
)()()()(1)(
April 22 2023 Veton Keumlpuska 34
Example 32
April 22 2023 Veton Keumlpuska 35
Example 32 The general upward or downward slope of the spectral
envelope also called spectral tilt is influenced by The nature of the glottal flow waveform over a cycle
eg a gradual or abrupt closing and by The manner in which formant tails add
Note also from the figure 311 that the formant locations are not always clear from the short-time Fourier transform magnitude |X()| because of sparse sampling of the spectral envelope |H()G()| by the source harmonics This is especially the case for high pitched speech
April 22 2023 Veton Keumlpuska 36
Spectral Shaping Previous example is important because
It illustrates the difference between Formant (resonance frequency of vocal tract) and Harmonic frequency
A formant corresponds to the vocal tract pole (resonant frequency)
Harmonics arise due to the periodicity of glottal source (pitch)
In developing signal processing algorithms that require formants the scarcity of spectral information can perhaps be detriment to formant estimation
On the other hand the spectral sampling harmonics can be exploited to enhance perception of sound (as in singing voice)
April 22 2023 Veton Keumlpuska 37
Example 33 A soprano singer often signs a tone whose first
harmonic (fundamental frequency) (1) much higher than the first formant frequency (F1) of the vowel being sung As shown in the next figure (Figure 312) when the nulls of the vocal tract spectrum are sampled at the harmonics the resulting sound is weak especially in the face of competing instruments
To enhance the sound the singer creates a vocal tract configuration with a widened jaw which increases the first formant frequency (Exercise 34) and can match the frequency of the first harmonic thus generating a louder sound
April 22 2023 Veton Keumlpuska 38
Figure 312
Nasal Sounds
April 22 2023 Veton Keumlpuska 40
Spectral Shaping Nasal and oral components of the vocal tract are coupled
by the velum When the vocal tract velum is lowered ndash introducing
an opening into the nasal passage and Oral tract is shut off by the tongue or lipsSound propagates through the nasal passage and out
through the nose
The resulting sounds have a spectrum that is dominated by low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds Examples ldquonoserdquo and ldquomouserdquo
April 22 2023 Veton Keumlpuska 41
Spectral Shaping Nose
April 22 2023 Veton Keumlpuska 42
Spectral Shaping Mouse
April 22 2023 Veton Keumlpuska 43
Spectral Shaping Because the nasal cavity (unlike the oral tract) is
essentially constant characteristics of nasal sounds may be particularly useful in speaker identification
Velum can be lowered even when the vocal tract is open When this coupling occurs the resulting sound is said to be
nasalized (eg nasalized vowel) There are two dominant effects that characterize
nasalization Broadening of the formant bandwidth of oral tract because
of loss of energy through nasal passage Introduction of anti-resonances (ie zeros in the vocal tract
transfer function) due to the absorption of energy at the resonances of the nasal passage
Plosives
April 22 2023 Veton Keumlpuska 45
Source Generation In previous section the effect of vocal tract
shape in the sound production was discussed
In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure
April 22 2023 Veton Keumlpuska 46
Source Generation Plosives ldquoDroprdquo
VOT
Aspiration
Fricatives
April 22 2023 Veton Keumlpuska 48
Source Generation Another sound source is created when the tongue is
very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)
As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of
inputs The source spectrum is shaped at all frequencies by |H()|
Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape
April 22 2023 Veton Keumlpuska 49
Source Generation Fricatives ldquoNASArdquo
April 22 2023 Veton Keumlpuska 50
Source Generation There is another class of the source type that is
generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices
with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract
Vortex can be thought off as a tiny rotational airflow in the oral tract
There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds
April 22 2023 Veton Keumlpuska 51
Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal
source Unvoiced Speech sounds not generated with periodic
glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the
moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral
tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing
the vocal folds but without oscillations Example ldquoherdquo
However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example
ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives
April 22 2023 Veton Keumlpuska 52
Categorization of Sound By Source
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 54
Spectrographic Analysis of Speech Speech waveform consists of a sequence of
different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic
signal of the word ldquotordquo cannot capture this time-varying frequency content
In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability
April 22 2023 Veton Keumlpuska 55
Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding
(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to
avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1
Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal
wherex[n]= w[n]x[n]
represents the windowed speech segments as function of the window center at time
n
njenxX ][)(
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech
The spectrogram is displayed graphically as
  $S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$ as a separate curve.
A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.

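A spectrogram computed as the squared magnitude of the short-time Fourier transform can be sketched in a few lines. The test tone, window length, hop, and function name below are assumptions chosen for illustration; a practical implementation would normally use an FFT library rather than this naive DFT.

```python
import cmath
import math

def spectrogram(x, Nw, hop, nfft):
    """S[frame][k] = |X(k, tau)|^2 using a Hamming window of length Nw."""
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]
    S = []
    for tau in range(0, len(x) - Nw + 1, hop):
        seg = [w[n] * x[tau + n] for n in range(Nw)]
        row = []
        for k in range(nfft // 2 + 1):
            # Phase is referenced to the frame start; the magnitude is unaffected.
            X = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / nfft) for n in range(Nw))
            row.append(abs(X) ** 2)
        S.append(row)
    return S

# Steady tone at 0.125 cycles/sample (assumed), i.e. bin 8 of a 64-point DFT.
x = [math.cos(2 * math.pi * 0.125 * n) for n in range(256)]
S = spectrogram(x, Nw=64, hop=32, nfft=64)
peak_bins = [row.index(max(row)) for row in S]
```

Each row of `S` is one vertical slice of the displayed spectrogram; for this steady tone every slice peaks at the same frequency bin, which would appear as a single horizontal line.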
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram

Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$, driven by a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
  $x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
  $S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$
where $\hat{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the analysis window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms do not overlap, and
    the corresponding transform side lobes are negligible,
  the equation on the previous slide yields the following approximation (Exercise 3.8):
  $S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$

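One way to see where the narrowband approximation comes from is to expand the squared magnitude of the harmonic sum into its diagonal and cross terms. The following is a sketch of that step under the two stated conditions, not a full solution of Exercise 3.8:

```latex
\begin{align*}
\Bigl|\sum_{k} \hat{H}(\omega_k)\, W(\omega-\omega_k,\tau)\Bigr|^2
 &= \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega-\omega_k,\tau)|^2 \\
 &\quad + \sum_{k}\sum_{l \neq k} \hat{H}(\omega_k)\,\hat{H}^{*}(\omega_l)\,
      W(\omega-\omega_k,\tau)\, W^{*}(\omega-\omega_l,\tau)
\end{align*}
```

When the main lobes of the shifted window transforms do not overlap and the side lobes are negligible, the product $W(\omega-\omega_k,\tau)\,W^{*}(\omega-\omega_l,\tau)$ is approximately zero for $l \neq k$ at every frequency, so only the first sum survives.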
Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech
Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered on neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window is shorter than a pitch period, it essentially "sees" pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
  $S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
  $E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.

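The narrowband/wideband trade-off can be checked numerically on a synthetic voiced sound. This is a rough sketch under stated assumptions: an 8-kHz sampling rate, a 5-ms pitch period, and a single decaying 1-kHz resonance standing in for the combined glottal and vocal tract response are all invented for illustration. Only the 20-ms and 4-ms Hamming window durations come from the slide above.

```python
import cmath
import math

FS = 8000    # sampling rate in Hz (assumed)
P = 40       # pitch period in samples: 5 ms, i.e. 200 Hz (assumed)

def hamming(Nw):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]

# h_hat[n]: one decaying resonance near 1 kHz standing in for g[n] * h[n].
h_hat = [0.9 ** n * math.cos(2 * math.pi * 1000 * n / FS) for n in range(3 * P)]
# Steady-state voiced signal: h_hat repeated every P samples.
x = [sum(h_hat[n - k * P] for k in range(n // P + 1) if n - k * P < len(h_hat))
     for n in range(800)]

def power_spectrum(seg, nfft):
    """|X(k)|^2 for bins 0..nfft/2 of a (zero-padded) windowed segment."""
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * n / nfft)
                    for n, s in enumerate(seg))) ** 2
            for k in range(nfft // 2 + 1)]

def harmonic_contrast(Nw, tau=400, nfft=160):
    """Mean power at pitch harmonics divided by mean power midway between them."""
    w = hamming(Nw)
    S = power_spectrum([w[m] * x[tau + m] for m in range(Nw)], nfft)
    step = nfft // P                          # bins between harmonics (4 here)
    on = [S[k] for k in range(step, nfft // 2, step)]
    off = [S[k] for k in range(step // 2, nfft // 2, step)]
    return (sum(on) / len(on)) / (sum(off) / len(off))

def energy_ripple(Nw, hop=4):
    """Max/min of the windowed energy E[tau] as the window slides over two periods."""
    w = hamming(Nw)
    E = [sum((w[m] * x[tau + m]) ** 2 for m in range(Nw))
         for tau in range(400, 400 + 2 * P, hop)]
    return max(E) / min(E)

narrow = (harmonic_contrast(160), energy_ripple(160))   # 20-ms window
wide = (harmonic_contrast(32), energy_ripple(32))       # 4-ms window
```

With the long window the harmonic contrast is large and the energy barely changes with window position (horizontal striations). With the short window the spectrum is smooth across harmonics but the energy rises and falls every pitch period (vertical striations).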
Figure 3.15
Figure 3.16
[MATLAB figure: spectrogram; frequency axis 0 to 8000 Hz, time axis 0 to 3 s]
[MATLAB figure: normalized-magnitude waveform above a spectrogram (0 to 8000 Hz, 0 to 3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with word-level and phone-level labels]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
     the place of the tongue hump along the oral tract, and
     the degree of constriction of the hump.
   The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors of the preceding classification (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of:
  Vowels
  Consonants
  Diphthongs
  Affricates
  Semi-vowels
This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by:
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features obtained from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language
Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ among speakers; therefore the vocal tract formants vary with the speaker.

Figure 3.18
Figure 3.19

Elements of a Language
Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  It is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.

Figure 3.20
Figure 3.21

Elements of a Language
Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System: the location of the constriction by the tongue or lips determines which sound is produced:
  the back, center, or front of the oral tract, as well as
  the teeth or lips.
Spectrogram: noise-like; energy is concentrated at higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  $u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
In a simplified model of the voiced fricative, the periodic contribution to the output at the lips is
  $x_g[n] = h[n] * (g[n] * p[n])$
where $h[n]$ is the impulse response of the vocal tract, modeled as linear and time-invariant and excited by the periodic signal $u[n]$.
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ and assumed to be white noise. The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:
  $x_q[n] = h_f[n] * (q[n]\, u[n])$

Example 3.4
In the simplified model we assume that the results of the two airflow sources add:
  $x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).

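The simplified voiced-fricative model of Example 3.4 can be simulated directly. Everything numeric in this sketch is a made-up placeholder: the raised-sine glottal pulse, the two decaying-cosine impulse responses, the noise level, and the pitch period are assumptions chosen only to make the structure of the model visible.

```python
import math
import random

def convolve(a, b, length):
    """First `length` samples of the linear convolution a * b."""
    out = [0.0] * length
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue                      # skip zero samples (speed only)
        for j, bj in enumerate(b):
            if i + j < length:
                out[i + j] += ai * bj
    return out

N, P = 200, 40                                              # length, pitch period (assumed)
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]          # impulse train p[n]
g = [math.sin(math.pi * n / 16) ** 2 for n in range(16)]    # toy glottal pulse g[n]
u = convolve(p, g, N)                                       # u[n] = g[n] * p[n]

h = [0.95 ** n * math.cos(0.2 * math.pi * n) for n in range(60)]   # vocal tract (toy)
hf = [0.6 ** n * math.cos(0.8 * math.pi * n) for n in range(20)]   # front cavity (toy)

random.seed(0)
q = [random.gauss(0.0, 0.1) for _ in range(N)]              # white noise q[n]
noise_excitation = [qn * un for qn, un in zip(q, u)]        # q[n] u[n]

x_g = convolve(u, h, N)                                     # h[n] * u[n]
x_q = convolve(noise_excitation, hf, N)                     # h_f[n] * (q[n] u[n])
x = [a + b for a, b in zip(x_g, x_q)]                       # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by the glottal flow, the noise excitation is switched off whenever the glottal flow is zero, so the frication in this model is modulated at the pitch rate.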
Elements of a Language
Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum.
  Voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language
Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives
Figure 3.23

Elements of a Language
Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language
Plosives: Waveform

Elements of a Language
Plosives: Spectrogram

Elements of a Language
Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$.
The glottal flow velocity model for the periodic source component is
  $u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, because its shape changes during the transition from the burst to a following steady vowel.
  This implies that the vocal tract output cannot be obtained with the convolution operator.
  The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response $h_f[n, m]$.
  $h[n, m]$ and $h_f[n, m]$ represent the responses at time $n$ to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
  $x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$
We have assumed that the two outputs can be linearly combined.

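The generalized convolution of Example 3.5 can be evaluated directly once the time-varying responses are given as functions of (n, m). In this sketch the gliding resonance for the vocal tract, the fixed front-cavity response, the three-sample glottal pulse, and all constants are hypothetical stand-ins; only the summation itself follows the model above.

```python
import math

def time_varying_output(h, u, num_samples, max_lag):
    """x[n] = sum_m h(n, m) u[n - m], taking u[n] = 0 for n < 0."""
    return [sum(h(n, m) * u[n - m] for m in range(min(n, max_lag) + 1))
            for n in range(num_samples)]

N, P, LAG = 120, 30, 40      # length, pitch period, response length (all assumed)

g = [0.5, 1.0, 0.5]          # toy glottal pulse g[n]
u = [0.0] * N
for k in range(0, N, P):     # u[n] = g[n] * p[n]
    for j, gj in enumerate(g):
        if k + j < N:
            u[k + j] += gj
delta = [1.0] + [0.0] * (N - 1)      # idealized burst at n = 0

def h(n, m):
    """Toy vocal tract whose resonance glides from 0.5*pi to 0.2*pi rad/sample."""
    theta = math.pi * (0.5 - 0.3 * min(n, 60) / 60)
    return 0.9 ** m * math.cos(theta * m)

def h_f(n, m):
    """Toy front-cavity response (high-frequency resonance, fixed shape)."""
    return 0.7 ** m * math.cos(0.8 * math.pi * m)

periodic_part = time_varying_output(h, u, N, LAG)     # sum_m h[n, m] u[n - m]
burst_part = time_varying_output(h_f, delta, N, LAG)  # sum_m h_f[n, m] delta[n - m]
x = [a + b for a, b in zip(periodic_part, burst_part)]
```

Two sanity checks follow from the definition: the burst term collapses to h_f[n, n], and if h[n, m] does not depend on n the first sum reduces to an ordinary convolution.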
Elements of a Language
Transitional Speech Sounds: Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language
Transitional Speech Sounds: Semi-vowels
Two categories of vowel-like sounds:
  Glides (w as in "we" and y as in "you")
  Liquids (r as in "read" and l as in "let")
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language
Transitional Speech Sounds: Affricates
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
Compared to fricatives, affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
  tS as in "chew" (unvoiced)
  J as in "just" (voiced)

Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  intonation (change in pitch),
  amplitude/energy (loudness), and
  timing (articulation rate or rhythm).
These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation

Example 3.2

Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  the manner in which the formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the spectral envelope $|H(\omega)G(\omega)|$ is sparsely sampled by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping
The previous example is important because it illustrates the difference between
  a formant (a resonant frequency of the vocal tract) and
  a harmonic frequency.
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be detrimental to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3
A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung.
As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12
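The effect described in Example 3.3 can be illustrated by sampling a resonance curve at the harmonics of the sung pitch. The single second-order resonance, its Q factor, and the 700-Hz and 400-Hz values are hypothetical numbers picked for the sketch; they are not taken from Figure 3.12.

```python
import math

def resonance_gain(f, formant_hz, q_factor=10.0):
    """Magnitude response of a single second-order resonance (toy formant model)."""
    r = f / formant_hz
    return 1.0 / math.sqrt((1.0 - r * r) ** 2 + (r / q_factor) ** 2)

def harmonic_gains(f0, formant_hz, num_harmonics=5):
    """Spectral envelope sampled at the harmonics k * f0, k = 1..num_harmonics."""
    return [resonance_gain(k * f0, formant_hz) for k in range(1, num_harmonics + 1)]

F0 = 700.0                                        # sung fundamental in Hz (assumed)
low_f1 = harmonic_gains(F0, formant_hz=400.0)     # F1 well below the first harmonic
tuned_f1 = harmonic_gains(F0, formant_hz=700.0)   # F1 raised to the first harmonic
```

With F1 below the fundamental, no harmonic lands near the resonance peak and every sampled gain is small; moving F1 up to the first harmonic places that harmonic on the peak, which is the sense in which the tuned configuration sounds louder.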
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds.
  Examples: "nose" and "mouse".

Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage, and
  introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  build-up of pressure behind the palate, and
  abrupt release of the pressure.

Source Generation
Plosives: "Drop"
  VOT (voice onset time)
  Aspiration

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow); the constriction generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation
Fricatives: "NASA"

Source Generation
There is another class of source that is generated within the vocal tract; however, it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these source subcategories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with an impulsive or noisy source, so the same subcategories exist for voiced sounds. Examples:
  "zebra" vs. "sheba" (fricatives)
  "bin" vs. "pin" (plosives)
April 22 2023 Veton Keumlpuska 52
Categorization of Sound By Source
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 54
Spectrographic Analysis of Speech Speech waveform consists of a sequence of
different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic
signal of the word ldquotordquo cannot capture this time-varying frequency content
In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability
April 22 2023 Veton Keumlpuska 55
Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding
(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to
avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1
Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal
wherex[n]= w[n]x[n]
represents the windowed speech segments as function of the window center at time
n
njenxX ][)(
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech The spectrogram is graphically displayed as
S() = |X()|2
S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal
For each window position one could plot S() A better and more compact representation of time-frequency
display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page
This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms
Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies
Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
Elements of a Language
One broad classification of English speech sounds is in terms of:
  Vowels
  Consonants
  Diphthongs
  Affricates
  Semi-vowels
This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open.
  Tongue position and height: front, central, or back along the palate.
  Constriction: partial or complete.
  Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are generally not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features derived from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  Articulatory differences among speakers are one cause of allophonic variation:
    the place and degree of constriction of the tongue hump, and
    the cross-section and length of the vocal tract.
  Therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  It is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives the vocal folds are relaxed and not vibrating.
  For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System: the location of the constriction made by the tongue or lips determines which sound is produced:
  back, center, or front of the oral tract, as well as
  the teeth or lips.
Spectrogram: noise-like, with energy concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n] u[n])$$
Example 3.4 (cont.)
In the simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])$$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
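The simplified voiced-fricative model of Example 3.4 can be sketched numerically. The following is only an illustration under stated assumptions: the raised-cosine glottal pulse, the decaying-cosine impulse responses standing in for h[n] and hf[n], the pitch period of 80 samples, and all function names are hypothetical choices, not taken from the slides.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def resonator(freq, decay, length):
    """Decaying cosine used as a stand-in impulse response (assumption)."""
    return [(decay ** n) * math.cos(freq * n) for n in range(length)]


def voiced_fricative(num_samples=400, pitch_period=80, seed=0):
    rng = random.Random(seed)
    # p[n]: impulse train with pitch period P
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    # g[n]: glottal flow over one cycle (raised-cosine pulse, assumed shape)
    g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 39)) for n in range(40)]
    # u[n] = g[n] * p[n] (convolution), truncated to num_samples
    u = convolve(g, p)[:num_samples]
    # q[n]: white noise source at the constriction
    q = [rng.gauss(0.0, 0.1) for _ in range(num_samples)]
    h = resonator(0.3, 0.95, 60)   # whole vocal tract (assumed)
    hf = resonator(2.0, 0.80, 30)  # front oral cavity (assumed)
    # x_g[n] = h[n] * u[n]
    x_g = convolve(h, u)[:num_samples]
    # The noise is modulated by the glottal flow before exciting the front cavity:
    # x_q[n] = hf[n] * (q[n] u[n])
    qu = [qn * un for qn, un in zip(q, u)]
    x_q = convolve(hf, qu)[:num_samples]
    # x[n] = x_g[n] + x_q[n]
    x = [a + b for a, b in zip(x_g, x_q)]
    return u, x_g, x_q, x


u, x_g, x_q, x = voiced_fricative()
```

Because q[n] is multiplied by u[n], the noise excitation vanishes whenever the glottal flow is zero, so the noise component of x[n] comes in bursts at the pitch rate rather than being stationary.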
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while
  voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Whisper forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of the air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all 4 steps.
  During the period when the oral tract is closed we hear a low-frequency vibration due to the propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction with impulse response $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m] u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \delta[n - m]$$
We have assumed that the two outputs can be linearly combined.
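The generalized convolution of Example 3.5 can be evaluated directly from its definition. The sketch below is hypothetical: `time_varying_output`, `max_lag`, and the two example responses (`gliding_tract` for h[n, m] and `front_cavity` for hf[n, m]) are illustrative names and shapes, chosen only to show how the sum is formed.

```python
import math


def time_varying_output(u, h, hf, max_lag):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) delta[n - m].

    h(n, m) and hf(n, m) give the response at time n to a unit sample
    applied m samples earlier; both are truncated to lags 0..max_lag.
    """
    x = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m] for m in range(min(n, max_lag) + 1))
        # delta[n - m] is nonzero only at m = n, so the burst term is hf(n, n)
        burst = hf(n, n) if n <= max_lag else 0.0
        x.append(periodic + burst)
    return x


def gliding_tract(n, m, start=0.3, end=0.6, glide=200, decay=0.9):
    """Assumed h[n, m]: a resonance whose frequency glides from start to end."""
    frac = min(n / glide, 1.0)
    omega = start + (end - start) * frac
    return (decay ** m) * math.cos(omega * m)


def front_cavity(n, m, omega=2.0, decay=0.7):
    """Assumed hf[n, m]: a fixed, quickly decaying front-cavity resonance."""
    return (decay ** m) * math.cos(omega * m)


# Periodic source u[n]: a unit impulse every 50 samples (a crude glottal train)
u_demo = [1.0 if n % 50 == 0 else 0.0 for n in range(300)]
x_demo = time_varying_output(u_demo, gliding_tract, front_cavity, max_lag=60)
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is a convenient sanity check for the time-varying form.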
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds.
  Glides (w as in "we" and y as in "you"), and
  Liquids (r as in "read" and l as in "let").
Glides, compared to diphthongs, have:
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have:
  a fricative portion preceded by a complete constriction of the oral cavity,
  formed at the same place as for the plosive.
Examples:
  tS as in "chew" (unvoiced)
  J as in "just" (voiced)
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different:
  Meaning
  Stress
  Emotion
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Example 3.2
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ only sparsely. This is especially the case for high-pitched speech.
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency.
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In signal processing algorithms that require formants, the sparsity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency ($F_1$) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
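The effect in Example 3.3 can be illustrated by sampling a spectral envelope at the harmonics of the pitch. This is a rough sketch only: the single two-pole digital resonator used as the vocal tract, the 100 Hz bandwidth, the 16 kHz sampling rate, and the function names are all assumptions made for illustration.

```python
import cmath
import math


def resonator_gain(f, formant, bandwidth, fs):
    """Magnitude response at f (Hz) of a two-pole resonator centered at formant (Hz)."""
    r = math.exp(-math.pi * bandwidth / fs)      # pole radius from bandwidth
    theta = 2.0 * math.pi * formant / fs         # pole angle from center frequency
    z_inv = cmath.exp(-1j * 2.0 * math.pi * f / fs)
    denom = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r ** 2) * z_inv ** 2
    return 1.0 / abs(denom)


def harmonic_samples(f0, formant, bandwidth=100.0, fs=16000.0, count=4):
    """Envelope values seen by the first `count` harmonics k * f0."""
    return [resonator_gain(k * f0, formant, bandwidth, fs) for k in range(1, count + 1)]


# Soprano pitch of 1000 Hz with the first formant at 500 Hz (below the pitch)
low_formant = harmonic_samples(1000.0, 500.0)
# Same pitch after the formant is raised to coincide with the first harmonic
matched_formant = harmonic_samples(1000.0, 1000.0)
```

With these assumed values the first harmonic falls well off the resonance peak in the first case and on the peak in the second, so its amplitude is substantially larger once the formant is tuned to the pitch.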
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
  When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds.
  Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage, and
  introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
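The two effects of nasalization can be mimicked with a toy pole-zero transfer function. This is a minimal sketch, not a model of a real nasal tract: the second-order sections, the formant and zero frequencies, the bandwidths, and the names `second_order` and `nasalized_gain` are assumptions made for illustration.

```python
import cmath
import math


def second_order(freq_hz, bandwidth_hz, fs, z_inv):
    """1 - 2 r cos(theta) z^-1 + r^2 z^-2 for a conjugate root pair."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    return 1.0 - 2.0 * r * math.cos(theta) * z_inv + r * r * z_inv * z_inv


def nasalized_gain(f, formant_hz, zero_hz=None, fs=16000.0,
                   formant_bw=100.0, zero_bw=100.0):
    """Magnitude at f (Hz) of one formant (pole pair) with an optional zero pair."""
    z_inv = cmath.exp(-2j * math.pi * f / fs)
    gain = 1.0 / abs(second_order(formant_hz, formant_bw, fs, z_inv))
    if zero_hz is not None:
        # An anti-resonance multiplies the response by a zero pair
        gain *= abs(second_order(zero_hz, zero_bw, fs, z_inv))
    return gain


# Effect 1: a broader formant bandwidth lowers the resonance peak
narrow_peak = nasalized_gain(500.0, 500.0, formant_bw=100.0)
broad_peak = nasalized_gain(500.0, 500.0, formant_bw=300.0)

# Effect 2: a zero near 1500 Hz removes energy around that frequency
without_zero = nasalized_gain(1500.0, 500.0)
with_zero = nasalized_gain(1500.0, 500.0, zero_hz=1500.0)
```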
Plosives
Source Generation
The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  build-up of pressure behind the palate, and
  abrupt release of the pressure.
Source Generation: Plosives, "Drop" (figure labels: VOT (voice onset time), Aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); this generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the spectrum of the noise was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these subcategories are not exclusive to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples:
  "zebra" vs. "sheba" (fricatives)
  "bin" vs. "pin" (plosives)
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
  This window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: the Hamming window
$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n \le N_w - 1$$
  The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau] x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
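The windowing and transform steps above can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the names `hamming` and `stft`, the frame interval `hop`, and the DFT grid size `n_freq` are hypothetical, and each frame is indexed from the window start, so the phase differs from $X(\omega, \tau)$ by a linear term while the magnitude is unchanged.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2 pi n / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft(x, n_w, hop, n_freq):
    """Short-time Fourier transform on a grid of n_freq // 2 + 1 frequencies.

    The window advances by `hop` samples (the frame interval); frame i
    starts at sample i * hop.
    """
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        segment = [w[n] * x[tau + n] for n in range(n_w)]  # windowed segment
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / n_freq)
                for n in range(n_w))
            for k in range(n_freq // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


# A sinusoid at 8 cycles per 64 samples, analyzed with a 64-sample window
x_sine = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft(x_sine, n_w=64, hop=32, n_freq=64)
magnitudes = [abs(v) for v in frames[0]]
```

Squaring the magnitudes of each frame gives one column of a spectrogram; a smaller `hop` gives a higher frame rate at a proportionally higher cost.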
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \} = w[n, \tau] ( p[n] * \tilde{h}[n] )$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k) W(\omega - \omega_k, \tau) \right|^2$$
where $\tilde{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side lobes are negligible,
  the following approximation follows from the equation on the previous slide (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2 |W(\omega - \omega_k, \tau)|^2$$
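A small numerical check of the narrowband behavior: with a window spanning several pitch periods, the spectrogram slice peaks at the harmonics and is nearly empty between them. The sketch is illustrative only; the periodic test signal (a decaying pulse repeated every 40 samples), the window length of four pitch periods, and the name `spectrogram_slice` are assumptions.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def spectrogram_slice(x, tau, n_w, n_freq):
    """|X|^2 of the windowed segment starting at tau, on an n_freq-point DFT grid."""
    w = hamming(n_w)
    segment = [w[n] * x[tau + n] for n in range(n_w)]
    return [
        abs(sum(segment[n] * cmath.exp(-2j * math.pi * k * n / n_freq)
                for n in range(n_w))) ** 2
        for k in range(n_freq // 2 + 1)
    ]


P = 40  # pitch period in samples (assumed)
# Periodic repetition of a decaying pulse, standing in for p[n] * h~[n]
x_voiced = [0.8 ** (n % P) for n in range(400)]

# Long window: four pitch periods, so harmonic k lands on DFT bin 4 * k
S_narrow = spectrogram_slice(x_voiced, tau=0, n_w=4 * P, n_freq=4 * P)
```

Here the bins 4, 8, 12, ... correspond to the harmonics $\omega_k = 2\pi k / P$, and the bins halfway between them fall near the nulls of the shifted window transforms.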
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta |\tilde{H}(\omega)|^2 E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
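The energy term $E[\tau]$ can be computed directly to see why a short window produces a fluctuation at the pitch rate. This is a sketch under stated assumptions: the periodic test signal, the 10-sample window (shorter than the assumed 40-sample pitch period), and the name `frame_energy` are illustrative.

```python
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def frame_energy(x, tau, n_w):
    """E[tau]: energy of the windowed segment that starts at sample tau."""
    w = hamming(n_w)
    return sum((w[n] * x[tau + n]) ** 2 for n in range(n_w))


P = 40  # pitch period in samples (assumed)
# Periodic repetition of a decaying pulse, standing in for steady voiced speech
x_voiced = [0.8 ** (n % P) for n in range(200)]

# Short window (10 samples, less than one pitch period), moved one sample at a time
E = [frame_energy(x_voiced, tau, 10) for tau in range(120)]
```

The sequence `E` repeats every pitch period and is large only when the window overlaps the start of a pulse, which is the fluctuation that scales the envelope $|\tilde{H}(\omega)|^2$ over time.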
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
Figure 3.15
Figure 3.16
[Figure (MATLAB): spectrogram, frequency 0 to 8000 Hz vs. time 0 to 3 s.]
[Figure (MATLAB): waveform (normalized magnitude vs. time) and spectrogram (frequency 0 to 8000 Hz vs. time) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels.]
[Figure (MATLAB): spectrogram, frequency 0 to 8000 Hz vs. time 0 to 3 s.]
[Figure (MATLAB): waveform and spectrogram of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels.]
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps. While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and the delay between the burst and the onset of voicing in the vowel is much shorter.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives - Waveform
Elements of a Language: Plosives - Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is
$u[n] = g[n] * p[n]$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary (time-invariant) convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n,m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n,m].
h[n,m] and h_f[n,m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
Assuming that the two outputs combine linearly, the output can be written using a generalization of the convolution operator:
$x[n] = \sum_{m} h[n,m]\, u[n-m] + \sum_{m} h_f[n,m]\, \delta[n-m]$
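The generalized convolution of Example 3.5 can be sketched numerically. The following is only an illustration under stated assumptions: the responses are taken to be causal with inputs that are zero before n = 0, and the toy one-pole responses, the glottal pulse, and all function names are hypothetical choices that do not come from the slides.

```python
# Illustrative sketch only: all responses and pulse shapes below are
# made-up toy choices (assumptions), not values from the slides.

def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def convolve(a, b, length):
    """First `length` samples of the ordinary (time-invariant) convolution a * b."""
    out = [0.0] * length
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            if i + j < length:
                out[i + j] += a_i * b_j
    return out


def time_varying_output(h, hf, u, burst):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) burst[n-m].

    h(n, m) and hf(n, m) give the response at time n to a unit sample
    applied m samples earlier (at time n - m). Inputs are taken to be
    zero before n = 0, so the sums run over m = 0..n.
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):
            acc += h(n, m) * u[n - m] + hf(n, m) * burst[n - m]
        x.append(acc)
    return x


N, P = 32, 8
g = [1.0, 0.6, 0.2]                      # toy glottal pulse (assumption)
u = convolve(impulse_train(N, P), g, N)  # u[n] = g[n] * p[n]
burst = [1.0] + [0.0] * (N - 1)          # burst idealized as delta[n]


def h_tv(n, m):
    """Toy vocal tract response whose decay slows as n grows (assumption)."""
    return (0.5 + 0.4 * n / (N - 1)) ** m


def hf_tv(n, m):
    """Toy front-cavity response, also slowly changing with n (assumption)."""
    return (0.3 + 0.2 * n / (N - 1)) ** m


x = time_varying_output(h_tv, hf_tv, u, burst)
```

When h(n, m) does not depend on n, the double loop reduces to ordinary convolution, which is a convenient sanity check for an implementation of this kind.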
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.
They do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: "hide" /Y/, "out" /W/, "boy" /O/, "new" /JU/
Elements of a Language: Transitional Speech Sounds
Semi-vowels
Two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").
Glides, compared to diphthongs, have greater constriction of the oral tract during the transition and greater speed of oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape, but often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local - the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  Global - the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  intonation (change in pitch);
  amplitude/energy (loudness);
  timing (articulation rate or rhythm).
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  a formant corresponds to a vocal tract pole (resonant frequency);
  harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information (the vocal tract spectrum is observed only at the harmonics) can be a detriment to formant estimation.
On the other hand, the spectral sampling at the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).
Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency, $\omega_1$) is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
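The effect in Example 3.3 can be illustrated with a rough numerical sketch. It assumes a vocal tract reduced to a single two-pole resonator (one formant only), and the sampling rate, pitch, bandwidth, and formant values are hypothetical numbers chosen for illustration rather than data from the example.

```python
import cmath
import math


def resonator_gain(freq_hz, formant_hz, bandwidth_hz, fs_hz):
    """|H| at freq_hz for a two-pole resonator centered near formant_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)       # pole radius
    theta = 2.0 * math.pi * formant_hz / fs_hz           # pole angle
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs_hz)   # z^-1 on the unit circle
    denom = 1.0 - 2.0 * r * math.cos(theta) * z_inv + r * r * z_inv * z_inv
    return abs(1.0 / denom)


def harmonic_gains(f0_hz, formant_hz, bandwidth_hz, fs_hz, count):
    """Vocal tract gain sampled at the first `count` harmonics of f0_hz."""
    return [resonator_gain(k * f0_hz, formant_hz, bandwidth_hz, fs_hz)
            for k in range(1, count + 1)]


FS, F0, BW = 16000.0, 700.0, 80.0                # hypothetical values
untuned = harmonic_gains(F0, 300.0, BW, FS, 5)   # F1 well below the pitch
tuned = harmonic_gains(F0, F0, BW, FS, 5)        # F1 raised to the first harmonic
print("gain at first harmonic, F1 = 300 Hz:", round(untuned[0], 1))
print("gain at first harmonic, F1 = 700 Hz:", round(tuned[0], 1))
```

In this toy model the gain at the first harmonic is markedly larger once the formant is moved onto it, which is the qualitative behavior the example describes.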
Figure 3.12
Nasal Sounds
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum dominated by the low-frequency formants of the large-volume nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage;
  introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive), which involves:
  build-up of pressure behind the palate;
  abrupt release of the pressure.
Source Generation: Plosives, "Drop"
Figure labels: VOT (voice onset time), Aspiration
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
Another class of source is generated within the vocal tract, but it is less well understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives - sounds generated by the friction of moving air against an oral tract constriction. Example: "thin".
  Plosives - created with an impulsive source within the oral tract. Example: "top".
  Whispers - a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
However, these source types are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
  "zebra" vs. "sheba" - fricatives;
  "bin" vs. "pin" - plosives.
Categorization of Sound by Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window, w[n], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example - Hamming window:
$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$
The window does not necessarily move one sample at a time; it typically moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$X(\omega, \tau) = \sum_{n} x[n, \tau]\, e^{-j\omega n}$
where
$x[n, \tau] = w[n, \tau]\, x[n]$
represents the windowed speech segments as a function of the window position $\tau$.
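The windowing and transform steps above can be sketched in a few lines. This is a minimal illustration, assuming the window starts at sample tau and that frequency is evaluated on a uniform grid of num_bins points; the function names, the test tone, and the sizes are hypothetical and chosen only to keep the example small.

```python
import cmath
import math


def hamming(nw):
    """Hamming window of length nw (nw >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (nw - 1)) for n in range(nw)]


def stft_frame(x, tau, window, num_bins):
    """X(omega_k, tau) at omega_k = 2*pi*k/num_bins for the window starting at tau."""
    frame = []
    for k in range(num_bins):
        omega = 2.0 * math.pi * k / num_bins
        acc = 0j
        for i, w_i in enumerate(window):
            n = tau + i                       # absolute time index
            acc += w_i * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame


def spectrogram(x, window, hop, num_bins):
    """S(omega_k, tau) = |X(omega_k, tau)|^2 for tau = 0, hop, 2*hop, ..."""
    last_start = len(x) - len(window)
    return [[abs(v) ** 2 for v in stft_frame(x, tau, window, num_bins)]
            for tau in range(0, last_start + 1, hop)]


NUM_BINS, NW, HOP = 64, 32, 16                    # hypothetical sizes
tone = [math.cos(2.0 * math.pi * 8 * n / NUM_BINS) for n in range(128)]
S = spectrogram(tone, hamming(NW), HOP, NUM_BINS)
# Strongest bin in each frame, searched over the non-negative frequencies.
peak_bins = [max(range(NUM_BINS // 2 + 1), key=frame.__getitem__) for frame in S]
```

The direct summation is slow but mirrors the definition term by term; a practical implementation would normally use an FFT for each frame.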
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$ as a function of frequency. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  Narrowband - gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband - gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that voiced speech was approximated as the output of a linear time-invariant system with impulse response h[n], whose input is the glottal flow given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$
$x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$
where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n].
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the spectrogram equation on the previous slide leads to the following approximation (Exercise 3.8):
$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$
Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech
Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
Widening the window transform causes the window transforms at neighboring harmonics to overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$E[\tau] = \sum_{n} |x[n, \tau]|^2$
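The role of the energy term can be illustrated with a small sketch. It assumes a synthetic "voiced" signal made of one decaying pulse per pitch period and, to keep the arithmetic exact, a rectangular window instead of the Hamming window used elsewhere in these slides; the function name and all values are hypothetical.

```python
def short_time_energy(x, window, hop=1):
    """E[tau] = sum_n (w[n - tau] x[n])^2 for each window start tau."""
    nw = len(window)
    return [sum((window[i] * x[tau + i]) ** 2 for i in range(nw))
            for tau in range(0, len(x) - nw + 1, hop)]


P = 40                                            # pitch period in samples (assumption)
x = [0.9 ** (n % P) for n in range(10 * P)]       # one decaying pulse per pitch period
e_short = short_time_energy(x, [1.0] * (P // 4))  # window shorter than one period
e_long = short_time_energy(x, [1.0] * (2 * P))    # window spanning two periods
```

With the short window the energy rises and falls once per pitch period, which corresponds to the vertical striations of a wideband display; with the two-period window the energy is essentially constant from frame to frame.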
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
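The two window durations quoted above can be translated into sample counts and an approximate frequency resolution. This sketch assumes a 16 kHz sampling rate and a 125 Hz pitch (both hypothetical), and uses the common rule of thumb that the null-to-null main lobe of a Hamming window of duration T is roughly 4/T Hz wide; the classification thresholds simply restate the "at least two pitch periods" and "less than one pitch period" guidelines from these slides.

```python
def window_summary(duration_ms, fs_hz, pitch_hz):
    """Sample count, approximate Hamming main-lobe width, and spectrogram type."""
    n_samples = round(duration_ms * 1e-3 * fs_hz)
    main_lobe_hz = 4.0 * fs_hz / n_samples        # approx. null-to-null width
    pitch_periods = duration_ms * 1e-3 * pitch_hz
    if pitch_periods >= 2.0:
        kind = "narrowband"
    elif pitch_periods < 1.0:
        kind = "wideband"
    else:
        kind = "intermediate"
    return {"samples": n_samples, "main_lobe_hz": main_lobe_hz,
            "pitch_periods": pitch_periods, "kind": kind}


FS_HZ, PITCH_HZ = 16000.0, 125.0                  # hypothetical values
narrow = window_summary(20.0, FS_HZ, PITCH_HZ)    # 20-ms window
wide = window_summary(4.0, FS_HZ, PITCH_HZ)       # 4-ms window
```

Under these assumptions the 20-ms window has a main lobe comparable to the harmonic spacing, while the 4-ms window has a main lobe several harmonics wide, which is why it traces the envelope instead of the individual harmonics.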
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram titled "SPECTROGRAM", Frequency [Hz] (0 to 8000) versus Time [s] (0 to 3)]
[Figure: waveform (Normalized Magnitude) and spectrogram (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction at the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme - a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.
If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open.
  Tongue position and height: front, central, or back along the palate.
  Constriction: partial or complete.
  Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are generally not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception - uses articulatory features, obtained from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers:
  articulatory differences among speakers are one cause of allophonic variations;
  the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ among speakers, and therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  It is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction by the tongue (at the back, center, or front of the oral tract), as well as at the teeth or lips, determines which sound is produced.
Spectrogram: noise-like; energy is concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$u[n] = g[n] * p[n]$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
A simplified model of the voiced fricative output at the lips due to the periodic source is
$x_g[n] = h[n] * (g[n] * p[n])$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
$x_q[n] = h_f[n] * (q[n]\,u[n])$
Example 3.4 (continued)
In this simplified model we assume that the outputs due to the two airflow sources add:
$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$
Here * denotes convolution and q[n]u[n] is a sample-by-sample product.
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity;
  x_q[n] can be influenced by the back cavity;
  sources of nonlinear effects (distributed sources due to traveling vortices).
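The voiced fricative model of Example 3.4 can be simulated with a short sketch. Everything numerical here is an assumption made for illustration: the glottal pulse shape, the two impulse responses, the pitch period, and the use of seeded Gaussian noise for q[n] are hypothetical stand-ins, and the helper name lti is not from the slides.

```python
import math
import random


def lti(h, x):
    """First len(x) samples of the convolution h * x."""
    return [sum(h[k] * x[n - k] for k in range(min(len(h), n + 1)))
            for n in range(len(x))]


random.seed(0)                                     # reproducible noise
N, P = 200, 50                                     # length and pitch period (assumptions)
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]
g = [0.2, 0.7, 1.0, 0.7, 0.2]                      # toy glottal pulse over one cycle
u = lti(g, p)                                      # u[n] = g[n] * p[n]
q = [random.gauss(0.0, 1.0) for _ in range(N)]     # white noise source q[n]

h = [0.95 ** n * math.cos(0.3 * n) for n in range(60)]   # toy vocal tract response
hf = [(-0.7) ** n for n in range(20)]                     # toy front-cavity response

modulated = [q[n] * u[n] for n in range(N)]        # noise gated by the glottal flow
x_g = lti(h, u)                                    # periodic component
x_q = lti(hf, modulated)                           # noise component
x = [a + b for a, b in zip(x_g, x_q)]              # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by u[n], it is present only while the glottis is open in each pitch period, so the noise component of the output is itself pitch-synchronous in this toy model.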
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept
introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]
h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m
The output then can be written using generalization of the convolution operator
We have assumed that two outputs can be linearly combined
mnmnhmnumnhnxm
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language Transitional Speech Sounds Diphthongs
Vowel like nature with vibrating vocal folds
Do not have a steady vocal tract configuration They are produced by
varying in time the vocal tract smoothly between two vowel configurations
Characterized by movement from one vowel target to another
hide Y out W boy O new JU
April 22 2023 Veton Keumlpuska 96
Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like
sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)
Glides Greater constriction of oral tract during the
transition and Greater speed of the oral tract movement
compared to diphthongs
April 22 2023 Veton Keumlpuska 97
Figure 328 ndash Liquids amp Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language Transitional Speech Sounds Affricates are the counterpart of
diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that
the affricates have A fricative portion preceded by a complete
constriction of the oral cavity Formed at the same place as for the plosive
Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced
April 22 2023 Veton Keumlpuska 99
Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a
target state or shape often the target is never reached Our speech anatomy cannot move to a desired position
instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and
graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level
Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo
Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes
April 22 2023 Veton Keumlpuska 100
Prosody The Melody of Speech Prosody of a language is defined by the
rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)
These rules are followed to convey different Meaning Stress and Emotion
April 22 2023 Veton Keumlpuska 101
Figure 329 - Prosody
April 22 2023 Veton Keumlpuska 102
Figure 330 ndash Global Coarticulation
April 22 2023 Veton Keumlpuska 37
Example 33 A soprano singer often signs a tone whose first
harmonic (fundamental frequency) (1) much higher than the first formant frequency (F1) of the vowel being sung As shown in the next figure (Figure 312) when the nulls of the vocal tract spectrum are sampled at the harmonics the resulting sound is weak especially in the face of competing instruments
To enhance the sound the singer creates a vocal tract configuration with a widened jaw which increases the first formant frequency (Exercise 34) and can match the frequency of the first harmonic thus generating a louder sound
April 22 2023 Veton Keumlpuska 38
Figure 312
Nasal Sounds
April 22 2023 Veton Keumlpuska 40
Spectral Shaping Nasal and oral components of the vocal tract are coupled
by the velum When the vocal tract velum is lowered ndash introducing
an opening into the nasal passage and Oral tract is shut off by the tongue or lipsSound propagates through the nasal passage and out
through the nose
The resulting sounds have a spectrum that is dominated by low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds Examples ldquonoserdquo and ldquomouserdquo
April 22 2023 Veton Keumlpuska 41
Spectral Shaping Nose
April 22 2023 Veton Keumlpuska 42
Spectral Shaping Mouse
April 22 2023 Veton Keumlpuska 43
Spectral Shaping Because the nasal cavity (unlike the oral tract) is
essentially constant characteristics of nasal sounds may be particularly useful in speaker identification
Velum can be lowered even when the vocal tract is open When this coupling occurs the resulting sound is said to be
nasalized (eg nasalized vowel) There are two dominant effects that characterize
nasalization Broadening of the formant bandwidth of oral tract because
of loss of energy through nasal passage Introduction of anti-resonances (ie zeros in the vocal tract
transfer function) due to the absorption of energy at the resonances of the nasal passage
Plosives
April 22 2023 Veton Keumlpuska 45
Source Generation In previous section the effect of vocal tract
shape in the sound production was discussed
In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure
April 22 2023 Veton Keumlpuska 46
Source Generation Plosives ldquoDroprdquo
VOT
Aspiration
Fricatives
April 22 2023 Veton Keumlpuska 48
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the spectrum of the noise was idealized as flat. In reality, these sources themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
Another class of source is generated within the vocal tract, but it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
However, these subcategories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the same subcategories also exist for voiced sounds. Examples:
  Fricatives: "zebra" (voiced) vs. "sheba" (unvoiced).
  Plosives: "bin" (voiced) vs. "pin" (unvoiced).
Categorization of Sound by Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
  The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: the Hamming window, $w[n, \tau] = 0.54 - 0.46\cos[2\pi (n - \tau)/(N_w - 1)]$ for $0 \le n - \tau \le N_w - 1$.
  The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau]\, x[n]$$
represents the windowed speech segments as a function of the window position $\tau$.
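The definitions above can be tried on a toy signal. The following is a minimal sketch, not the course's MATLAB code: it builds a Hamming window, evaluates $X(\omega, \tau)$ for a window starting at sample $\tau$, and advances the window by a frame interval. The test signal, window length, and frame interval are assumptions chosen for illustration.

```python
import cmath
import math

def hamming(nw):
    """Hamming window: 0.54 - 0.46 cos(2 pi n / (Nw - 1)) for 0 <= n <= Nw - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]

def stft_frame(x, tau, nw, omega):
    """X(omega, tau): Fourier transform of x under a window starting at sample tau."""
    w = hamming(nw)
    return sum(w[i] * x[tau + i] * cmath.exp(-1j * omega * (tau + i))
               for i in range(nw))

# Toy signal (assumed): a single sinusoid at omega0 radians per sample.
omega0 = 0.2 * math.pi
x = [math.cos(omega0 * n) for n in range(400)]

nw = 160             # window length in samples (assumed)
frame_interval = 80  # the window moves 80 samples per frame (assumed)
for tau in range(0, len(x) - nw + 1, frame_interval):
    on_peak = abs(stft_frame(x, tau, nw, omega0))
    off_peak = abs(stft_frame(x, tau, nw, 0.6 * math.pi))
    print(tau, round(on_peak, 2), round(off_peak, 4))
```

For every frame, the magnitude at the sinusoid's own frequency is far larger than at a distant frequency, because the tapered window keeps side-lobe leakage small.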
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
  $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
  For each window position $\tau$ one could plot $S(\omega, \tau)$ as a function of frequency.
  A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
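As a rough illustration of the time-frequency display, the sketch below computes $S(\omega, \tau) = |X(\omega, \tau)|^2$ on a grid of frequencies for each window position and reports where the energy peaks. The two-tone test signal, the window length, and the function names are assumptions made for this example; a practical implementation would use an FFT rather than this direct sum.

```python
import cmath
import math

def hamming(nw):
    """Hamming window of length nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]

def spectrogram(x, nw, hop, nbins):
    """Return S[frame][k] = |X(omega_k, tau)|^2, omega_k = 2 pi k / nbins, k = 0..nbins/2."""
    w = hamming(nw)
    frames = []
    for tau in range(0, len(x) - nw + 1, hop):
        row = []
        for k in range(nbins // 2 + 1):
            omega = 2 * math.pi * k / nbins
            X = sum(w[i] * x[tau + i] * cmath.exp(-1j * omega * (tau + i))
                    for i in range(nw))
            row.append(abs(X) ** 2)
        frames.append(row)
    return frames

# Toy signal (assumed): a low tone for 200 samples, then a higher tone.
NBINS = 64
low = 2 * math.pi * 4 / NBINS    # falls exactly on frequency bin 4
high = 2 * math.pi * 12 / NBINS  # falls exactly on frequency bin 12
x = [math.cos(low * n) for n in range(200)] + \
    [math.cos(high * n) for n in range(200, 400)]

S = spectrogram(x, nw=64, hop=32, nbins=NBINS)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in S]
print(peak_bins)
```

The peak bin moves from the low tone to the high tone as the window slides across the signal, which a single Fourier transform of the whole signal could not show.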
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech the waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$, driven by a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$$
$$x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side lobes are negligible,
  the equation on the previous slide yields the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  From the temporal perspective, since the window is shorter than a pitch period, it essentially "sees" pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
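The role of the energy term can be checked with a small experiment. This sketch, under assumed toy parameters, computes $E[\tau]$ for a periodic waveform using a window shorter than the pitch period and a window three pitch periods long. The decaying-resonance waveform and all numeric values are hypothetical.

```python
import math

def hamming(nw):
    """Hamming window of length nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]

def short_time_energy(x, nw):
    """E[tau] = sum over n of (w[n - tau] x[n])^2, for every window start tau."""
    w = hamming(nw)
    return [sum((w[i] * x[tau + i]) ** 2 for i in range(nw))
            for tau in range(len(x) - nw + 1)]

# Toy voiced waveform (assumed): a decaying resonance restarted every P samples.
P = 80
x = [(0.9 ** (n % P)) * math.cos(0.3 * math.pi * (n % P)) for n in range(800)]

short_window = short_time_energy(x, 20)   # shorter than one pitch period
long_window = short_time_energy(x, 240)   # three pitch periods

short_ratio = max(short_window) / min(short_window)
long_ratio = max(long_window) / min(long_window)
print(f"max/min of E[tau], short window: {short_ratio:.1f}")
print(f"max/min of E[tau], long window:  {long_ratio:.3f}")
```

With the short window, $E[\tau]$ rises and falls once per pitch period, while with the long window it stays nearly constant as the window slides.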
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
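A quick calculation shows why these two durations behave so differently. The sketch below assumes a 16 kHz sampling rate and a 100 Hz pitch (neither is stated on the slide) and uses the common approximation that a Hamming window of duration $T$ has a null-to-null main-lobe width of about $4/T$ Hz. A harmonic is treated as separable when its neighbor lies at or beyond the first null of its main lobe.

```python
def window_summary(duration_ms, fs_hz, pitch_hz):
    """Summarize a Hamming analysis window for a voiced sound with the given pitch."""
    samples = fs_hz * duration_ms // 1000
    # Approximate null-to-null main-lobe width of a Hamming window: 4 / T (in Hz).
    main_lobe_hz = 4000.0 / duration_ms
    pitch_periods = duration_ms * pitch_hz / 1000.0
    # Neighboring harmonics are pitch_hz apart; each main lobe extends
    # main_lobe_hz / 2 to either side of its harmonic.
    harmonics_separable = main_lobe_hz / 2 <= pitch_hz
    return {
        "samples": samples,
        "main_lobe_hz": main_lobe_hz,
        "pitch_periods": pitch_periods,
        "harmonics_separable": harmonics_separable,
    }

FS_HZ = 16000     # assumed sampling rate
PITCH_HZ = 100    # assumed pitch of the voiced sound

narrowband = window_summary(20, FS_HZ, PITCH_HZ)
wideband = window_summary(4, FS_HZ, PITCH_HZ)
print("narrowband (20 ms):", narrowband)
print("wideband   (4 ms): ", wideband)
```

Under these assumptions the 20-ms window spans two pitch periods and its main lobe just fits between neighboring harmonics, while the 4-ms window spans less than half a pitch period and its main lobe covers several harmonics at once.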
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram. Axes: time 0 to 3 s, frequency 0 to 8000 Hz.]
MATLAB
[Figure: waveform (normalized magnitude) and spectrogram (frequency 0 to 8000 Hz), time 0 to 3 s, for the sentence "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   the place of the tongue hump along the oral tract, and
   the degree of constriction of the hump.
   The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open.
  Tongue position and height: front, central, or back along the palate.
  Constriction: partial or complete.
  Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features yield 40 phonemes. Other languages can have a smaller or larger number:
  11 in Polynesian.
  141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by a precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds of a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  Articulatory differences between speakers are one cause of allophonic variation:
    the place and degree of constriction of the tongue hump, and
    the cross-section and length of the vocal tract.
  Consequently, the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Dominated by the low resonance of the large-volume nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System:
  The location of the constriction made by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram:
  Noise-like.
  Energy is concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
  $g[n]$ is the glottal flow over one cycle.
  $p[n]$ is an impulse train with pitch period $P$.
Simplified model of the voiced fricative output at the lips due to the periodic source:
$$x_g[n] = h[n] * (g[n] * p[n])$$
  $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\, u[n])$$
Example 3.4 (cont.)
In the simplified model, we assume that the outputs due to the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
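The simplified model of Example 3.4 can be simulated directly. The sketch below is only an illustration: the raised-cosine glottal pulse, the two decaying resonances standing in for the vocal tract and the front cavity, and the seeded Gaussian noise are all assumed stand-ins, not values from the example.

```python
import math
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def decaying_resonance(radius, omega, length):
    """Toy impulse response: radius^n cos(omega n)."""
    return [radius ** n * math.cos(omega * n) for n in range(length)]

N, P = 400, 80                      # signal length and pitch period (assumed)
g = [0.5 * (1 - math.cos(2 * math.pi * n / 19)) for n in range(20)]  # one glottal cycle
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]                   # impulse train
u = convolve(g, p)[:N]              # u[n] = g[n] * p[n]

h = decaying_resonance(0.95, 0.15 * math.pi, 60)    # vocal tract (assumed)
h_f = decaying_resonance(0.80, 0.70 * math.pi, 30)  # front oral cavity (assumed)

rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(N)]         # white noise source

x_g = convolve(h, u)[:N]                            # periodic component
modulated_noise = [q[n] * u[n] for n in range(N)]   # glottal flow modulates the noise
x_q = convolve(h_f, modulated_noise)[:N]            # noise component
x = [a + b for a, b in zip(x_g, x_q)]               # the two outputs add

print(len(x), round(max(abs(v) for v in x), 3))
```

Because the noise is multiplied by $u[n]$, it is present only while the glottal flow is nonzero, so the noise component comes in bursts at the pitch rate.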
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum.
  Voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Whisper forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration.
  The delay between the burst and the voicing of the vowel onset is much shorter.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is
$$u[n] = g[n] * p[n]$$
  $g[n]$ is the glottal flow over one cycle.
  $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, because its shape changes during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
  $h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
We have assumed that the two outputs can be linearly combined.
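The generalized convolution of Example 3.5 is straightforward to evaluate numerically. In this sketch the time-varying responses are hypothetical: $h[n, m]$ is a decaying resonance whose frequency glides during an assumed transition interval, and $h_f[n, m]$ is a fixed decaying resonance. Only the summation itself follows the example.

```python
import math

def time_varying_output(h, u, n_len, m_len):
    """x[n] = sum over m of h(n, m) u[n - m], with h(n, m) truncated to m < m_len."""
    out = []
    for n in range(n_len):
        out.append(sum(h(n, m) * u[n - m] for m in range(min(m_len, n + 1))))
    return out

N, M, P = 240, 60, 80   # signal length, response length, pitch period (assumed)
TRANSITION = 120        # samples taken to reach the steady vowel (assumed)

def h(n, m):
    """Vocal tract response at time n to a unit sample applied m samples earlier."""
    frac = min(n / TRANSITION, 1.0)
    omega = (0.10 + 0.15 * frac) * math.pi   # resonance glides toward the vowel
    return (0.95 ** m) * math.cos(omega * m)

def h_f(n, m):
    """Front-cavity response; kept time-invariant here for simplicity."""
    return (0.80 ** m) * math.cos(0.70 * math.pi * m)

# Periodic source u[n] = g[n] * p[n], built by placing one glottal cycle every P samples.
g = [0.5 * (1 - math.cos(2 * math.pi * n / 19)) for n in range(20)]
u = [0.0] * N
for start in range(0, N, P):
    for i, gi in enumerate(g):
        if start + i < N:
            u[start + i] += gi

delta = [1.0] + [0.0] * (N - 1)   # idealized burst at n = 0

x_periodic = time_varying_output(h, u, N, M)
x_burst = time_varying_output(h_f, delta, N, M)
x = [a + b for a, b in zip(x_periodic, x_burst)]
print(len(x), round(x[0], 3))
```

For the impulse input the sum collapses to a single term, so the burst contribution at time $n$ is simply $h_f[n, n]$.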
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds.
  Glides (w as in "we" and y as in "you"), and
  Liquids (r as in "read" and l as in "let").
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have
  a fricative portion preceded by a complete constriction of the oral cavity,
  formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced); J as in "just" (voiced).
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe"; "sweep" vs. "seep".
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch),
  Amplitude/energy (loudness),
  Timing (articulation rate or rhythm).
These rules are followed to convey differences in meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
April 22 2023 Veton Keumlpuska 38
Figure 312
Nasal Sounds
April 22 2023 Veton Keumlpuska 40
Spectral Shaping Nasal and oral components of the vocal tract are coupled
by the velum When the vocal tract velum is lowered ndash introducing
an opening into the nasal passage and Oral tract is shut off by the tongue or lipsSound propagates through the nasal passage and out
through the nose
The resulting sounds have a spectrum that is dominated by low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds Examples ldquonoserdquo and ldquomouserdquo
April 22 2023 Veton Keumlpuska 41
Spectral Shaping Nose
April 22 2023 Veton Keumlpuska 42
Spectral Shaping Mouse
April 22 2023 Veton Keumlpuska 43
Spectral Shaping Because the nasal cavity (unlike the oral tract) is
essentially constant characteristics of nasal sounds may be particularly useful in speaker identification
Velum can be lowered even when the vocal tract is open When this coupling occurs the resulting sound is said to be
nasalized (eg nasalized vowel) There are two dominant effects that characterize
nasalization Broadening of the formant bandwidth of oral tract because
of loss of energy through nasal passage Introduction of anti-resonances (ie zeros in the vocal tract
transfer function) due to the absorption of energy at the resonances of the nasal passage
Plosives
April 22 2023 Veton Keumlpuska 45
Source Generation In previous section the effect of vocal tract
shape in the sound production was discussed
In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure
April 22 2023 Veton Keumlpuska 46
Source Generation Plosives ldquoDroprdquo
VOT
Aspiration
Fricatives
April 22 2023 Veton Keumlpuska 48
Source Generation Another sound source is created when the tongue is
very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)
As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of
inputs The source spectrum is shaped at all frequencies by |H()|
Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape
April 22 2023 Veton Keumlpuska 49
Source Generation Fricatives ldquoNASArdquo
April 22 2023 Veton Keumlpuska 50
Source Generation There is another class of the source type that is
generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices
with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract
Vortex can be thought off as a tiny rotational airflow in the oral tract
There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds
April 22 2023 Veton Keumlpuska 51
Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal
source Unvoiced Speech sounds not generated with periodic
glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the
moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral
tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing
the vocal folds but without oscillations Example ldquoherdquo
However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example
ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives
April 22 2023 Veton Keumlpuska 52
Categorization of Sound By Source
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 54
Spectrographic Analysis of Speech Speech waveform consists of a sequence of
different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic
signal of the word ldquotordquo cannot capture this time-varying frequency content
In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability
April 22 2023 Veton Keumlpuska 55
Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding
(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to
avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1
Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal
wherex[n]= w[n]x[n]
represents the windowed speech segments as function of the window center at time
n
njenxX ][)(
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech The spectrogram is graphically displayed as
S() = |X()|2
S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal
For each window position one could plot S() A better and more compact representation of time-frequency
display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page
This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms
Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies
Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
[Figure: spectrogram. Axes: Time [s] (0 to 3), Frequency [Hz] (0 to 8000).]
MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above its spectrogram (Frequency [Hz] vs. Time [s], 0 to 8000 Hz) for the utterance "She had your dark suit in greasy wash water all year", bounded by the label h at both ends, with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr).]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Classification of speech sounds can also be done from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
2. The shape of the vocal tract (place and manner of articulation):
   the place of the tongue hump along the oral tract, and
   the degree of constriction of the hump.
   The shape is also determined by possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output
4. The time-varying spectral characteristics revealed through the spectrogram
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.
If the first two perspectives (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two (waveform and spectrogram) are used, this is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is done in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 in the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
Rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are generally not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ across speakers; therefore the vocal tract formants will vary with speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System: the location of the constriction by the tongue or lips determines which sound is produced: back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram: noise-like; energy is concentrated in higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$u[n] = g[n] * p[n]$
where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.
A simplified model of the periodic component of the output at the lips is
$x_g[n] = h[n] * (g[n] * p[n])$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$.
The noise source component is modeled by the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:
$x_q[n] = h_f[n] * (q[n]\, u[n])$
We assume in this simplified model that the results of the two airflow sources add:
$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity
  $x_q[n]$ can be influenced by the back cavity
  Sources of non-linear effects (distributed sources due to traveling vortices)
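The model of Example 3.4 can be sketched numerically. This is an illustration under stated assumptions, not the slides' own implementation: the pitch period, the glottal pulse g[n], and the impulse responses h[n] and hf[n] are hypothetical toy values chosen only to make the structure visible.

```python
import random

def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

P = 40                                   # assumed pitch period in samples
N = 5 * P                                # five pitch periods
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # impulse train p[n]
g = [0.2, 0.6, 1.0, 0.6, 0.2]            # hypothetical glottal pulse g[n]
h = [1.0, 0.7, 0.3]                      # hypothetical vocal tract response h[n]
hf = [1.0, -0.8]                         # hypothetical front-cavity response hf[n]

rng = random.Random(0)                   # fixed seed for repeatability
q = [rng.gauss(0.0, 1.0) for _ in range(N)]           # white noise q[n]

u = convolve(g, p)[:N]                   # u[n] = g[n] * p[n]
xg = convolve(h, u)[:N]                  # periodic component: h[n] * u[n]
modulated = [qv * uv for qv, uv in zip(q, u)]          # q[n] u[n]
xq = convolve(hf, modulated)[:N]         # noise component: hf[n] * (q[n] u[n])
x = [a + b for a, b in zip(xg, xq)]      # the two components add
```

Because the noise is multiplied by u[n], it is present only while the glottal flow is nonzero, so in this toy model the noise component appears in bursts synchronized with the pitch period.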
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have present a periodic source throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$
We have assumed that the two outputs can be linearly combined.
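The generalized convolution of Example 3.5 can be sketched as follows. This is a minimal illustration, not the slides' implementation: both impulse responses are hypothetical one-pole decays, the glottal pulse g[n] is taken to be a unit impulse so that u[n] reduces to the impulse train, and the responses are assumed causal (zero for m < 0).

```python
def tv_filter(h, u, n_out):
    """x[n] = sum over m of h(n, m) * u[n - m], for n = 0..n_out-1.

    h(n, m) is the response at time n to a unit sample applied m samples
    earlier; u[k] is taken as zero outside 0 <= k < len(u).
    """
    x = []
    for n in range(n_out):
        acc = 0.0
        for m in range(n + 1):           # causal response: m >= 0 only
            if n - m < len(u):
                acc += h(n, m) * u[n - m]
        x.append(acc)
    return x

def h_tract(n, m):
    """Hypothetical vocal tract response: decay a(n)**m whose pole a(n)
    drifts from 0.5 to 0.9 as the tract moves toward the vowel shape."""
    a = 0.5 + 0.4 * min(n, 100) / 100.0
    return a ** m

def h_front(n, m):
    """Hypothetical front-cavity response: a fixed, fast decay."""
    return 0.3 ** m

P = 25                                   # assumed pitch period in samples
N = 150
u = [1.0 if n % P == 0 else 0.0 for n in range(N)]    # periodic source
burst = [1.0] + [0.0] * (N - 1)                       # burst idealized as delta[n]

x = [a + b for a, b in zip(tv_filter(h_tract, u, N),
                           tv_filter(h_front, burst, N))]
```

When h(n, m) does not depend on n, `tv_filter` reduces to ordinary convolution, which is one way to sanity-check the routine.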
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (/w/ as in "we" and /y/ as in "you")
  Liquids (/r/ as in "read" and /l/ as in "let")
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference as compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Nasal Sounds
Spectral Shaping
Nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage
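The effect of an anti-resonance on the spectrum can be illustrated with a small sketch. It is not a model of any real vocal tract: the sampling rate, the resonance at 300 Hz, the zero pair at 1200 Hz, and the root radius are all hypothetical values. It compares the magnitude response of an all-pole system with that of the same system after a pair of zeros is added.

```python
import cmath
import math

def root_pair(radius, freq_hz, fs):
    """Coefficients [1, -2 r cos(theta), r^2] of a complex-conjugate root pair
    at angle theta = 2*pi*freq_hz/fs, as a polynomial in z^-1."""
    theta = 2 * math.pi * freq_hz / fs
    return [1.0, -2.0 * radius * math.cos(theta), radius * radius]

def magnitude(num, den, freq_hz, fs):
    """|H| at freq_hz for H(z) = num(z^-1) / den(z^-1)."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    n = sum(c * z_inv ** k for k, c in enumerate(num))
    d = sum(c * z_inv ** k for k, c in enumerate(den))
    return abs(n / d)

FS = 8000                             # assumed sampling rate in Hz
poles = root_pair(0.95, 300, FS)      # hypothetical low-frequency resonance
zeros = root_pair(0.95, 1200, FS)     # hypothetical anti-resonance

for f in (300, 1200, 2000):
    without_zero = magnitude([1.0], poles, f, FS)
    with_zero = magnitude(zeros, poles, f, FS)
    print(f, round(without_zero, 3), round(with_zero, 3))
```

The response with the zero pair shows a pronounced dip near 1200 Hz that is absent from the all-pole response, which is the spectral signature of absorbed energy described above.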
Plosives
Source Generation
In the previous section, the effect of vocal tract shape on sound production was discussed.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  Build-up of pressure behind the palate, and
  Abrupt release of pressure
Source Generation: Plosives, "Drop"
  Figure labels: VOT (voice onset time), aspiration
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but does not completely impede the flow), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by |H(ω)|.
Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin"
  Plosives: created with an impulsive source within the oral tract. Example: "top"
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"
However, these source types are not exclusive to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  "zebra" vs. "sheba" (fricatives)
  "bin" vs. "pin" (plosives)
Categorization of Sound by Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: for the word "to", a single Fourier transform of the entire acoustic signal cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, a sliding (analysis) window concept was introduced.
This window $w[n]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: Hamming window
$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$
The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$
where
$x[n, \tau] = w[n, \tau]\, x[n]$
represents the windowed speech segments as a function of the window position at time $\tau$.
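The window and transform above can be sketched in a few lines. This is a minimal illustration, not the slides' MATLAB code: the sampling rate, window length, hop size, and test tone are assumed values, the function names are illustrative, and a direct DFT is used instead of an FFT to keep the example self-contained.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window samples for 0 <= n <= n_w - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft_frame(x, tau, window, n_freq):
    """Samples of X(omega, tau) at omega = 2*pi*k/n_freq, k = 0..n_freq-1.

    The window starts at sample tau. Phase is measured from the window start
    rather than from n = 0, which leaves the magnitude unchanged.
    """
    segment = [w * x[tau + m] for m, w in enumerate(window)]
    return [sum(v * cmath.exp(-2j * math.pi * k * m / n_freq)
                for m, v in enumerate(segment))
            for k in range(n_freq)]

def spectrogram(x, window, hop, n_freq):
    """Rows of |X(omega, tau)|^2, one row per window position tau."""
    positions = range(0, len(x) - len(window) + 1, hop)
    return [[abs(c) ** 2 for c in stft_frame(x, tau, window, n_freq)]
            for tau in positions]

FS = 8000                                    # assumed sampling rate in Hz
x = [math.sin(2 * math.pi * 1000 * n / FS) for n in range(256)]   # 1 kHz test tone
S = spectrogram(x, hamming(64), hop=32, n_freq=64)
peak_bin = max(range(33), key=lambda k: S[0][k])   # search 0..FS/2 only
print(len(S), "frames; peak at", peak_bin * FS / 64, "Hz")
```

The hop size plays the role of the frame interval: a smaller hop gives more frames per second at a proportionally higher cost.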
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display of the spectrogram places spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$x[n, \tau] = w[n, \tau]\, \{(p[n] * g[n]) * h[n]\}$
$x[n, \tau] = w[n, \tau]\, (p[n] * \hat{h}[n])$
where the glottal waveform over a cycle and the vocal tract impulse response were combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$
where $\hat{H}(\omega) = H(\omega)\, G(\omega)$, $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrogram is in the length of the (analysis) window $w[n]$.
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
  the main lobes of the shifted window Fourier transforms are non-overlapping, and
  the corresponding transform side-lobes are negligible,
the following approximation follows from the equation in the previous slide (Exercise 3.8):
$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": horizontal striations appear in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives closely followed by a voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
This widening causes the neighboring window transforms centered at the harmonics to overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].
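The contrast between the two window lengths can be checked numerically on an idealized source. The sketch below is an illustration under stated assumptions: the signal is a bare impulse train (so ĥ[n] is a unit impulse and the spectral envelope is flat), and the pitch period, window lengths, and DFT size are arbitrary choices.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window samples for 0 <= n <= n_w - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def dft_magnitude(segment, n_freq):
    """|DFT| of segment evaluated at n_freq equally spaced frequencies."""
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * m / n_freq)
                    for m, v in enumerate(segment) if v != 0.0))
            for k in range(n_freq)]

P = 32                                   # assumed pitch period in samples
N_FREQ = 256
x = [1.0 if n % P == 0 else 0.0 for n in range(256)]   # impulse train

# Long window: 256 samples, covering eight pitch periods.
long_seg = [w * v for w, v in zip(hamming(256), x)]
# Short window: 16 samples (half a pitch period), placed around the impulse at n = 32.
short_seg = [w * v for w, v in zip(hamming(16), x[24:40])]

narrow = dft_magnitude(long_seg, N_FREQ)
wide = dft_magnitude(short_seg, N_FREQ)
harmonic_bin = N_FREQ // P               # harmonics fall at multiples of this bin

print(narrow[harmonic_bin], narrow[harmonic_bin // 2])   # on a harmonic vs. between harmonics
print(max(wide) - min(wide))                              # spread across frequency
```

With the long window, the magnitude is large at multiples of `harmonic_bin` and small in between, so the harmonic lines are resolved. With the short window only one impulse falls under the window, and the magnitude is the same at every frequency: the harmonic structure is gone and only the (here flat) envelope remains.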
mnmnhmnumnhnxm
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language Transitional Speech Sounds Diphthongs
Vowel like nature with vibrating vocal folds
Do not have a steady vocal tract configuration They are produced by
varying in time the vocal tract smoothly between two vowel configurations
Characterized by movement from one vowel target to another
hide Y out W boy O new JU
April 22 2023 Veton Keumlpuska 96
Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like
sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)
Glides Greater constriction of oral tract during the
transition and Greater speed of the oral tract movement
compared to diphthongs
April 22 2023 Veton Keumlpuska 97
Figure 328 ndash Liquids amp Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language Transitional Speech Sounds Affricates are the counterpart of
diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that
the affricates have A fricative portion preceded by a complete
constriction of the oral cavity Formed at the same place as for the plosive
Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced
April 22 2023 Veton Keumlpuska 99
Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a
target state or shape often the target is never reached Our speech anatomy cannot move to a desired position
instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and
graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level
Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo
Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes
April 22 2023 Veton Keumlpuska 100
Prosody The Melody of Speech Prosody of a language is defined by the
rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)
These rules are followed to convey different Meaning Stress and Emotion
April 22 2023 Veton Keumlpuska 101
Figure 329 - Prosody
April 22 2023 Veton Keumlpuska 102
Figure 330 ndash Global Coarticulation
Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, it introduces an opening into the nasal passage. If the oral tract is also shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum dominated by the low-frequency formants of the large-volume nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".
Spectral Shaping: "Nose"
Spectral Shaping: "Mouse"
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
1. Broadening of the formant bandwidths of the oral tract, because energy is lost through the nasal passage.
2. Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
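The two nasalization effects can be illustrated numerically. The following is a minimal sketch, not taken from the slides: it assumes a single second-order resonance for one oral formant and a single second-order anti-resonance, and the frequencies (700 Hz, 1200 Hz), pole and zero radii, and sampling rate are illustrative values chosen only for the demonstration.

```python
import cmath
import math

def second_order(freq_hz, radius, fs):
    """Coefficients [1, -2 r cos(theta), r^2] of a conjugate root pair in z^-1."""
    theta = 2 * math.pi * freq_hz / fs
    return [1.0, -2 * radius * math.cos(theta), radius * radius]

def poly_at(coeffs, freq_hz, fs):
    """Evaluate 1 + a1 z^-1 + a2 z^-2 on the unit circle at freq_hz."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return sum(c * z_inv ** k for k, c in enumerate(coeffs))

def magnitude(freq_hz, fs, poles, zeros=None):
    """|H| at freq_hz for H(z) = zeros(z) / poles(z); all-pole if zeros is None."""
    num = abs(poly_at(zeros, freq_hz, fs)) if zeros else 1.0
    return num / abs(poly_at(poles, freq_hz, fs))

fs = 8000                                  # assumed sampling rate in Hz
oral = second_order(700, 0.98, fs)         # assumed narrow oral formant
nasalized = second_order(700, 0.93, fs)    # same formant, smaller radius = broader bandwidth
anti = second_order(1200, 0.97, fs)        # assumed anti-resonance (zero pair)

peak_oral = magnitude(700, fs, oral)
peak_nasalized = magnitude(700, fs, nasalized)
no_dip = magnitude(1200, fs, nasalized)
dip = magnitude(1200, fs, nasalized, anti)

print(f"formant peak: oral {peak_oral:.1f}, nasalized {peak_nasalized:.1f}")
print(f"|H| at 1200 Hz: without zero {no_dip:.3f}, with zero {dip:.3f}")
```

Moving the pole pair away from the unit circle lowers and broadens the formant peak, and the zero pair pulls the magnitude down near its own frequency, which is the qualitative behavior described above.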
Plosives

Source Generation
The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive): pressure builds up behind the closure and is then abruptly released.

Source Generation: Plosives, "Drop"
Figure labels: VOT (voice onset time), Aspiration

Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate without closing the tract completely; the constriction generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal source, spectral shaping also occurs for either of these inputs (impulse or noise source). These inputs have no harmonic structure, so the source spectrum is shaped at all frequencies by |H(ω)|.
Note that the noise spectrum was idealized as flat. In reality these sources have a non-flat spectral shape of their own.

Source Generation: Fricatives, "NASA"
Source Generation
Another class of source is generated within the vocal tract, but it is less well understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives - generated by the friction of moving air against an oral tract constriction. Example: "thin".
Plosives - created with an impulsive source within the oral tract. Example: "top".
Whispers - a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the same subcategories may also exist for voiced sounds. Examples: "zebra" vs "sheba" (fricatives), "bin" vs "pin" (plosives).

Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to spectral characteristics that fluctuate strongly over time.
Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window w[n, τ] is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example (Hamming window): $w[n, \tau] = 0.54 - 0.46 \cos[2\pi (n - \tau)/(N_w - 1)]$ for $0 \le n - \tau \le N_w - 1$.
The window typically does not move one sample at a time; it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] e^{-j\omega n}$
where $x[n, \tau] = w[n, \tau] x[n]$ represents the windowed speech segment as a function of the window position at time τ.
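A minimal sketch of the windowing and short-time transform described on this slide, using only the Python standard library. The test signal (a 1 kHz tone followed by a 3 kHz tone), the sampling rate, the window length, and the frame interval (`hop`) are assumptions made for the illustration, and the window is indexed locally within each frame (equivalent to τ = 0 for that frame).

```python
import cmath
import math

def hamming(Nw):
    """Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (Nw - 1)), 0 <= n <= Nw - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]

def stft(x, Nw, hop, nfft):
    """Return one list of nfft complex DFT values per window position."""
    w = hamming(Nw)
    frames = []
    for start in range(0, len(x) - Nw + 1, hop):
        seg = [w[n] * x[start + n] for n in range(Nw)]   # windowed segment
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / nfft) for n in range(Nw))
            for k in range(nfft)
        ]
        frames.append(spectrum)
    return frames

fs = 8000  # assumed sampling rate in Hz
# Assumed toy signal whose frequency content changes halfway through.
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)] \
  + [math.sin(2 * math.pi * 3000 * n / fs) for n in range(400)]

frames = stft(x, Nw=80, hop=40, nfft=80)   # 10 ms window, 5 ms frame interval

def peak_hz(spectrum):
    """Frequency of the largest-magnitude bin in the lower half of the spectrum."""
    half = spectrum[: len(spectrum) // 2]
    k = max(range(len(half)), key=lambda i: abs(half[i]))
    return k * fs / len(spectrum)

first_peak = peak_hz(frames[0])
last_peak = peak_hz(frames[-1])
print(first_peak, last_peak)
```

A single Fourier transform of the whole signal would show both tones at once; the per-frame peaks recover when each tone occurs.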
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$S(\omega, \tau) = |X(\omega, \tau)|^2$
S(ω, τ) is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position τ one could plot S(ω, τ). A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
Narrowband - gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband - gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.

Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \} = w[n, \tau] (p[n] * \hat{h}[n])$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k) W(\omega - \omega_k, \tau) \right|^2$
where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n, τ].
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
1. the main lobes of the shifted window Fourier transforms do not overlap, and
2. the corresponding transform side lobes are negligible,
the equation on the previous slide leads to the following approximation (Exercise 3.8):
$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2 |W(\omega - \omega_k, \tau)|^2$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved" and appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms of neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window is shorter than a pitch period, it "sees" essentially pieces of the periodically occurring sequence ĥ[n].
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
$S(\omega, \tau) \approx \beta |\hat{H}(\omega)|^2 E[\tau]$
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window,
$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
The wideband spectrogram shows the formants of the vocal tract in frequency.
It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
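The contrast between the two window lengths can be reproduced on a toy signal. The sketch below is an illustration under stated assumptions, not the computation behind Figure 3.15: the "speech" is a bare impulse train with an assumed pitch of 200 Hz at an assumed 8 kHz sampling rate, so that the 20-ms window spans four pitch periods and the 4-ms window spans less than one.

```python
import cmath
import math

def hamming(Nw):
    """Hamming window of length Nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]

def windowed(x, start, Nw):
    """Segment of x under a Hamming window that begins at sample `start`."""
    w = hamming(Nw)
    return [w[n] * x[start + n] for n in range(Nw)]

def dtft_mag(frame, freq_hz, fs):
    """Magnitude of the Fourier transform of one frame at a single frequency."""
    omega = 2 * math.pi * freq_hz / fs
    return abs(sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(frame)))

fs = 8000   # assumed sampling rate in Hz
P = 40      # assumed pitch period in samples (200 Hz pitch)
x = [1.0 if n % P == 0 else 0.0 for n in range(800)]   # idealized periodic source

# Narrowband: 20 ms = 160 samples = 4 pitch periods; harmonics are resolved.
nb = windowed(x, 0, 160)
on_harmonic = dtft_mag(nb, 1000.0, fs)   # 5th harmonic of 200 Hz
between = dtft_mag(nb, 1100.0, fs)       # halfway between two harmonics

# Wideband: 4 ms = 32 samples < 1 pitch period; frame energy varies with time.
energies = [sum(v * v for v in windowed(x, s, 32)) for s in range(0, 200, 8)]

print(f"narrowband: on harmonic {on_harmonic:.3f}, between harmonics {between:.4f}")
print("wideband frame energies:", [round(e, 3) for e in energies[:10]])
```

With the long window the spectrum has a strong peak on each harmonic and nearly nothing between harmonics (horizontal striations), while the short-window frame energy rises and falls once per pitch period (vertical striations).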
Figure 3.15
Figure 3.16
MATLAB figure: spectrogram (Time [s] from 0 to 3, Frequency [Hz] from 0 to 8000).
MATLAB figure: waveform (normalized magnitude) above a spectrogram (Time [s] from 0 to 3, Frequency [Hz] from 0 to 8000) of the utterance "She had your dark suit in greasy wash water all year", with word and phone labels: h, she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr), h.
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract, i.e., the place and manner of articulation: the place of the tongue hump along the oral tract and the degree of constriction at the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme - a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.
Studying speech sounds through the first two descriptors of the previous slide (source and vocal tract shape) is referred to as articulatory phonetics.
Studying speech sounds through the last two descriptors (waveform and spectrogram) is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. Figure 3.17 illustrates this classification.

Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception - uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences between speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump and the cross-section and length of the vocal tract differ, and therefore the vocal tract formants vary with the speaker.

Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram: dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract. The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source: for unvoiced fricatives the vocal folds are relaxed and not vibrating. For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction. The noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction made by the tongue (at the back, center, or front of the oral tract), or by the teeth or lips, determines which sound is produced.
Spectrogram: noise-like, with energy concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$u[n] = g[n] * p[n]$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
$x_g[n] = h[n] * (g[n] * p[n])$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component of the turbulent airflow velocity at the constriction is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the modulated noise in turn excites the front oral cavity, which has impulse response h_f[n]:
$x_q[n] = h_f[n] * (q[n] u[n])$
In the simplified model we assume that the results of the two airflow sources add:
$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
1. u[n] is modified by the oral cavity.
2. x_q[n] can be influenced by the back cavity.
3. Sources of nonlinear effects (distributed sources due to traveling vortices).
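The simplified voiced-fricative model of Example 3.4 can be simulated directly. This is a rough sketch under assumptions that are not in the slides: the glottal pulse g[n] is a short raised-cosine shape, h[n] and h_f[n] are single decaying resonances at illustrative frequencies, q[n] is seeded Gaussian noise, and all lengths and the sampling rate are arbitrary.

```python
import math
import random

def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        if av == 0.0:
            continue
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def resonance(freq_hz, decay, length, fs):
    """Assumed impulse response: an exponentially decaying cosine."""
    return [decay ** n * math.cos(2 * math.pi * freq_hz * n / fs) for n in range(length)]

fs, P, N = 8000, 80, 800   # assumed sampling rate, pitch period, signal length

g = [0.5 * (1 - math.cos(2 * math.pi * n / 19)) for n in range(20)]  # assumed glottal pulse
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]                   # impulse train
u = convolve(g, p)[:N]                                               # u[n] = g[n] * p[n]

rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(N)]          # white noise q[n]
noise_source = [q[n] * u[n] for n in range(N)]       # glottal flow modulates the noise

h = resonance(500, 0.97, 200, fs)     # assumed vocal tract response h[n]
hf = resonance(3000, 0.90, 60, fs)    # assumed front-cavity response h_f[n]

xg = convolve(h, u)[:N]               # periodic component
xq = convolve(hf, noise_source)[:N]   # noise component
x = [a + b for a, b in zip(xg, xq)]   # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by u[n], the noise source is switched off whenever the glottal flow is zero, so in this model the frication noise comes in bursts at the pitch rate.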
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants. The turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of the air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps. While the oral tract is closed, we hear a low-frequency vibration due to the propagation of the vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and the delay between the burst and the voicing of the vowel onset is much shorter.
Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is
$u[n] = g[n] * p[n]$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction, with impulse response h_f[n, m].
h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator:
$x[n] = \sum_{m=-\infty}^{\infty} h[n, m] u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \delta[n - m]$
Here we have assumed that the two outputs can be linearly combined.
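The generalized (time-varying) convolution of Example 3.5 can be sketched as follows. Everything specific here is an assumption made for illustration: the impulse responses are causal and truncated to M samples, h(n, m) is a single resonance whose frequency glides from 300 Hz to 700 Hz over the first 200 samples, h_f(n, m) is a fixed 3 kHz resonance, and the glottal pulse is a raised-cosine shape.

```python
import math

def time_varying_output(u, h, M):
    """x[n] = sum_{m=0}^{M-1} h(n, m) * u[n - m], taking u[n] = 0 for n < 0."""
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n, M - 1) + 1):
            acc += h(n, m) * u[n - m]
        x.append(acc)
    return x

fs, N, P, M = 8000, 400, 80, 120   # assumed rate, length, pitch period, response length

# Periodic source u[n] = g[n] * p[n], built directly from an assumed glottal pulse.
g = [0.5 * (1 - math.cos(2 * math.pi * n / 19)) for n in range(20)]
u = [g[n % P] if n % P < len(g) else 0.0 for n in range(N)]

def h(n, m):
    """Assumed vocal tract response at time n to a unit sample applied m samples earlier."""
    f = 300.0 + 400.0 * min(n / 200.0, 1.0)   # resonance glides toward the vowel
    return 0.97 ** m * math.cos(2 * math.pi * f * m / fs)

def hf(n, m):
    """Assumed front-cavity response (held fixed here for simplicity)."""
    return 0.90 ** m * math.cos(2 * math.pi * 3000.0 * m / fs)

delta = [1.0] + [0.0] * (N - 1)                  # burst idealized as an impulse at n = 0
x_periodic = time_varying_output(u, h, M)        # sum over m of h[n, m] u[n - m]
x_burst = time_varying_output(delta, hf, M)      # sum over m of h_f[n, m] delta[n - m]
x = [a + b for a, b in zip(x_periodic, x_burst)] # linear combination of the two outputs
```

When h(n, m) does not depend on n, `time_varying_output` reduces to ordinary convolution, which is a convenient sanity check; the burst term reduces to h_f[n, n] because δ[n - m] is nonzero only at m = n.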
Elements of a Language: Transitional Speech Sounds
Diphthongs
Have a vowel-like nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Are characterized by movement from one vowel target to another.
Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels comprise two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").
Compared to diphthongs, glides have greater constriction of the oral tract during the transition and greater speed of oral tract movement.

Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced both by where they have been and by where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
Local - the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs "horseshoe", "sweep" vs "seep".
Global - the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/energy (loudness)
Timing (articulation rate or rhythm)
These rules are followed to convey differences in meaning, stress, and emotion.

Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
April 22 2023 Veton Keumlpuska 41
Spectral Shaping Nose
April 22 2023 Veton Keumlpuska 42
Spectral Shaping Mouse
April 22 2023 Veton Keumlpuska 43
Spectral Shaping Because the nasal cavity (unlike the oral tract) is
essentially constant characteristics of nasal sounds may be particularly useful in speaker identification
Velum can be lowered even when the vocal tract is open When this coupling occurs the resulting sound is said to be
nasalized (eg nasalized vowel) There are two dominant effects that characterize
nasalization Broadening of the formant bandwidth of oral tract because
of loss of energy through nasal passage Introduction of anti-resonances (ie zeros in the vocal tract
transfer function) due to the absorption of energy at the resonances of the nasal passage
Plosives
April 22 2023 Veton Keumlpuska 45
Source Generation In previous section the effect of vocal tract
shape in the sound production was discussed
In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure
April 22 2023 Veton Keumlpuska 46
Source Generation Plosives ldquoDroprdquo
VOT
Aspiration
Fricatives
April 22 2023 Veton Keumlpuska 48
Source Generation Another sound source is created when the tongue is
very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)
As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of
inputs The source spectrum is shaped at all frequencies by |H()|
Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape
April 22 2023 Veton Keumlpuska 49
Source Generation Fricatives ldquoNASArdquo
April 22 2023 Veton Keumlpuska 50
Source Generation There is another class of the source type that is
generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices
with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract
Vortex can be thought off as a tiny rotational airflow in the oral tract
There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds
April 22 2023 Veton Keumlpuska 51
Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal
source Unvoiced Speech sounds not generated with periodic
glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the
moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral
tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing
the vocal folds but without oscillations Example ldquoherdquo
However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example
ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives
April 22 2023 Veton Keumlpuska 52
Categorization of Sound By Source
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
Example: for the word “to”, a single Fourier transform of the entire acoustic signal cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window $w[n]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example, the Hamming window: $w[n, \tau] = 0.54 - 0.46\cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$
The window typically does not move one sample at a time but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
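The Hamming window and the windowed Fourier sum on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the slides: the names `hamming` and `stft`, the `hop` parameter (the frame interval in samples), and the choice of evaluating each transform on `num_bins` uniformly spaced frequencies are assumptions made for the example. The phase of each frame is referenced to the start of that frame, which leaves the magnitude unchanged.

```python
import cmath
import math


def hamming(num_samples):
    """Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/(N_w - 1)), 0 <= n <= N_w - 1."""
    if num_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def stft(x, window, hop, num_bins):
    """Short-time Fourier transform as a list of frames.

    Frame t covers x[t*hop : t*hop + len(window)], so the window advances
    `hop` samples per frame instead of one sample at a time. Each frame
    holds `num_bins` complex values at frequencies 2*pi*k/num_bins.
    """
    n_w = len(window)
    frames = []
    for start in range(0, len(x) - n_w + 1, hop):
        # Windowed segment: w[n] * x[n] for the current window position.
        segment = [window[n] * x[start + n] for n in range(n_w)]
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / num_bins)
                for n in range(n_w))
            for k in range(num_bins)
        ]
        frames.append(spectrum)
    return frames
```

Squaring the magnitude of every entry, `abs(value) ** 2`, gives one column of a spectrogram per frame. A direct sum like this is slow for long signals; an FFT routine would normally replace the inner loop.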
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the “energy density” of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$ versus frequency. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
Narrowband – gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband – gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n]$.
Narrowband Spectrogram
Uses a “long” window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide gives the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
Harmonic lines are “resolved”, appearing as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech
Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
This widening causes the neighboring window transforms, centered on neighboring harmonics, to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
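The energy term $E[\tau]$ is what makes a wideband spectrogram fluctuate in time. A minimal sketch, assuming a rectangular window for simplicity and a synthetic voiced-like signal (an impulse train passed through a one-pole decay), shows the energy under a short window rising at each pitch pulse and falling between pulses. The function names, the pitch period of 80 samples, the decay factor of 0.9, and the window length of 32 samples are all assumptions for illustration.

```python
def short_time_energy(x, n_w, hop=1):
    """Energy of x under a rectangular window of n_w samples at each position."""
    return [sum(v * v for v in x[start:start + n_w])
            for start in range(0, len(x) - n_w + 1, hop)]


def decaying_pulse_train(length, period, decay):
    """Impulse train with the given period filtered by y[n] = decay*y[n-1] + p[n]."""
    x = [0.0] * length
    for n in range(length):
        previous = x[n - 1] if n > 0 else 0.0
        pulse = 1.0 if n % period == 0 else 0.0
        x[n] = decay * previous + pulse
    return x


PITCH_PERIOD = 80  # samples (assumed)
signal = decaying_pulse_train(400, PITCH_PERIOD, 0.9)

# The window (32 samples) is shorter than one pitch period (80 samples).
energy = short_time_energy(signal, n_w=32)
```

Here `energy` is large when the window starts on a pulse (positions 0, 80, 160, ...) and several orders of magnitude smaller half a period later, which corresponds to the vertical striations described for the wideband case.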
Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
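To relate the 20-ms and 4-ms windows to the rules of thumb given earlier (at least two pitch periods for narrowband, less than one pitch period for wideband), the following sketch converts window durations to samples and classifies them against a pitch period. The 16-kHz sampling rate and the 10-ms pitch period (100 Hz) are assumed values, not taken from the slides, and the main-lobe width uses the common textbook approximation of $8\pi/N_w$ radians for a Hamming window.

```python
def window_length_samples(duration_ms, sample_rate_hz):
    """Number of samples in a window of the given duration."""
    return int(round(duration_ms * 1e-3 * sample_rate_hz))


def hamming_main_lobe_hz(duration_ms):
    """Approximate null-to-null main-lobe width of a Hamming window in Hz.

    8*pi/N_w radians per sample equals 4 / (window duration in seconds) Hz.
    """
    return 4.0 / (duration_ms * 1e-3)


def spectrogram_type(duration_ms, pitch_period_ms):
    """Classify a window by the rules of thumb stated on the slides."""
    if duration_ms >= 2.0 * pitch_period_ms:
        return "narrowband"
    if duration_ms < pitch_period_ms:
        return "wideband"
    return "intermediate"


SAMPLE_RATE_HZ = 16000   # assumed
PITCH_PERIOD_MS = 10.0   # assumed 100-Hz pitch

for duration_ms in (20.0, 4.0):
    print(duration_ms,
          window_length_samples(duration_ms, SAMPLE_RATE_HZ),
          hamming_main_lobe_hz(duration_ms),
          spectrogram_type(duration_ms, PITCH_PERIOD_MS))
```

Under these assumptions the 4-ms window has a main lobe about five times wider than the 20-ms window, so it trades frequency detail for time detail.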
Figure 3.15
Figure 3.16
[MATLAB figure: spectrogram, frequency [Hz] (0 to 8000) versus time [s] (0 to 3)]
[MATLAB figure: waveform (normalized magnitude versus time [s]) and spectrogram (frequency [Hz] versus time [s]) of the utterance “She had your dark suit in greasy wash water all year”, annotated with word and phone labels]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme – a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
One broad classification of English speech sounds is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open
Tongue position and height: front, central, or back along the palate
Constriction: partial or complete
Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features give about 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the “click” language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not normally allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: “butter”, “but”, and “to”, where the /t/ in each word is somewhat different.
Motor theory of perception – uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
Pitch: not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability in vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram: dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source: the vocal folds are relaxed and not vibrating for unvoiced fricatives; for voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram: noise-like; energy is concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
In a simplified model of the voiced fricative, the contribution of the periodic signal $u[n]$ to the output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract.
The noise source component is the turbulent airflow velocity at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, with impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\, u[n])$$
Example 3.4
In this simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$
See Exercise 3.10 for the spectral characteristics of $x[n]$.
Issues that have been ignored:
$u[n]$ is modified by the oral cavity.
$x_q[n]$ can be influenced by the back cavity.
Sources of nonlinear effects (e.g., distributed sources due to traveling vortices).
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a “noisy” spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”. After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator and must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract, with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, denoted by $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
where we have assumed that the two outputs can be linearly combined.
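The generalized convolution in Example 3.5 can be written as a double loop in which the impulse response is a function of both the output time $n$ and the lag $m$. The sketch below is an illustration only: representing $h[n, m]$ as a Python callable `h(n, m)`, assuming causality (lags from 0 up to `max_lag`), and the function names are assumptions made for the example.

```python
def time_varying_output(h, u, length, max_lag):
    """x[n] = sum over m of h(n, m) * u[n - m] for a causal, finite-lag system.

    h(n, m) is the response at time n to a unit sample applied m samples
    earlier, at time n - m. The input u must hold at least `length` samples.
    """
    x = []
    for n in range(length):
        total = 0.0
        for m in range(min(n, max_lag) + 1):
            total += h(n, m) * u[n - m]
        x.append(total)
    return x


def voiced_plosive(h, h_f, u, length, max_lag):
    """Sum of the periodic component and the burst component.

    The burst is idealized as a unit impulse at n = 0 exciting the
    time-varying front cavity h_f(n, m).
    """
    burst = [1.0] + [0.0] * (length - 1)
    periodic_part = time_varying_output(h, u, length, max_lag)
    burst_part = time_varying_output(h_f, burst, length, max_lag)
    return [a + b for a, b in zip(periodic_part, burst_part)]
```

When `h(n, m)` does not depend on `n`, the loop reduces to ordinary convolution, which is a convenient sanity check before trying a response that changes over time.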
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like nature with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: /Y/ as in “hide”, /W/ as in “out”, /O/ as in “boy”, /JU/ as in “new”
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds, glides (/w/ as in “we” and /y/ as in “you”) and liquids (/r/ as in “read” and /l/ as in “let”).
Glides: greater constriction of the oral tract during the transition, and greater speed of the oral tract movement, compared to diphthongs.
Figure 3.28 – Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in “chew” (unvoiced); /J/ as in “just” (voiced)
Coarticulation
The vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
Local – the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: “horse” vs. “horseshoe”, “sweep” vs. “seep”.
Global – the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/energy (loudness)
Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 – Global Coarticulation
Spectral Shaping Mouse
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage.
Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
Plosives
Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive): a build-up of pressure behind the palate followed by an abrupt release of pressure.
Source Generation: Plosives “Drop”
VOT
Aspiration
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of
inputs The source spectrum is shaped at all frequencies by |H()|
Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape
April 22 2023 Veton Keumlpuska 49
Source Generation Fricatives ldquoNASArdquo
April 22 2023 Veton Keumlpuska 50
Source Generation There is another class of the source type that is
generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices
with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract
Vortex can be thought off as a tiny rotational airflow in the oral tract
There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds
April 22 2023 Veton Keumlpuska 51
Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal
source Unvoiced Speech sounds not generated with periodic
glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the
moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral
tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing
the vocal folds but without oscillations Example ldquoherdquo
However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example
ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives
April 22 2023 Veton Keumlpuska 52
Categorization of Sound By Source
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 54
Spectrographic Analysis of Speech Speech waveform consists of a sequence of
different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic
signal of the word ldquotordquo cannot capture this time-varying frequency content
In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability
April 22 2023 Veton Keumlpuska 55
Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding
(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to
avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1
Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal
wherex[n]= w[n]x[n]
represents the windowed speech segments as function of the window center at time
n
njenxX ][)(
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech The spectrogram is graphically displayed as
S() = |X()|2
S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal
For each window position one could plot S() A better and more compact representation of time-frequency
display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page
This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms
Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies
Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by:
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking
  Time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features, along with the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic
  Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ among speakers => the vocal tract formants vary with speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System: the location of the constriction determines which sound is produced. The constriction can be made by the tongue at the back, center, or front of the oral tract, as well as at the teeth or lips.
Spectrogram: noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  $u[n] = g[n] * p[n]$
  where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Simplified model of the voiced-fricative output at the lips due to the periodic source:
  $x_g[n] = h[n] * (g[n] * p[n])$
  where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:
  $x_q[n] = h_f[n] * (q[n]\,u[n])$
Example 3.4
In the simplified model we assume that the results of the two airflow sources add:
  $x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
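The additive voiced-fricative model above can be sketched numerically. The following Python/NumPy code is only an illustration under stated assumptions: the function name voiced_fricative, the glottal pulse g, the impulse responses h and hf, and the pitch period are hypothetical choices, not values taken from the slides.

```python
import numpy as np

def voiced_fricative(n_samples, P, g, h, hf, rng):
    """Return (x, xg, xq, u) for the simplified voiced-fricative model."""
    p = np.zeros(n_samples)
    p[::P] = 1.0                               # impulse train p[n], period P
    u = np.convolve(p, g)[:n_samples]          # u[n] = g[n] * p[n]
    xg = np.convolve(u, h)[:n_samples]         # x_g[n] = h[n] * u[n]
    q = rng.standard_normal(n_samples)         # white noise q[n]
    xq = np.convolve(q * u, hf)[:n_samples]    # x_q[n] = h_f[n] * (q[n] u[n])
    return xg + xq, xg, xq, u

# Hypothetical parameters, chosen only for illustration.
n = np.arange(64)
g = np.hanning(20)                               # assumed glottal pulse shape
h = 0.95 ** n * np.cos(2 * np.pi * 0.05 * n)     # assumed low-frequency resonance
hf = 0.80 ** n * np.cos(2 * np.pi * 0.35 * n)    # assumed front-cavity resonance
x, xg, xq, u = voiced_fricative(800, 80, g, h, hf, np.random.default_rng(0))
```

Because the noise is multiplied by u[n] before it is filtered, the noise component is present only while the glottis is open, so in this sketch the noise bursts recur at the pitch rate.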
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum.
  Voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the onset of voicing of the vowel.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is
  $u[n] = g[n] * p[n]$
  where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to the following steady vowel.
  This implies that the vocal tract output cannot be obtained with the convolution operator.
  The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
  $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:
  $x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$
We have assumed that the two outputs can be linearly combined.
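The generalized convolution of the voiced-plosive model can be sketched as a direct double loop. This is a minimal sketch, assuming causal responses with a finite memory of M samples stored as an (N, M) array; the function names and the example responses below are hypothetical and serve only to exercise the formula.

```python
import numpy as np

def time_varying_filter(u, h):
    """x[n] = sum_m h[n, m] u[n - m] for causal h with finite memory.

    h has shape (N, M): h[n, m] is the response at time n to a unit
    sample applied m samples earlier (at time n - m). u has length N.
    """
    N, M = h.shape
    x = np.zeros(N)
    for n in range(N):
        for m in range(min(M, n + 1)):    # only samples with n - m >= 0 exist
            x[n] += h[n, m] * u[n - m]
    return x

def voiced_plosive(u, h, hf):
    """Linear combination of the periodic-source and burst-source outputs."""
    delta = np.zeros(len(u))
    delta[0] = 1.0                         # idealized burst at n = 0
    return time_varying_filter(u, h) + time_varying_filter(delta, hf)

# Hypothetical example: a response whose decay changes slowly with time n.
N, M = 200, 40
n_idx = np.arange(N)[:, None]
m_idx = np.arange(M)[None, :]
h = (0.80 + 0.10 * n_idx / N) ** m_idx     # assumed slowly varying vocal tract
hf = 0.5 ** m_idx * np.ones((N, 1))        # assumed front-cavity response
u = np.zeros(N)
u[::50] = 1.0                              # stand-in for g[n] * p[n]
x = voiced_plosive(u, h, hf)
```

When h[n, m] does not depend on n, the loop reduces to ordinary convolution, which gives a simple check of the implementation. With the impulse at n = 0, the burst term collapses to h_f[n, n].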
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new"
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (/w/ as in "we" and /y/ as in "you")
  Liquids (/r/ as in "read" and /l/ as in "let")
Glides, compared to diphthongs, have
  Greater constriction of the oral tract during the transition
  Greater speed of the oral tract movement
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced)
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep"
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open.
  When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
  Two dominant effects characterize nasalization:
    Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage
    Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage
Plosives
Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  Build-up of pressure behind the palate
  Abrupt release of pressure
Source Generation: Plosives, "Drop"
VOT
Aspiration
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow); the constriction generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source
Unvoiced: speech sounds not generated with a periodic glottal source
There are a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin"
  Plosives: created with an impulsive source within the oral tract. Example: "top"
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"
However, these sound categories do not relate exclusively to an unvoiced source: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
  "zebra" vs. "sheba" (fricatives)
  "bin" vs. "pin" (plosives)
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to".
  A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example, the Hamming window: $w[n, \tau] = 0.54 - 0.46\cos[2\pi(n - \tau)/(N_w - 1)]$ for $0 \le n \le N_w - 1$
  The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
  $X(\omega, \tau) = \sum_{n} x[n, \tau]\, e^{-j\omega n}$
where
  $x[n, \tau] = w[n, \tau]\, x[n]$
represents the windowed speech segments as a function of the window center at time $\tau$.
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
  $S(\omega, \tau) = |X(\omega, \tau)|^2$
$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
  For each window position $\tau$ one could plot $S(\omega, \tau)$.
  A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.
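The sliding-window analysis and the spectrogram definition can be sketched in a few lines of Python/NumPy. This is only an illustrative sketch: the function name spectrogram, the sampling rate, the test tone, and the window, hop, and FFT sizes are assumptions rather than settings from the slides, and frames are indexed by their starting sample instead of their center.

```python
import numpy as np

def spectrogram(x, win_len, hop, n_fft):
    """Return S with shape (n_frames, n_fft // 2 + 1)."""
    w = np.hamming(win_len)            # 0.54 - 0.46 cos(2 pi n / (Nw - 1))
    starts = np.arange(0, len(x) - win_len + 1, hop)
    frames = np.stack([w * x[s:s + win_len] for s in starts])
    # The FFT treats the first sample of each frame as time zero, so X differs
    # from X(omega, tau) by a linear-phase factor; the magnitude is unaffected.
    X = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.abs(X) ** 2              # S = |X|^2

fs = 8000                              # assumed sampling rate in Hz
t = np.arange(800) / fs
x = np.cos(2 * np.pi * 1000 * t)       # 1 kHz test tone
S = spectrogram(x, win_len=200, hop=100, n_fft=256)
peak_hz = S.argmax(axis=1) * fs / 256  # strongest frequency in each frame
```

The hop parameter plays the role of the frame interval: a smaller hop gives a higher frame rate at the cost of more transforms.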
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:
  $x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$
  $x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
  $S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$
where $\tilde{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side lobes are negligible,
  the following approximation follows from the equation in the previous slide (Exercise 3.8):
  $S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
  $S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
  $E[\tau] = \sum_{n} |x[n, \tau]|^2$
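The energy term under a short sliding window can be computed directly, which shows why it fluctuates at the pitch rate. The sketch below is an illustration under assumptions: the function name short_time_energy, the decaying response h_hat, the pitch period, and the window length are hypothetical, and the window position is indexed by its starting sample.

```python
import numpy as np

def short_time_energy(x, win_len):
    """E[t] = sum_n |w[n] x[t + n]|^2 for every window start t."""
    w = np.hamming(win_len)
    n_frames = len(x) - win_len + 1
    return np.array([np.sum((w * x[t:t + win_len]) ** 2)
                     for t in range(n_frames)])

P = 50                                   # assumed pitch period in samples
p = np.zeros(400)
p[::P] = 1.0                             # impulse train p[n]
h_hat = 0.9 ** np.arange(P)              # assumed decaying response, one period long
x = np.convolve(p, h_hat)[:400]          # steady-state voiced waveform
E = short_time_energy(x, win_len=20)     # window shorter than one pitch period
```

In this sketch E repeats every P samples and is far larger when the window sits on the start of a pitch period than when it sits on the decayed tail.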
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
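The effect of the two window lengths on harmonic resolution can be checked on an idealized periodic excitation. This is a hedged sketch: the 10 kHz sampling rate, the 200 Hz pitch, the frame start, and the helper names dtft_mag and harmonic_contrast are assumptions made for illustration; only the 20-ms and 4-ms Hamming window lengths come from the slide.

```python
import numpy as np

def dtft_mag(frame, f_hz, fs):
    """|sum_n frame[n] exp(-j 2 pi f n / fs)| at a single frequency."""
    n = np.arange(len(frame))
    return np.abs(np.sum(frame * np.exp(-2j * np.pi * f_hz * n / fs)))

def harmonic_contrast(x, start, win_ms, f0, fs):
    """Spectrum at the first harmonic divided by the spectrum midway
    between the first and second harmonics."""
    N = int(round(win_ms * 1e-3 * fs))
    frame = np.hamming(N) * x[start:start + N]
    on = dtft_mag(frame, f0, fs)           # at the harmonic
    off = dtft_mag(frame, 1.5 * f0, fs)    # halfway to the next harmonic
    return on / off

fs, f0 = 10000, 200                        # assumed sampling rate and pitch (Hz)
P = fs // f0                               # pitch period: 50 samples (5 ms)
x = np.zeros(1000)
x[::P] = 1.0                               # idealized periodic excitation
narrow = harmonic_contrast(x, 30, 20.0, f0, fs)   # 20 ms spans four pitch periods
wide = harmonic_contrast(x, 30, 4.0, f0, fs)      # 4 ms spans less than one period
```

With the long window the spectrum drops sharply between harmonics, so the harmonic lines are resolved. With the short window the frame holds a single impulse, the magnitude is flat across frequency, and the contrast is 1.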
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000]
[Figure: waveform (normalized magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000) for the utterance "She had your dark suit in greasy wash water all year", with phone and word labels: h# | sh iy (she) | hv ae dcl jh (had) | ax-h (your) | dcl d aa r kcl k (dark) | s ux tcl (suit) | ax-h n (in) | gcl g r iy s ix (greasy) | w aa sh (wash) | w ao dx axr (water) | ao l (all) | y ih axr (year) | h#]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
2. The shape of the vocal tract (place and manner of articulation)
  The place of the tongue hump along the oral tract and the degree of constriction of the hump
  The shape is also determined by a possible connection to the nasal passage by way of the velum
3. The time-domain waveform, which gives the pressure change with time at the lips output
4. The time-varying spectral characteristics revealed through the spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept
introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]
h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m
The output then can be written using generalization of the convolution operator
We have assumed that two outputs can be linearly combined
mnmnhmnumnhnxm
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language Transitional Speech Sounds Diphthongs
Vowel like nature with vibrating vocal folds
Do not have a steady vocal tract configuration They are produced by
varying in time the vocal tract smoothly between two vowel configurations
Characterized by movement from one vowel target to another
hide Y out W boy O new JU
April 22 2023 Veton Keumlpuska 96
Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like
sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)
Glides Greater constriction of oral tract during the
transition and Greater speed of the oral tract movement
compared to diphthongs
April 22 2023 Veton Keumlpuska 97
Figure 328 ndash Liquids amp Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language Transitional Speech Sounds Affricates are the counterpart of
diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that
the affricates have A fricative portion preceded by a complete
constriction of the oral cavity Formed at the same place as for the plosive
Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced
April 22 2023 Veton Keumlpuska 99
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/Energy (loudness)
Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Plosives
Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
Build-up of pressure behind the palate, and
Abrupt release of pressure.
Source Generation: Plosives, "Drop" (figure labels: VOT, Aspiration)
Fricatives
Source Generation
Another sound source is created when the tongue is very close to the palate (but does not completely impede the airflow). The constriction generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
There is no harmonic structure with these types of inputs: the source spectrum is shaped at all frequencies by |H(ω)|.
Note that the noise spectrum was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
Plosives: created with an impulsive source within the oral tract. Example: "top".
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these source types are not exclusive to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
"zebra" vs. "sheba" (fricatives)
"bin" vs. "pin" (plosives)
Categorization of Sound by Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window w[n, τ] is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window
$$w[n,\tau] = 0.54 - 0.46\cos\!\left[\frac{2\pi(n-\tau)}{N_w-1}\right] \quad \text{for } 0 \le n \le N_w-1$$
The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\,e^{-j\omega n}$$
where
$$x[n,\tau] = w[n,\tau]\,x[n]$$
represents the windowed speech segments as a function of the window center at time τ.
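The windowing and transform steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the slides: the function names `hamming` and `stft_frame`, the test cosine, and the choice to let the window cover samples τ through τ + N_w - 1 are all assumptions made for the example.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w, tapered at both ends."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft_frame(x, tau, n_w, n_freq):
    """Return X(omega_k, tau) for omega_k = 2*pi*k/n_freq, k = 0..n_freq-1.

    Assumption of this sketch: the window covers samples tau..tau+n_w-1,
    and samples outside the signal are treated as zero.
    """
    w = hamming(n_w)
    frame = []
    for k in range(n_freq):
        omega = 2 * math.pi * k / n_freq
        acc = 0j
        for i in range(n_w):
            n = tau + i
            if 0 <= n < len(x):
                # windowed sample times the complex exponential e^{-j*omega*n}
                acc += w[i] * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame


# Test signal: a cosine that falls exactly on bin 8 of a 64-point frequency grid.
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
frame = stft_frame(x, tau=32, n_w=64, n_freq=64)

# Strongest bin in the lower half of the spectrum.
peak_bin = max(range(33), key=lambda k: abs(frame[k]))
print(peak_bin)
```

Sliding `tau` forward by a fixed frame interval and collecting the frames gives the full short-time transform; the direct double loop is used here only for clarity, and an FFT would normally replace it.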
Spectrographic Analysis of Speech
The spectrogram is the graphical display of
$$S(\omega,\tau) = |X(\omega,\tau)|^2$$
S(ω, τ) is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position τ one could plot S(ω, τ). A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train
$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$$
The windowed speech waveform is then
$$x[n,\tau] = w[n,\tau]\,\{(p[n]*g[n])*h[n]\} = w[n,\tau]\,(p[n]*\hat{h}[n])$$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n]*h[n]$.
From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2$$
where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $2\pi/P$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n, τ].
Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
the main lobes of the shifted window Fourier transforms are non-overlapping, and
the corresponding transform side lobes are negligible,
the equation on the previous slide yields the following approximation (Exercise 3.8):
$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k,\tau)|^2$$
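The step from the exact expression to the narrowband approximation can be made explicit. The following is a sketch of the reasoning under the two stated conditions, not necessarily the derivation that Exercise 3.8 has in mind. It writes τ for the window position, ω_k = 2πk/P for the harmonic frequencies, and uses a second summation index l introduced only for this expansion.

```latex
\begin{align*}
S(\omega,\tau)
  &= \frac{1}{P^2}\Bigl|\sum_{k} \hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\Bigr|^2 \\
  &= \frac{1}{P^2}\sum_{k}\sum_{l} \hat{H}(\omega_k)\,\hat{H}^{*}(\omega_l)\,
     W(\omega-\omega_k,\tau)\,W^{*}(\omega-\omega_l,\tau) \\
  &\approx \frac{1}{P^2}\sum_{k} |\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k,\tau)|^2
\end{align*}
```

The last line drops the cross terms with k ≠ l. When the main lobes do not overlap and the side lobes are negligible, the two shifted window transforms are never both significant at the same ω, so their product is approximately zero and only the k = l terms remain.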
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega,\tau) \approx \beta\,|\hat{H}(\omega)|^2\,E[\tau]$$
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window,
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2$$
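The energy term can be illustrated with a toy waveform. The sketch below is an assumption-laden example rather than the slides' own method: it uses a rectangular window instead of a tapered one, and the pitch period `P`, window length `N_W`, and decaying-cosine pulse shape are hypothetical values chosen so that the window is shorter than one pitch period.

```python
import math


def short_time_energy(x, tau, n_w):
    """E[tau]: sum of squared samples in the n_w samples starting at tau.

    Assumption of this sketch: a rectangular window, zero outside the signal.
    """
    return sum(x[n] ** 2 for n in range(tau, tau + n_w) if 0 <= n < len(x))


P = 80     # pitch period in samples (hypothetical value)
N_W = 20   # window shorter than one pitch period

# Toy steady-state voiced waveform: a decaying resonance restarted every P samples.
x = [math.exp(-0.1 * (n % P)) * math.cos(0.6 * (n % P)) for n in range(4 * P)]

# Energy under the window as it slides one sample at a time.
energy = [short_time_energy(x, tau, N_W) for tau in range(3 * P)]
```

In this example `energy` is large when the window sits on the start of a pitch period and small when it sits in the decayed tail, and the pattern repeats every `P` samples. That periodic fluctuation of the energy factor is what produces the pitch-rate variation over time in a wideband display.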
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
Figure 3.15
Figure 3.16
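The contrast between the two window lengths can be checked numerically on a synthetic periodic signal. This is a small sketch under stated assumptions: the pitch period `P`, the decaying pulse used as the repeating waveform, and the helper `frame_magnitude` are hypothetical choices for illustration, and the window lengths are expressed in pitch periods rather than in the 20-ms and 4-ms values used for the figure.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def frame_magnitude(x, tau, n_w, omega):
    """|X(omega, tau)| for a Hamming window covering samples tau..tau+n_w-1."""
    w = hamming(n_w)
    acc = sum(
        w[i] * x[tau + i] * cmath.exp(-1j * omega * (tau + i)) for i in range(n_w)
    )
    return abs(acc)


P = 40                                        # pitch period in samples (hypothetical)
x = [0.8 ** (n % P) for n in range(8 * P)]    # decaying pulse repeated every P samples

harmonic = 2 * math.pi * 2 / P                # second harmonic of the pitch
between = 2 * math.pi * 2.5 / P               # halfway between harmonics 2 and 3

# Narrowband: the window spans four pitch periods.
nb_ratio = frame_magnitude(x, 0, 4 * P, harmonic) / frame_magnitude(x, 0, 4 * P, between)

# Wideband: the window spans half a pitch period.
wb_ratio = frame_magnitude(x, 0, P // 2, harmonic) / frame_magnitude(x, 0, P // 2, between)

print(nb_ratio, wb_ratio)
```

With the long window the magnitude at the harmonic is far larger than the magnitude between harmonics, which is the resolved harmonic line structure. With the short window the two magnitudes are comparable, because the frame only sees a piece of one pulse and the spectrum follows the smooth envelope.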
MATLAB
Figure (MATLAB output): speech waveform (normalized magnitude vs. time [s], 0 to 3 s) and spectrogram (frequency [Hz], 0 to 8000 Hz, vs. time [s], 0 to 3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.
Phone labels by word: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr); the utterance is bounded by "h" labels at both ends.
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.
If the first two of the four perspectives listed on the previous slide are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two are used to study speech sounds, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open
Tongue position and height: front, central, or back along the palate
Constriction: partial or complete
Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
11 in Polynesian
141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
System:
Each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram:
The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform:
Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
Articulatory differences among speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract.
The vocal tract formants will therefore vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source:
Quasi-periodic airflow puffs from the vibrating vocal folds.
System:
The velum is lowered and the air flows mainly through the nasal cavity.
Because the oral tract is constricted, the sound is radiated at the nostrils.
Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
Is dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
For unvoiced fricatives the vocal folds are relaxed and not vibrating.
For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction.
Noise is generated by turbulent airflow at some point of constriction along the oral tract.
The constriction is narrower than with vowels.
System:
The location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram:
Noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the output at the lips due to the periodic signal u[n] is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where h[n] is the impulse response of a linear time-invariant vocal tract.
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
Example 3.4 (cont.)
In the simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
u[n] is modified by the oral cavity.
x_q[n] can be influenced by the back cavity.
Sources of nonlinear effects (distributed sources due to traveling vortices).
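The additive model of Example 3.4 can be sketched directly from its two convolutions. The code below is an illustrative assumption, not the slides' implementation: the glottal pulse `g`, the two impulse responses `h` and `h_f`, the Gaussian white noise, and the function names are all hypothetical stand-ins chosen only to make the structure of the model concrete.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, P, n_periods, h, h_f, noise_gain, seed=0):
    """Return (x, u) with x[n] = h*u + h_f*(q u) and u = g*p, q white noise."""
    rng = random.Random(seed)
    p = [1.0 if n % P == 0 else 0.0 for n in range(n_periods * P)]  # impulse train
    u = convolve(g, p)[: len(p)]                                    # periodic glottal flow
    q = [noise_gain * rng.gauss(0.0, 1.0) for _ in u]               # white noise source
    x_g = convolve(h, u)                                            # periodic component
    x_q = convolve(h_f, [qn * un for qn, un in zip(q, u)])          # noise modulated by u
    n_out = min(len(x_g), len(x_q))
    return [x_g[n] + x_q[n] for n in range(n_out)], u


# Hypothetical ingredients for the example.
g = [0.0, 0.6, 1.0, 0.4]                                   # glottal flow over one cycle
h = [0.9 ** n * math.cos(0.3 * n) for n in range(30)]      # vocal tract impulse response
h_f = [(-0.5) ** n for n in range(10)]                     # front cavity impulse response

x, u = voiced_fricative(g, P=20, n_periods=5, h=h, h_f=h_f, noise_gain=0.1)
```

Because the noise is multiplied by u[n] before it reaches the front cavity, the noise component is present only while the glottal flow is nonzero. Setting `noise_gain` to zero reduces the output to the purely periodic component h[n] * u[n].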
Elements of a Language: Fricatives
Spectrogram:
Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
An unvoiced fricative contains only noise.
A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System:
The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
While the oral tract is closed we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel.
This implies that the vocal tract output cannot be obtained with the convolution operator. It must instead be computed with the time-varying impulse response concept introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response h_f[n, m].
h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
Assuming that the two outputs can be linearly combined, the output can be written with a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
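The generalized convolution for the voiced plosive can be written out as a direct double loop. This is a minimal sketch under stated assumptions: both responses are taken to be causal, the source u[n] is zero before n = 0, and the particular drifting-pole response `h_drift` and fixed front cavity response `h_front` are hypothetical examples, not models from the slides.

```python
def time_varying_output(h, h_f, u, n_samples):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m h_f(n, m) delta[n-m].

    h and h_f are callables h(n, m): the response at time n to a unit sample
    applied m samples earlier. Assumes causal responses and u[n] = 0 for n < 0.
    """
    x = []
    for n in range(n_samples):
        periodic = sum(
            h(n, m) * u[n - m] for m in range(n + 1) if n - m < len(u)
        )
        burst = h_f(n, n)   # delta[n-m] is nonzero only when m = n
        x.append(periodic + burst)
    return x


def h_drift(n, m):
    """Hypothetical vocal tract: a decay rate that drifts from 0.5 toward 0.9."""
    return (0.5 + 0.01 * min(n, 40)) ** m


def h_front(n, m):
    """Hypothetical front cavity response, here not changing with n."""
    return 0.3 ** m


P = 10                                                  # pitch period (hypothetical)
u = [1.0 if n % P == 0 else 0.0 for n in range(40)]     # idealized periodic source
x = time_varying_output(h_drift, h_front, u, 40)
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is a convenient sanity check for any implementation of the time-varying form.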
April 22 2023 Veton Keumlpuska 45
Source Generation In previous section the effect of vocal tract
shape in the sound production was discussed
In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure
April 22 2023 Veton Keumlpuska 46
Source Generation Plosives ldquoDroprdquo
VOT
Aspiration
Fricatives
April 22 2023 Veton Keumlpuska 48
Source Generation Another sound source is created when the tongue is
very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)
As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of
inputs The source spectrum is shaped at all frequencies by |H()|
Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape
April 22 2023 Veton Keumlpuska 49
Source Generation Fricatives ldquoNASArdquo
April 22 2023 Veton Keumlpuska 50
Source Generation There is another class of the source type that is
generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices
with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract
Vortex can be thought off as a tiny rotational airflow in the oral tract
There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds
April 22 2023 Veton Keumlpuska 51
Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal
source Unvoiced Speech sounds not generated with periodic
glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the
moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral
tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing
the vocal folds but without oscillations Example ldquoherdquo
However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example
ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives
April 22 2023 Veton Keumlpuska 52
Categorization of Sound By Source
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 54
Spectrographic Analysis of Speech Speech waveform consists of a sequence of
different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic
signal of the word ldquotordquo cannot capture this time-varying frequency content
In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability
April 22 2023 Veton Keumlpuska 55
Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding
(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to
avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1
Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal
wherex[n]= w[n]x[n]
represents the windowed speech segments as function of the window center at time
n
njenxX ][)(
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech The spectrogram is graphically displayed as
S() = |X()|2
S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal
For each window position one could plot S() A better and more compact representation of time-frequency
display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page
This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms
Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies
Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
Figure 3.15
Figure 3.16
MATLAB
[Figure: SPECTROGRAM; Time [s] (0 to 3) versus Frequency [Hz] (0 to 8000)]
MATLAB
[Figure: waveform, Normalized Magnitude (-1.5 to 2) versus Time [s] (0 to 3), above its SPECTROGRAM, Frequency [Hz] (0 to 8000), for the sentence "She had your dark suit in greasy wash water all year"]
Word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr), with h at the start and end of the utterance.
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Classification of speech sounds can also be done from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   the place of the tongue hump along the oral tract, and
   the degree of constriction of the hump.
   The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 in the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
Rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: the /t/ in "butter", "but", and "to" is somewhat different in each word.
Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations:
    the place and degree of constriction of the tongue hump, and
    the cross-section and length of the vocal tract.
  Therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  It is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract.
  Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System:
  The location of the constriction by the tongue (at the back, center, or front of the oral tract), as well as by the teeth or lips, determines which sound is produced.
Spectrogram:
  Noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
\[ u[n] = g[n] * p[n] \]
where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and $*$ denotes convolution.
A simplified model of the voiced fricative output at the lips due to this component is
\[ x_g[n] = h[n] * (g[n] * p[n]) \]
where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].
The noise source component is the turbulent airflow velocity at the constriction, denoted q[n] and assumed to be white noise. The glottal flow u[n] modulates q[n], and the product excites the front oral cavity, which has impulse response $h_f[n]$:
\[ x_q[n] = h_f[n] * (q[n]\, u[n]) \]
Example 3.4 (cont.)
In this simplified model we assume that the results of the two airflow sources add:
\[ x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n]) \]
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
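The additive voiced-fricative model of Example 3.4 can be sketched numerically. This is a minimal illustration under stated assumptions: the pulse shape `g`, the impulse responses `h` and `h_f`, the pitch period, and the seeded Gaussian noise `q` are made-up values, and `convolve` and `voiced_fricative` are hypothetical helper names.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def voiced_fricative(h, h_f, g, q, pitch_period, n_samples):
    """x[n] = h * u + h_f * (q u), with u = g * p and p an impulse train."""
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    u = convolve(g, p)[:n_samples]                 # periodic glottal flow
    x_g = convolve(h, u)[:n_samples]               # periodic component
    modulated = [qn * un for qn, un in zip(q, u)]  # noise modulated by u[n]
    x_q = convolve(h_f, modulated)[:n_samples]     # noise component
    return [a + b for a, b in zip(x_g, x_q)]


# Illustrative (made-up) pulse shape and impulse responses.
g = [0.0, 0.5, 1.0, 0.5]
h = [1.0, 0.6, 0.36]
h_f = [1.0, -0.5]
N, P = 64, 16
rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(N)]

x = voiced_fricative(h, h_f, g, q, P, N)
x_no_noise = voiced_fricative(h, h_f, g, [0.0] * N, P, N)
```

Setting the noise to zero leaves only the periodic term h[n] * u[n]. The noise term is nonzero only where the glottal flow u[n] is nonzero, which is the modulation described in the example.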
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (Waveform)
Elements of a Language: Plosives (Spectrogram)
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is
\[ u[n] = g[n] * p[n] \]
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to a following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
h[n, m] and $h_f[n, m]$ represent the responses at time n to a unit sample applied m samples earlier, at time n - m.
Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:
\[ x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m] \]
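The generalized convolution of Example 3.5 can be evaluated directly for causal signals. The sketch below is a toy illustration: the drifting one-pole response `h_moving`, the front-cavity response `h_front`, the pulse shape, and all function names are assumptions chosen for this example, not values from the slides.

```python
def glottal_input(g, pitch_period, n_samples):
    """u[n] = g[n] * p[n]: one pulse shape g repeated every pitch period."""
    u = [0.0] * n_samples
    for start in range(0, n_samples, pitch_period):
        for i, gv in enumerate(g):
            if start + i < n_samples:
                u[start + i] += gv
    return u


def time_varying_output(h, h_f, u, n_samples):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m h_f(n, m) delta[n-m], for causal inputs."""
    x = []
    for n in range(n_samples):
        periodic = sum(h(n, m) * u[n - m] for m in range(n + 1))
        burst = h_f(n, n)   # delta[n-m] is nonzero only at m == n
        x.append(periodic + burst)
    return x


N, P = 48, 12
u = glottal_input([0.0, 0.5, 1.0, 0.5], P, N)


def h_moving(n, m):
    """Toy vocal tract: a one-pole response whose pole drifts from 0.5 to 0.8."""
    pole = 0.5 + 0.3 * n / (N - 1)
    return pole ** m


def h_fixed(n, m):
    """Time-invariant special case: the response does not depend on n."""
    return 0.5 ** m


def h_front(n, m):
    """Toy front-cavity response, held fixed here for simplicity."""
    return 0.3 ** m


x = time_varying_output(h_moving, h_front, u, N)
x_fixed = time_varying_output(h_fixed, h_front, u, N)
```

When the response does not depend on n, as with `h_fixed`, the sum reduces to ordinary convolution. This gives a convenient sanity check for the time-varying form.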
Elements of a Language: Transitional Speech Sounds
Diphthongs:
  Vowel-like in nature, with vibrating vocal folds.
  They do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  They are characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds:
  Glides (/w/ as in "we" and /y/ as in "you"), and
  Liquids (/r/ as in "read" and /l/ as in "let").
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Source Generation: Plosives, "Drop"
[Figure annotations: VOT, Aspiration, Fricatives]
Source Generation
Another sound source is created when the tongue is very close to the palate, so that the airflow is constricted but not completely impeded. The constriction generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise spectrum was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives, "NASA"
Source Generation
There is another class of source that is generated within the vocal tract; it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these subcategories are not exclusively unvoiced: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
  "zebra" vs. "sheba" (fricatives)
  "bin" vs. "pin" (plosives)
Categorization of Sound by Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  This window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: Hamming window
  \[ w[n, \tau] = 0.54 - 0.46 \cos\left[ \frac{2\pi (n - \tau)}{N_w - 1} \right] \quad \text{for } 0 \le n \le N_w - 1 \]
  The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
\[ X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n} \]
where
\[ x[n, \tau] = w[n, \tau]\, x[n] \]
represents the windowed speech segments as a function of the window center at time $\tau$.
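A minimal sketch of the sliding-window analysis follows, using only the Python standard library. It makes several assumptions that are not in the slides: the window position tau is taken as the window start rather than its center (only the squared magnitude is kept, so the phase reference does not matter), the frequency axis is sampled at `n_freq` points between 0 and pi, and the test signal is a single sinusoid. The function names are illustrative.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window 0.54 - 0.46 cos(2 pi n / (N_w - 1)) for n = 0..N_w-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def spectrogram(x, n_w, hop, n_freq):
    """|X(omega, tau)|^2 for window starts tau = 0, hop, 2*hop, ... and omega in [0, pi]."""
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        segment = [w[n] * x[tau + n] for n in range(n_w)]   # windowed speech segment
        row = []
        for k in range(n_freq):
            omega = math.pi * k / (n_freq - 1)
            X = sum(s * cmath.exp(-1j * omega * n) for n, s in enumerate(segment))
            row.append(abs(X) ** 2)
        frames.append(row)
    return frames


# Test signal (assumed): a sinusoid at omega = 0.25 * pi.
x = [math.cos(0.25 * math.pi * n) for n in range(128)]
S = spectrogram(x, n_w=32, hop=16, n_freq=17)
peak_bins = [row.index(max(row)) for row in S]
```

Each row of `S` is one frame, so the hop size sets the frame rate. For this stationary sinusoid every frame peaks at the same frequency index; a speech signal would instead show the peak pattern changing from frame to frame.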
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
\[ S(\omega, \tau) = |X(\omega, \tau)|^2 \]
$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
\[ x[n, \tau] = w[n, \tau]\, \{ (p[n] * g[n]) * h[n] \} \]
\[ x[n, \tau] = w[n, \tau]\, ( p[n] * \tilde{h}[n] ) \]
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
\[ S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2 \]
where
\[ \tilde{H}(\omega) = H(\omega)\, G(\omega) \]
and $\omega_k = \frac{2\pi}{P} k$, with $\frac{2\pi}{P}$ the fundamental frequency.
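The harmonic frequencies omega_k = 2 pi k / P come from the impulse train p[n]: its transform is nonzero only at multiples of the fundamental. The sketch below checks this with a direct DFT of a few periods of an impulse train. The sizes `N` and `P` are arbitrary assumptions, and the finite DFT is only a stand-in for the infinite-length transform used on the slide.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform."""
    n_total = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_total) for n in range(n_total))
            for k in range(n_total)]


N, P = 40, 8   # five pitch periods of eight samples each (assumed)
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]
magnitudes = [abs(v) for v in dft(p)]

# DFT bins that carry energy; bin k corresponds to omega = 2 pi k / N.
harmonic_bins = [k for k, m in enumerate(magnitudes) if m > 1e-9]
```

The nonzero bins fall at multiples of N / P, that is, at omega = 2 pi k / P. Windowing then places a copy of the window transform W at each of these lines, weighted by the combined glottal and vocal tract response evaluated at omega_k.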
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side lobes are negligible,
  the following approximation follows from the equation in the previous slide (Exercise 3.8):
\[ S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2 \]
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n,m], while the burst excites a time-varying front cavity beyond the constriction with impulse response hf[n,m]. Here h[n,m] and hf[n,m] represent the responses at time n to a unit sample applied m samples earlier, at time n-m.
Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
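The generalized convolution of Example 3.5 can be evaluated directly as a double loop. The following is a minimal sketch, not the author's implementation: the function name time_varying_output, the responses h_demo and hf_demo, and all numeric values are hypothetical, chosen only to illustrate an impulse response whose shape changes with n. For simplicity g[n] is taken to be a unit sample, so u[n] reduces to the impulse train p[n].

```python
def time_varying_output(u, burst, h, hf):
    """Evaluate x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) burst[n-m].

    h(n, m) and hf(n, m) are callables giving the response at time n to a
    unit sample applied m samples earlier. Inputs are assumed to be zero
    before n = 0, so the sums run over m = 0..n only.
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):
            acc += h(n, m) * u[n - m] + hf(n, m) * burst[n - m]
        x.append(acc)
    return x


# Hypothetical sources: impulse train with pitch period P, burst at n = 0.
P, N = 8, 32
u = [1.0 if n % P == 0 else 0.0 for n in range(N)]
burst = [1.0] + [0.0] * (N - 1)


def h_demo(n, m):
    # One-pole response whose decay slows as the tract moves toward the vowel
    # (assumed shape, for illustration only).
    a = 0.5 + 0.4 * min(n, 20) / 20.0
    return a ** m


def hf_demo(n, m):
    # Front-cavity response that fades out after the burst (assumed shape).
    return (0.6 ** m) * (0.8 ** n)


x = time_varying_output(u, burst, h_demo, hf_demo)
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is a convenient sanity check for any implementation of the time-varying form.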
Elements of a Language: Transitional Speech Sounds
Diphthongs
- Vowel-like in nature, with vibrating vocal folds.
- Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
- Characterized by movement from one vowel target to another.
- Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
- Glides (w as in "we" and y as in "you")
- Liquids (r as in "read" and l as in "let")
Glides, compared to diphthongs, have
- greater constriction of the oral tract during the transition, and
- greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).
Coarticulation
- Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
- Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
- Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
- Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
- Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
- Intonation (change in pitch)
- Amplitude/energy (loudness)
- Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Fricatives
Source Generation
- Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow); the constriction generates turbulence and thus a noise source (e.g., fricatives).
- As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
- There is no harmonic structure with these types of input; the source spectrum is shaped at all frequencies by |H(ω)|.
- Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.
Source Generation: Fricatives - "NASA"
Source Generation
- There is another class of source generated within the vocal tract; it is less well understood than the noisy and impulsive sources at oral tract constrictions.
- This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.
Categorization of Sound By Source
- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  - Fricatives: sounds generated by the friction of moving air against an oral tract constriction. Example: "thin".
  - Plosives: created with an impulsive source within the oral tract. Example: "top".
  - Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
- However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples:
  - Fricatives: "zebra" vs. "sheba"
  - Plosives: "bin" vs. "pin"
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
- A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
- Example, the word "to": a single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
- In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
- The window w[n,τ] is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
- Example, the Hamming window:
$$w[n,\tau] = 0.54 - 0.46\cos\left[\frac{2\pi (n-\tau)}{N_w - 1}\right] \quad \text{for } 0 \le n-\tau \le N_w - 1$$
- The window does not necessarily move one sample at a time; it typically moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n}$$
where
$$x[n,\tau] = w[n,\tau]\, x[n]$$
represents the windowed speech segments as a function of the window position at time τ.
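The Hamming window and the short-time Fourier transform defined above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not code from the slides: the function names hamming, stft_frame, and stft, the uniform frequency grid of num_bins points, and the cosine test signal are all hypothetical. The sums are evaluated directly for clarity rather than with an FFT.

```python
import cmath
import math


def hamming(n, tau, Nw):
    """w[n, tau] = 0.54 - 0.46 cos(2 pi (n - tau) / (Nw - 1)) on 0 <= n - tau <= Nw - 1."""
    k = n - tau
    if 0 <= k <= Nw - 1:
        return 0.54 - 0.46 * math.cos(2.0 * math.pi * k / (Nw - 1))
    return 0.0


def stft_frame(x, tau, Nw, num_bins):
    """X(omega_b, tau) sampled at omega_b = 2 pi b / num_bins, b = 0..num_bins-1."""
    frame = []
    for b in range(num_bins):
        omega = 2.0 * math.pi * b / num_bins
        acc = 0j
        # Only samples under the window contribute to the sum over n.
        for n in range(tau, min(tau + Nw, len(x))):
            acc += hamming(n, tau, Nw) * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame


def stft(x, Nw, hop, num_bins):
    """One frame per window position tau = 0, hop, 2*hop, ... (hop is the frame interval)."""
    return [stft_frame(x, tau, Nw, num_bins)
            for tau in range(0, len(x) - Nw + 1, hop)]


# Hypothetical test signal: a cosine centered on bin 8 of a 64-point grid.
x = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(128)]
frames = stft(x, Nw=64, hop=32, num_bins=64)
peak_bin = max(range(33), key=lambda b: abs(frames[0][b]))
```

The hop parameter plays the role of the frame interval: a smaller hop gives more frames per second at a proportionally higher computational cost.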
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega,\tau) = |X(\omega,\tau)|^2$$
- S(ω,τ) is a 2-D (two-dimensional) representation of the "energy density" of the signal.
- For each window position τ one could plot S(ω,τ) as a function of ω.
- A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
For voiced speech, the waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n-kP]:
$$x[n,\tau] = w[n,\tau]\,\{(p[n] * g[n]) * h[n]\} = w[n,\tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega-\omega_k,\tau)\right|^2$$
where
$$\hat{H}(\omega) = H(\omega)\,G(\omega) \quad \text{and} \quad \omega_k = \frac{2\pi}{P}k,$$
and 2π/P is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n,τ].
Narrowband Spectrogram
- Uses a "long" window, typically with a duration of at least two pitch periods.
- Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding side lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):
$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} \left|\hat{H}(\omega_k)\right|^2 \left|W(\omega-\omega_k,\tau)\right|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
- The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
- From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].
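The effect of window length on harmonic structure can be checked numerically on an ideal impulse train. The sketch below is illustrative only: the helper names (hamming, windowed_magnitude, harmonic_contrast), the pitch period of 40 samples, and the two window lengths are assumptions, not values from the slides. It compares the windowed spectrum magnitude at a harmonic with the magnitude midway between two harmonics.

```python
import cmath
import math


def hamming(k, Nw):
    """Hamming window value at offset k = n - tau, zero outside 0..Nw-1."""
    if 0 <= k <= Nw - 1:
        return 0.54 - 0.46 * math.cos(2.0 * math.pi * k / (Nw - 1))
    return 0.0


def windowed_magnitude(x, tau, Nw, omega):
    """|X(omega, tau)| for the segment of x under a Hamming window starting at tau."""
    acc = 0j
    for n in range(tau, tau + Nw):
        acc += hamming(n - tau, Nw) * x[n] * cmath.exp(-1j * omega * n)
    return abs(acc)


def harmonic_contrast(x, tau, Nw, P, k):
    """Ratio of |X| at harmonic omega_k = 2 pi k / P to |X| midway to the next harmonic."""
    on = windowed_magnitude(x, tau, Nw, 2.0 * math.pi * k / P)
    between = windowed_magnitude(x, tau, Nw, 2.0 * math.pi * (k + 0.5) / P)
    return on / between


P = 40                                                  # hypothetical pitch period in samples
x = [1.0 if n % P == 0 else 0.0 for n in range(200)]    # impulse train p[n]

# Long window spanning three pitch periods: harmonics are resolved.
narrow = harmonic_contrast(x, tau=0, Nw=3 * P, P=P, k=3)
# Short window spanning less than one pitch period: it sees a single impulse.
wide = harmonic_contrast(x, tau=30, Nw=24, P=P, k=3)
```

With the long window the contrast is large, which corresponds to the horizontal striations of the narrowband spectrogram. With the short window only one impulse falls under the window, so the magnitude spectrum is flat and the contrast is 1: the harmonic lines are smeared away.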
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega,\tau) \approx \beta\,\left|\hat{H}(\omega)\right|^2 E[\tau]$$
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} \left|x[n,\tau]\right|^2$$
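The energy term E[τ] explains why a short window produces a pattern that repeats every pitch period. A minimal sketch follows, with several simplifying assumptions that are not in the slides: a rectangular window replaces the tapered one, the decaying sequence 0.8**n stands in for ĥ[n], and the pitch period and window length are hypothetical.

```python
def sliding_energy(x, Nw):
    """E[tau] = sum of x[n]^2 over a rectangular window of length Nw starting at tau."""
    return [sum(v * v for v in x[tau:tau + Nw]) for tau in range(len(x) - Nw + 1)]


P = 40                      # hypothetical pitch period in samples
N = 6 * P
# Periodic repetition of a decaying response 0.8**n (stand-in for h-hat[n]).
x = [sum(0.8 ** (n - k * P) for k in range(n // P + 1)) for n in range(N)]

E = sliding_energy(x, Nw=10)        # window shorter than one pitch period
peak = max(E[2 * P:3 * P])          # window sitting on the start of a pitch period
trough = min(E[2 * P:3 * P])        # window sitting on the decayed tail
```

In this toy signal the energy under the window rises sharply whenever the window reaches the onset of a new pitch period and falls off over the decaying tail, so E[τ] is periodic in τ with period P.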
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
- The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
Figure 3.15
Figure 3.16
MATLAB
[Figure: SPECTROGRAM, Frequency [Hz] (0 to 8000) versus Time [s] (0 to 3)]
MATLAB
[Figure: waveform (Normalized Magnitude) and SPECTROGRAM (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels]
MATLAB
[Figure: SPECTROGRAM, Frequency [Hz] (0 to 8000) versus Time [s] (0 to 3)]
MATLAB
[Figure: waveform (Normalized Magnitude) and SPECTROGRAM (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
- Phoneme: a fundamental distinctive unit of a language.
- To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
- Different languages contain different phoneme sets.
  - Syllables contain one or more phonemes.
  - Words are formed from one or more syllables.
  - Phrases are concatenations of words.
- If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
- If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
- Vocal fold state: vibrating or open
- Tongue position and height: front, central, or back along the palate
- Constriction: partial or complete
- Velum state: nasal sound or not a nasal sound
Elements of a Language
- In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
- The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are normally not allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
- The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: in "butter", "but", and "to", the t in each word is somewhat different.
- Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
- Source: quasi-periodic.
  - Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
- In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  - Articulatory differences between speakers are one cause of allophonic variations.
  - The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, vary with the speaker; therefore the vocal tract formants also vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram:
  - Dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  - These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  - Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
- System: the location of the constriction formed by the tongue, teeth, or lips determines which sound is produced: back, center, or front of the oral tract, as well as the teeth or lips.
- Spectrogram: noise-like; energy is concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.
Voiced fricative, simplified model of the periodic component of the output at the lips:
$$x_g[n] = h[n] * (g[n] * p[n])$$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise component models the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the modulated noise in turn excites the front oral cavity, which has impulse response hf[n]:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
Example 3.4 (cont.)
In the simplified model, we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
- u[n] is modified by the oral cavity.
- xq[n] can be influenced by the back cavity.
- Sources of nonlinear effects (distributed sources due to traveling vortices).
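The additive two-source model of Example 3.4 can be simulated in a few lines. This is a sketch under stated assumptions rather than the author's implementation: the function names convolve and voiced_fricative, the glottal pulse g, the impulse responses h and hf, the pitch period, and the seeded Gaussian generator standing in for white noise are all hypothetical choices.

```python
import random


def convolve(a, b):
    """Causal convolution y[n] = sum_m b[m] a[n-m], truncated to len(a) samples."""
    return [sum(b[m] * a[n - m] for m in range(min(n + 1, len(b))))
            for n in range(len(a))]


def voiced_fricative(g, P, N, h, hf, rng):
    """Return (u, x_g, x_q, x) for the additive two-source model of Example 3.4."""
    p = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # impulse train p[n]
    u = convolve(p, g)                                   # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(N)]          # white noise q[n]
    modulated = [qn * un for qn, un in zip(q, u)]        # q[n] u[n]
    x_g = convolve(u, h)                                 # h[n] * u[n]
    x_q = convolve(modulated, hf)                        # hf[n] * (q[n] u[n])
    x = [a + b for a, b in zip(x_g, x_q)]                # x[n] = x_g[n] + x_q[n]
    return u, x_g, x_q, x


# Hypothetical values: a short glottal pulse and two decaying impulse responses.
g = [0.0, 0.5, 1.0, 0.5]
h = [0.9 ** n for n in range(30)]
hf = [(-0.7) ** n for n in range(10)]
u, x_g, x_q, x = voiced_fricative(g, P=20, N=100, h=h, hf=hf, rng=random.Random(0))
```

Because the noise is multiplied by u[n] before filtering, the noise source is switched off whenever the glottal flow is zero, so the noise component of the output arrives in bursts at the pitch rate.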
Elements of a Language: Fricatives
- Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
- Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
April 22 2023 Veton Keumlpuska 48
Source Generation Another sound source is created when the tongue is
very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)
As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of
inputs The source spectrum is shaped at all frequencies by |H()|
Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape
April 22 2023 Veton Keumlpuska 49
Source Generation Fricatives ldquoNASArdquo
April 22 2023 Veton Keumlpuska 50
Source Generation There is another class of the source type that is
generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices
with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract
Vortex can be thought off as a tiny rotational airflow in the oral tract
There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds
April 22 2023 Veton Keumlpuska 51
Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal
source Unvoiced Speech sounds not generated with periodic
glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the
moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral
tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing
the vocal folds but without oscillations Example ldquoherdquo
However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example
ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives
April 22 2023 Veton Keumlpuska 52
Categorization of Sound By Source
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 54
Spectrographic Analysis of Speech Speech waveform consists of a sequence of
different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic
signal of the word ldquotordquo cannot capture this time-varying frequency content
In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability
April 22 2023 Veton Keumlpuska 55
Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding
(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to
avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1
Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal
wherex[n]= w[n]x[n]
represents the windowed speech segments as function of the window center at time
n
njenxX ][)(
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech The spectrogram is graphically displayed as
S() = |X()|2
S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal
For each window position one could plot S() A better and more compact representation of time-frequency
display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page
This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms
Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies
Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window is shorter than a pitch period, it "sees" essentially pieces of the periodically occurring sequence ĥ[n].
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window at time $\tau$:
$$E[\tau] = \sum_{n} |x[n, \tau]|^2$$
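The role of the sliding-window energy can be illustrated with a small sketch. It assumes a rectangular window (rather than a tapered one) and a toy "voiced" waveform made of a decaying pulse repeated every P samples; both are illustrative assumptions, not part of the slides. With a window shorter than the pitch period the energy rises and falls once per period, while a window spanning exactly two pitch periods gives a flat energy.

```python
def short_time_energy(x, window_length):
    """E[tau] = sum_n |x[n, tau]|^2 with a rectangular window starting at tau."""
    return [
        sum(sample * sample for sample in x[tau:tau + window_length])
        for tau in range(len(x) - window_length + 1)
    ]


# Assumed toy waveform: a decaying pulse repeated every P samples.
P = 20
x = [0.8 ** (n % P) for n in range(6 * P)]

# Short window (less than one pitch period): energy fluctuates with period P.
energy_short = short_time_energy(x, window_length=5)

# Long window (exactly two pitch periods): every window position covers the
# same set of sample values, so the energy is essentially constant.
energy_long = short_time_energy(x, window_length=2 * P)

ripple_short = max(energy_short) - min(energy_short)
ripple_long = max(energy_long) - min(energy_long)
```

The periodic rise and fall of `energy_short` is the time-domain counterpart of the pitch-period fluctuation that a short analysis window picks up.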
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
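To make the 20-ms and 4-ms choices concrete, here is a rough worked calculation. It assumes a 16 kHz sampling rate and a 200 Hz pitch, neither of which is stated on the slides, and it uses the common textbook approximation that a Hamming window of Nw samples has a null-to-null main-lobe width of about 8π/Nw radians per sample (that is, 4 divided by the window duration, in Hz). The classification thresholds follow the rules of thumb given in these slides.

```python
def window_length_samples(duration_s, sample_rate_hz):
    """Number of samples spanned by a window of the given duration."""
    return round(duration_s * sample_rate_hz)


def hamming_mainlobe_width_hz(duration_s):
    """Approximate null-to-null main-lobe width of a Hamming window.

    Uses the approximation 8*pi/Nw rad/sample, which is 4/duration in Hz.
    """
    return 4.0 / duration_s


def window_kind(duration_s, pitch_period_s):
    """Classify a window by the rule of thumb stated on the slides."""
    if duration_s >= 2 * pitch_period_s:
        return "narrowband"
    if duration_s < pitch_period_s:
        return "wideband"
    return "intermediate"


# Assumed values, not taken from the slides: 16 kHz sampling, 200 Hz pitch.
sample_rate_hz = 16000
pitch_hz = 200.0
pitch_period_s = 1.0 / pitch_hz

for duration_s in (0.020, 0.004):
    print(
        f"{duration_s * 1000:.0f} ms window: "
        f"{window_length_samples(duration_s, sample_rate_hz)} samples, "
        f"main lobe ~{hamming_mainlobe_width_hz(duration_s):.0f} Hz, "
        f"harmonic spacing {pitch_hz:.0f} Hz, "
        f"{window_kind(duration_s, pitch_period_s)}"
    )
```

Under these assumptions the 20-ms window has a main lobe no wider than the harmonic spacing, while the 4-ms window has a main lobe that spans several harmonics, which is consistent with the resolved and smeared harmonic structure described above.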
Figure 3.15
Figure 3.16
MATLAB
Figure: spectrogram of the utterance "She had your dark suit in greasy wash water all year" (horizontal axis: Time [s], 0 to 3; vertical axis: Frequency [Hz], 0 to 8000).
MATLAB
Figure: waveform (Normalized Magnitude vs. Time [s]) and spectrogram (Frequency [Hz] vs. Time [s]) of the same utterance, annotated with word and phone labels: h | she: sh iy | had: hv ae dcl jh | your: ax-h | dark: dcl d aa r kcl k | suit: s ux tcl | in: ax-h n | greasy: gcl g r iy s ix | wash: w aa sh | water: w ao dx axr | all: ao l | year: y ih axr | h
MATLAB
Figure: a second spectrogram of the utterance "She had your dark suit in greasy wash water all year" (Time [s], 0 to 3; Frequency [Hz], 0 to 8000).
MATLAB
Figure: a second waveform and spectrogram pair for the same utterance, with the same axes and the same word and phone labels.
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction at the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.
If the first two of the four classification perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open
Tongue position and height: front, central, or back along the palate
Constriction: partial or complete
Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic. Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
Articulatory differences among speakers are one cause of allophonic variations.
The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker; therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils.
Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
These resonances absorb acoustic energy and thus act as anti-resonances of the vocal tract.
The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction, formed by the tongue at the back, center, or front of the oral tract, or at the teeth or lips, determines which sound is produced.
Spectrogram: noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates this noise q[n], and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
Example 3.4 (cont.)
In this simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
u[n] is modified by the oral cavity.
$x_q[n]$ can be influenced by the back cavity.
Sources of non-linear effects (distributed sources due to traveling vortices).
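The simplified voiced fricative model of Example 3.4 can be sketched numerically as follows. This is only an illustration under stated assumptions: the raised-cosine glottal pulse, the two decaying-cosine impulse responses, the pitch period, and the Gaussian noise are all hypothetical choices made to produce a runnable example, and none of them come from the slides.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def voiced_fricative(num_samples, P, g, h, h_f, rng):
    """x[n] = h[n] * u[n] + h_f[n] * (q[n] u[n]), with u[n] = g[n] * p[n]."""
    p = [1.0 if n % P == 0 else 0.0 for n in range(num_samples)]
    u = convolve(g, p)[:num_samples]                       # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]  # white noise source
    x_g = convolve(h, u)[:num_samples]                     # periodic component
    modulated_noise = [q_n * u_n for q_n, u_n in zip(q, u)]
    x_q = convolve(h_f, modulated_noise)[:num_samples]     # noise component
    x = [a + b for a, b in zip(x_g, x_q)]
    return u, x_g, x_q, x


# Hypothetical parameters chosen only for illustration.
num_samples = 200
P = 40
g = [0.5 * (1 - math.cos(2 * math.pi * n / 16)) for n in range(16)]
h = [0.9 ** n * math.cos(0.3 * n) for n in range(30)]
h_f = [0.5 ** n * math.cos(2.5 * n) for n in range(10)]

u, x_g, x_q, x = voiced_fricative(num_samples, P, g, h, h_f, random.Random(0))
```

Because the noise is multiplied by u[n] before filtering, the noise excitation is zero wherever the glottal flow is zero, so in this sketch the noisy component appears in bursts synchronized with the pitch period.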
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (Waveform)
Elements of a Language: Plosives (Spectrogram)
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
h[n, m] and $h_f[n, m]$ represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:
$$x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$$
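The generalized convolution of Example 3.5 can be sketched as a direct double loop. The version below is a minimal illustration, not the author's implementation: the time-varying responses `h` and `h_f` are hypothetical one-pole decays, the glottal pulse is simplified to a unit impulse so that u[n] is just the impulse train, and the sum over m is truncated to causal lags up to `max_lag`.

```python
def time_varying_response(u, burst, h, h_f, max_lag):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) burst[n - m].

    h(n, m) and h_f(n, m) are callables giving the response at time n to a
    unit sample applied m samples earlier. Only lags 0..max_lag are summed.
    """
    x = []
    for n in range(len(u)):
        total = 0.0
        for m in range(min(n, max_lag) + 1):
            total += h(n, m) * u[n - m] + h_f(n, m) * burst[n - m]
        x.append(total)
    return x


# Hypothetical setup: impulse-train glottal source and a burst at n = 0.
num_samples = 60
P = 15
u = [1.0 if n % P == 0 else 0.0 for n in range(num_samples)]
burst = [1.0 if n == 0 else 0.0 for n in range(num_samples)]


def h(n, m):
    # Hypothetical vocal tract: the decay drifts from 0.5 to 0.9 as n advances,
    # mimicking a changing tract shape from the burst into the vowel.
    decay = 0.5 + 0.4 * n / (num_samples - 1)
    return decay ** m


def h_f(n, m):
    # Hypothetical front-cavity response, kept fixed in time for simplicity.
    return (-0.6) ** m


x = time_varying_response(u, burst, h, h_f, max_lag=num_samples - 1)
```

When `h(n, m)` does not depend on n, the first sum reduces to an ordinary convolution, which is a convenient sanity check for the loop.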
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
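The idea of moving smoothly from one vowel target to another can be sketched as an interpolation between two sets of formant frequencies. The (F1, F2) target values below are hypothetical numbers picked only to illustrate the trajectory, not measurements of any particular diphthong, and the raised-cosine easing is just one plausible choice of smooth transition.

```python
import math


def glide(start, end, num_frames):
    """Raised-cosine interpolation between two formant targets.

    start and end are tuples of formant frequencies in Hz; the result is a
    list of num_frames tuples moving smoothly from start to end.
    """
    track = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        weight = 0.5 * (1 - math.cos(math.pi * t))  # 0 at the start, 1 at the end
        track.append(tuple(s + weight * (e - s) for s, e in zip(start, end)))
    return track


# Hypothetical (F1, F2) targets in Hz, for illustration only.
first_target = (700.0, 1200.0)
second_target = (300.0, 2200.0)

track = glide(first_target, second_target, num_frames=11)
```

Each entry of `track` could then drive a time-varying vocal tract model, one frame at a time.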
Elements of a Language: Transitional Speech Sounds
Semi-vowels
Two categories of vowel-like sounds: glides (w as in "we" and y as in "you") and liquids (r as in "read" and l as in "let").
Glides: compared to diphthongs, glides have a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Source Generation: Fricatives ("NASA")
Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
Plosives: created with an impulsive source within the oral tract. Example: "top".
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, the fricative and plosive categories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window, w[n], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example (Hamming window): $w[n, \tau] = 0.54 - 0.46\cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n \le N_w - 1$
The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n} x[n, \tau]\, e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segment as a function of the window center at time $\tau$.
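A short-time Fourier transform with a Hamming window and a frame interval can be sketched in plain Python as below. This is a minimal illustration rather than a reference implementation: the function names, the uniform frequency grid of `n_freq` points, and the toy cosine input are assumptions made for the example. Time inside each frame is measured from the start of the window, which changes only the phase of each frame, not its magnitude.

```python
import cmath
import math


def hamming(Nw):
    """Hamming window of length Nw (Nw must be at least 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]


def stft(x, Nw, hop, n_freq):
    """Return frames[i][k], the transform of the i-th windowed segment.

    Frame i starts at sample i * hop (the frame interval); bin k corresponds
    to omega_k = 2*pi*k/n_freq. n_freq should be at least Nw.
    """
    w = hamming(Nw)
    frames = []
    for tau in range(0, len(x) - Nw + 1, hop):
        segment = [w[n] * x[tau + n] for n in range(Nw)]
        frames.append([
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / n_freq)
                for n in range(Nw))
            for k in range(n_freq)
        ])
    return frames


# Toy input (assumed): a cosine whose frequency falls on bin 8 of a 64-point grid.
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft(x, Nw=64, hop=32, n_freq=64)

# Locate the strongest bin in the first frame over the non-negative frequencies.
magnitudes = [abs(value) for value in frames[0][:33]]
peak_bin = magnitudes.index(max(magnitudes))
```

Choosing a larger `hop` lowers the frame rate and the cost, at the price of a coarser view of how the spectrum changes over time.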
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
S(ω, τ) is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position τ, one could plot S(ω, τ) as a function of ω.
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.
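One way to see why the spectrogram is called an "energy density" is Parseval's relation for the discrete Fourier transform: summing one spectrogram column over a uniform frequency grid and dividing by the number of grid points gives back the energy of the windowed segment. The sketch below checks this numerically. The test signal, the window length, and the grid size are arbitrary choices for the illustration.

```python
import cmath
import math


def spectrogram_column(segment, n_freq):
    """S(omega_k) = |X(omega_k)|^2 on the grid omega_k = 2*pi*k/n_freq.

    n_freq must be at least len(segment) for Parseval's relation to hold.
    """
    return [
        abs(sum(segment[n] * cmath.exp(-2j * math.pi * k * n / n_freq)
                for n in range(len(segment)))) ** 2
        for k in range(n_freq)
    ]


# Assumed toy segment: two sinusoids under a 32-point Hamming window.
Nw = 32
window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]
segment = [window[n] * (math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n)) for n in range(Nw)]

n_freq = 64
column = spectrogram_column(segment, n_freq)

energy_from_time = sum(value * value for value in segment)
energy_from_frequency = sum(column) / n_freq
```

The two energies agree up to rounding error, so each column of the spectrogram describes how the energy of that windowed segment is distributed over frequency.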
Spectrographic Analysis of Speech
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 95
Elements of a Language Transitional Speech Sounds Diphthongs
Vowel like nature with vibrating vocal folds
Do not have a steady vocal tract configuration They are produced by
varying in time the vocal tract smoothly between two vowel configurations
Characterized by movement from one vowel target to another
hide Y out W boy O new JU
April 22 2023 Veton Keumlpuska 96
Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like
sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)
Glides Greater constriction of oral tract during the
transition and Greater speed of the oral tract movement
compared to diphthongs
April 22 2023 Veton Keumlpuska 97
Figure 328 ndash Liquids amp Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language Transitional Speech Sounds Affricates are the counterpart of
diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that
the affricates have A fricative portion preceded by a complete
constriction of the oral cavity Formed at the same place as for the plosive
Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced
April 22 2023 Veton Keumlpuska 99
Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a
target state or shape often the target is never reached Our speech anatomy cannot move to a desired position
instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and
graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level
Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo
Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes
April 22 2023 Veton Keumlpuska 100
Prosody The Melody of Speech Prosody of a language is defined by the
rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)
These rules are followed to convey different Meaning Stress and Emotion
April 22 2023 Veton Keumlpuska 101
Figure 329 - Prosody
April 22 2023 Veton Keumlpuska 102
Figure 330 ndash Global Coarticulation
Source Generation
There is another class of source generated within the vocal tract, but it is less well understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.
Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives: sounds generated from the friction of moving air against an oral tract constriction. Example: "thin".
Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these subcategories are not exclusive to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the same subcategories may exist for voiced sounds.
Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives).
Categorization of Sound By Source
Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window
$$w[n, \tau] = 0.54 - 0.46\cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n - \tau \le N_w - 1$$
The window typically does not move one sample at a time. It moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segment as a function of the window position $\tau$.
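The sliding-window analysis described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the code behind the slides' figures: the function names (hamming, stft_frame, spectrogram), the parameters hop and n_freq, and the cosine test signal are assumptions made for the example, and the frequency axis is sampled at n_freq uniformly spaced points.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window w[k] = 0.54 - 0.46*cos(2*pi*k/(n_w - 1)), 0 <= k <= n_w - 1."""
    if n_w == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n_w - 1)) for k in range(n_w)]


def stft_frame(x, tau, window, n_freq):
    """X(omega_i, tau) at omega_i = 2*pi*i/n_freq, with the window starting at sample tau."""
    frame = []
    for i in range(n_freq):
        omega = 2.0 * math.pi * i / n_freq
        acc = 0.0 + 0.0j
        for k, w_k in enumerate(window):
            n = tau + k  # absolute time index of this sample
            if 0 <= n < len(x):
                acc += w_k * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame


def spectrogram(x, n_w, hop, n_freq):
    """One row per window position tau = 0, hop, 2*hop, ...; entries are |X(omega, tau)|^2."""
    window = hamming(n_w)
    rows = []
    for tau in range(0, len(x) - n_w + 1, hop):
        rows.append([abs(v) ** 2 for v in stft_frame(x, tau, window, n_freq)])
    return rows


# Hypothetical test signal: a cosine that falls on frequency sample 8 of 64.
x = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
S = spectrogram(x, n_w=64, hop=32, n_freq=64)
peak_bin = max(range(33), key=lambda i: S[0][i])  # search 0..pi only
```

Here hop plays the role of the frame interval: a larger hop gives fewer rows (a lower frame rate) without changing the frequency resolution, which is set by the window length n_w.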
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$ as a function of $\omega$.
A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
Narrowband: gives good spectral resolution, that is, a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband: gives good temporal resolution, that is, a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Recall that for voiced speech the waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau)\right|^2$$
where $\hat{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n].
Narrowband Spectrogram
Uses a "long" window, typically with a duration of at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms do not overlap and that the corresponding transform side lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved" and appear as horizontal striations in the time-frequency plane of the spectrogram.
A long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window is shorter than a pitch period, the window essentially "sees" pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
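The energy under the sliding window can be illustrated with a small sketch. It assumes a made-up steady-state "voiced" signal (a decaying pulse repeated every P samples) and a Hamming window shorter than one pitch period. The names short_time_energy, P, and the decay factor 0.8 are illustrative choices, not values from the slides.

```python
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    if n_w == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n_w - 1)) for k in range(n_w)]


def short_time_energy(x, n_w, hop=1):
    """E[tau] = sum over the window of (w[n - tau] * x[n])^2, for each window start tau."""
    window = hamming(n_w)
    return [
        sum((window[k] * x[tau + k]) ** 2 for k in range(n_w))
        for tau in range(0, len(x) - n_w + 1, hop)
    ]


# Hypothetical steady-state voiced signal: a decaying pulse every P samples.
P = 20
x = [0.8 ** (n % P) for n in range(10 * P)]

# Window of 8 samples, shorter than the pitch period of 20 samples.
E = short_time_energy(x, n_w=8)
ratio = max(E) / min(E)  # how strongly the energy fluctuates within a period
```

Because the window is shorter than a pitch period, E repeats every P samples and swings between a large value (window over the pulse onset) and a small one (window over the decayed tail). Scaling a fixed spectral envelope by this fluctuating energy is what produces a pattern that repeats once per pitch period along the time axis.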
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
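A quick calculation makes the 20-ms versus 4-ms comparison concrete. The sketch below classifies a window by the number of pitch periods it spans, using the criteria stated on the slides (at least two periods for narrowband, less than one for wideband). The 125-Hz pitch is a hypothetical value, and the main-lobe width of about 4/T Hz (8π/N_w radians) for a Hamming window of duration T is a standard approximation that does not appear in the slides.

```python
def spectrogram_type(window_s, f0_hz):
    """Classify an analysis window by the number of pitch periods it spans."""
    periods = window_s * f0_hz
    if periods >= 2.0:
        return "narrowband"   # long window: harmonics are resolved
    if periods < 1.0:
        return "wideband"     # short window: individual pitch periods are resolved
    return "intermediate"


def hamming_mainlobe_hz(window_s):
    """Approximate null-to-null main-lobe width of a Hamming window transform, in Hz."""
    return 4.0 / window_s


f0 = 125.0  # hypothetical pitch in Hz (pitch period of 8 ms)
for window_s in (0.020, 0.004):
    kind = spectrogram_type(window_s, f0)
    width = hamming_mainlobe_hz(window_s)
    print(f"{window_s * 1000:.0f} ms window: {kind}, main lobe about {width:.0f} Hz wide")
```

Under these assumptions the 20-ms window has a main lobe about 200 Hz wide, so each harmonic of a 125-Hz voice still produces its own peak, while the 4-ms window has a main lobe about 1000 Hz wide that spans several harmonics and smears them into the spectral envelope.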
Figure 3.15
Figure 3.16
MATLAB
[MATLAB figures: spectrograms (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) and the normalized-magnitude waveform of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels.]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in a Khoisan "click" language.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic. Pitch is not important for categorizing a sound in English. In tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
Articulatory differences among speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. The noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: the location of the constriction made by the tongue or lips determines which sound is produced. It can be at the back, center, or front of the oral tract, as well as at the teeth or lips.
Spectrogram: noise-like, with energy concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model, the periodic component of the voiced fricative output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
$$x_q[n] = h_f[n] * (q[n]\, u[n])$$
Example 3.4 (cont.)
In this simplified model we assume that the outputs of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
u[n] is modified by the oral cavity.
x_q[n] can be influenced by the back cavity.
Sources of nonlinear effects (distributed sources due to traveling vortices).
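The additive model of Example 3.4 can be sketched numerically. Everything concrete below is an assumption made for illustration: the short glottal pulse g, the impulse responses h and hf, the pitch period, and the use of seeded Gaussian samples as the white noise q[n]. The sketch only mirrors the structure x[n] = h[n] * u[n] + h_f[n] * (q[n] u[n]).

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def impulse_train(length, period):
    """p[n] = 1 at n = 0, period, 2*period, ... and 0 elsewhere."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, hf, period, length, seed=0):
    """Return (x, xg, xq, u) for the simplified two-source model."""
    rng = random.Random(seed)
    u = convolve(g, impulse_train(length, period))[:length]  # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]         # white noise source
    modulated = [q_n * u_n for q_n, u_n in zip(q, u)]        # q[n] u[n]
    xg = convolve(h, u)[:length]                             # periodic component
    xq = convolve(hf, modulated)[:length]                    # noise component
    x = [a + b for a, b in zip(xg, xq)]                      # the two outputs add
    return x, xg, xq, u


# Hypothetical parameter values, chosen only to keep the example small.
g = [0.0, 0.5, 1.0, 0.5]   # glottal flow over one cycle
h = [1.0, 0.6, 0.3]        # vocal tract impulse response
hf = [1.0, -0.9]           # front cavity impulse response
x, xg, xq, u = voiced_fricative(g, h, hf, period=8, length=32)
```

Because the noise is multiplied by u[n], the noise component xq is nonzero only around the open phase of each glottal cycle, so the noise appears in bursts at the pitch rate rather than as a steady hiss.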
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise, while a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives - Waveform
Elements of a Language: Plosives - Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator. It must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m].
h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be combined linearly:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
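The generalized convolution of Example 3.5 can be sketched as follows. The impulse responses here are hypothetical one-pole decays whose pole drifts with time, truncated to max_lag samples and taken as causal (zero for m < 0). They are placeholders to exercise the formula, not models of a real vocal tract.

```python
def time_varying_output(h, u, max_lag):
    """y[n] = sum over m = 0..max_lag-1 of h(n, m) * u[n - m], with u[n] = 0 for n < 0."""
    y = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n, max_lag - 1) + 1):
            acc += h(n, m) * u[n - m]
        y.append(acc)
    return y


def plosive_output(h, h_f, u, max_lag):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) delta[n - m]."""
    burst = [1.0] + [0.0] * (len(u) - 1)  # idealized burst source delta[n]
    voiced = time_varying_output(h, u, max_lag)
    released = time_varying_output(h_f, burst, max_lag)
    return [a + b for a, b in zip(voiced, released)]


# Hypothetical time-varying responses (illustrative only).
def h(n, m):
    return (0.5 + 0.01 * min(n, 30)) ** m   # pole drifts from 0.5 toward 0.8


def h_f(n, m):
    return (-0.6) ** m                      # front cavity, fixed for simplicity


# Periodic source u[n] = g[n] * p[n] with pitch period 8, written out directly.
g = [0.0, 0.5, 1.0, 0.5]
u = [g[n % 8] if n % 8 < len(g) else 0.0 for n in range(40)]
x = plosive_output(h, h_f, u, max_lag=8)
```

Two checks follow from the formula: if h(n, m) does not depend on n, the sum reduces to ordinary convolution, and the burst term collapses to h_f(n, n), the response at time n to the unit sample applied at time 0.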
Elements of a Language: Transitional Speech Sounds - Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Diphthongs do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
They are characterized by movement from one vowel target to another.
Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
Elements of a Language: Transitional Speech Sounds - Semi-Vowels
There are two categories of vowel-like sounds: glides (w as in "we" and y as in "you") and liquids (r as in "read" and l as in "let").
Glides: greater constriction of the oral tract during the transition, and greater speed of the oral tract movement, compared to diphthongs.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds - Affricates
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that an affricate has a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced); J as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape. Often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept
introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]
h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m
The output then can be written using generalization of the convolution operator
We have assumed that two outputs can be linearly combined
mnmnhmnumnhnxm
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language Transitional Speech Sounds Diphthongs
Vowel like nature with vibrating vocal folds
Do not have a steady vocal tract configuration They are produced by
varying in time the vocal tract smoothly between two vowel configurations
Characterized by movement from one vowel target to another
hide Y out W boy O new JU
April 22 2023 Veton Keumlpuska 96
Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like
sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)
Glides Greater constriction of oral tract during the
transition and Greater speed of the oral tract movement
compared to diphthongs
April 22 2023 Veton Keumlpuska 97
Figure 328 ndash Liquids amp Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language Transitional Speech Sounds Affricates are the counterpart of
diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that
the affricates have A fricative portion preceded by a complete
constriction of the oral cavity Formed at the same place as for the plosive
Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced
April 22 2023 Veton Keumlpuska 99
Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a
target state or shape often the target is never reached Our speech anatomy cannot move to a desired position
instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and
graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level
Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo
Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes
April 22 2023 Veton Keumlpuska 100
Prosody The Melody of Speech Prosody of a language is defined by the
rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)
These rules are followed to convey different Meaning Stress and Emotion
April 22 2023 Veton Keumlpuska 101
Figure 329 - Prosody
April 22 2023 Veton Keumlpuska 102
Figure 330 ndash Global Coarticulation
Categorization of Sound By Source
Spectrographic Analysis of Speech
Spectrographic Analysis of Speech
- A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  - Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  - In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
Spectrographic Analysis of Speech
- In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
- This window, w[n], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  - Example: the Hamming window
    $$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n \le N_w - 1$$
- The window does not necessarily move one sample at a time; it typically moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
- The short-time Fourier transform is
  $$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
  where
  $$x[n, \tau] = w[n, \tau]\, x[n]$$
  represents the windowed speech segments as a function of the window center at time $\tau$.
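The sliding-window analysis described on this slide can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the sampling rate `FS`, the two-tone toy signal, the window length of 64 samples, and the hop of 32 samples are all assumptions chosen for the example, and the window is written in its $\tau = 0$ form and shifted along the signal. The phase of each frame is referenced to the start of the window, which does not affect the magnitudes.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n / (n_w - 1)), 0 <= n <= n_w - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft(x, n_w, hop):
    """Return one DFT (list of complex bins) per window position tau = 0, hop, 2*hop, ..."""
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        seg = [w[n] * x[tau + n] for n in range(n_w)]  # windowed segment x[n, tau]
        dft = [sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_w) for n in range(n_w))
               for k in range(n_w)]
        frames.append(dft)
    return frames

def peak_bin(dft):
    """Index of the strongest bin among the non-negative frequencies."""
    half = len(dft) // 2
    return max(range(half + 1), key=lambda k: abs(dft[k]))

# Toy signal (assumed): 500 Hz for the first half, 1500 Hz for the second half.
FS = 8000
x = [math.sin(2 * math.pi * 500 * n / FS) for n in range(512)]
x += [math.sin(2 * math.pi * 1500 * n / FS) for n in range(512)]

frames = stft(x, n_w=64, hop=32)
first_hz = peak_bin(frames[0]) * FS / 64    # dominant frequency in the first frame
last_hz = peak_bin(frames[-1]) * FS / 64    # dominant frequency in the last frame
print(first_hz, last_hz)
```

A single Fourier transform of the whole signal would show both tones at once; the per-frame transforms show that the dominant frequency changes between the first and the last frame.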
Spectrographic Analysis of Speech
- The spectrogram is graphically displayed as
  $$S(\omega, \tau) = |X(\omega, \tau)|^2$$
- $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
- For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
- This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
- Note that for voiced speech the waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
  $$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$
  where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
- From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
  $$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
  where $\hat{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
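The source-filter model on this slide can be sketched numerically. The following is an illustration under stated assumptions only: the pitch period `P`, the raised-cosine glottal pulse `g`, and the single damped resonance used for `h` are hypothetical shapes chosen for the example, not values from the lecture. The sketch builds $\hat{h}[n] = g[n] * h[n]$ and checks that $p[n] * \hat{h}[n]$ equals $(p[n] * g[n]) * h[n]$.

```python
import math

def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai != 0.0:
            for j, bj in enumerate(b):
                out[i + j] += ai * bj
    return out

FS = 8000        # sampling rate in Hz (assumed)
P = 80           # pitch period in samples, i.e. 100 Hz pitch (assumed)
N_PERIODS = 6

# p[n]: impulse train with period P
p = [1.0 if n % P == 0 else 0.0 for n in range(N_PERIODS * P)]

# g[n]: toy glottal flow over one cycle (raised-cosine bump, hypothetical shape)
g = [0.5 * (1 - math.cos(2 * math.pi * n / 31)) for n in range(32)]

# h[n]: toy vocal tract response, one damped resonance near 500 Hz (hypothetical)
h = [math.exp(-n / 30.0) * math.cos(2 * math.pi * 500 * n / FS) for n in range(120)]

h_hat = convolve(g, h)                 # h_hat[n] = g[n] * h[n]
x = convolve(p, h_hat)                 # x[n] = p[n] * h_hat[n]
x_alt = convolve(convolve(p, g), h)    # (p[n] * g[n]) * h[n]

max_diff = max(abs(a - b) for a, b in zip(x, x_alt))
print(len(x), max_diff)
```

Once the start-up transient has passed, the synthesized waveform repeats every `P` samples, which is the periodicity that produces harmonics at multiples of $2\pi/P$ in the spectrogram expression above.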
Spectrographic Analysis of Speech
- The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n].
- Narrowband spectrogram:
  - Uses a "long" window with a duration of typically at least two pitch periods.
  - Under the conditions that
    - the main lobes of the shifted window Fourier transforms are non-overlapping, and
    - the corresponding transform side lobes are negligible,
    the following approximation to the equation on the previous slide holds (Exercise 3.8):
    $$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \left|\hat{H}(\omega_k)\right|^2 \left|W(\omega - \omega_k, \tau)\right|^2$$
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
- The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
  $$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
  where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
  $$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
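To make the energy term concrete, the sketch below computes the energy under a short sliding Hamming window for a toy steady-state voiced sound. Everything numeric here is an assumption for illustration: an 8 kHz sampling rate, a pitch period of 80 samples, a 32-sample (4 ms) window, and a damped 1 kHz resonance standing in for $\hat{h}[n]$. The window index `tau` is taken as the window start rather than its center, which only shifts the time axis.

```python
import math

FS = 8000       # sampling rate in Hz (assumed)
P = 80          # pitch period in samples (assumed 100 Hz pitch)
N_W = 32        # 4 ms window, shorter than one pitch period

# Toy steady-state voiced sound: a damped 1 kHz resonance restarted every P samples.
h_hat = [math.exp(-n / 10.0) * math.cos(2 * math.pi * 1000 * n / FS) for n in range(P)]
x = h_hat * 6   # six identical pitch periods

w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N_W - 1)) for n in range(N_W)]

def energy(tau):
    """Energy of the windowed segment: sum over n of |w[n] x[tau + n]|^2."""
    return sum((w[n] * x[tau + n]) ** 2 for n in range(N_W))

E = [energy(tau) for tau in range(len(x) - N_W + 1)]

# Location of the energy maximum inside successive pitch periods.
peaks = [max(range(k * P, (k + 1) * P), key=lambda t: E[t]) for k in range(1, 5)]
spacing = [b - a for a, b in zip(peaks, peaks[1:])]
print(spacing)
```

The energy maxima recur once per pitch period, and the energy between them is much smaller, so a spectrogram scaled by this energy brightens and fades once per pitch period.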
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  - The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
- Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
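The effect of the two window lengths can be reproduced on an idealized periodic source. The sketch below is an illustration under assumptions, not a reconstruction of Figure 3.15: it uses a bare impulse train with no vocal tract filtering, an assumed 8 kHz sampling rate, and an assumed 200 Hz pitch, chosen so that the 20 ms window spans four pitch periods while the 4 ms window is shorter than one.

```python
import cmath
import math

FS = 8000       # sampling rate in Hz (assumed)
P = 40          # pitch period in samples, i.e. 200 Hz pitch (assumed)
NFFT = 160      # DFT length, giving 50 Hz bin spacing

def hamming(n_w):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def frame_spectrum(x, start, n_w):
    """Squared DFT magnitude of one Hamming-windowed frame, zero-padded to NFFT."""
    w = hamming(n_w)
    seg = [w[n] * x[start + n] for n in range(n_w)]
    return [abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / NFFT)
                    for n in range(n_w))) ** 2
            for k in range(NFFT // 2 + 1)]

x = [1.0 if n % P == 0 else 0.0 for n in range(400)]   # idealized periodic source

narrow = frame_spectrum(x, start=100, n_w=FS * 20 // 1000)  # 20 ms = 160 samples
wide = frame_spectrum(x, start=104, n_w=FS * 4 // 1000)     # 4 ms = 32 samples

# Bin 20 is 1000 Hz (a harmonic of 200 Hz); bin 22 is 1100 Hz (midway between harmonics).
print("narrowband:", narrow[20], narrow[22])
print("wideband:  ", wide[20], wide[22])
```

With the long window the spectrum is large at the harmonic and nearly zero between harmonics, so the harmonic lines are resolved. With the short window the frame contains a single impulse and the spectrum is flat across frequency; with a real vocal tract response it would follow the spectral envelope instead.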
Figure 3.15
Figure 3.16
MATLAB
[Figure: spectrogram titled "SPECTROGRAM"; horizontal axis Time [s], 0 to 3; vertical axis Frequency [Hz], 0 to 8000.]
MATLAB
[Figure: waveform (Normalized Magnitude) above its spectrogram (Frequency [Hz], 0 to 8000) over Time [s], 0 to 3, for the utterance "She had your dark suit in greasy wash water all year", labeled with words and phones: h# | she: sh iy | had: hv ae dcl jh | your: ax-h | dark: dcl d aa r kcl k | suit: s ux tcl | in: ax-h n | greasy: gcl g r iy s ix | wash: w aa sh | water: w ao dx axr | all: ao l | year: y ih axr | h#]
Categorization of Speech Sounds
- A sound source can be created with either the vocal folds or a constriction in the vocal tract.
- Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
     - the place of the tongue hump along the oral tract, and
     - the degree of constriction of the hump.
     - The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
- Phoneme: a fundamental distinctive unit of a language.
  - To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  - Different languages contain different phoneme sets.
    - Syllables contain one or more phonemes.
    - Words are formed from one or more syllables.
    - Phrases are concatenations of words.
- If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
- If the last two perspectives (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.
Elements of a Language
- One broad classification for the English language is in terms of:
  - Vowels
  - Consonants
  - Diphthongs
  - Affricates
  - Semi-vowels
- This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
- Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open
  - Tongue position and height: front, central, or back along the palate
  - Constriction: partial or complete
  - Velum state: nasal sound or not a nasal sound
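The feature list above maps naturally onto a small record type. The sketch below is only an illustration: the type names, the `broad_class` helper, and the free-form `place` strings are hypothetical, and the rule covers just a few consonant classes (it does not handle vowels, affricates, or semi-vowels).

```python
from dataclasses import dataclass
from enum import Enum

class Constriction(Enum):
    PARTIAL = "partial"
    COMPLETE = "complete"

@dataclass(frozen=True)
class Articulation:
    voiced: bool                 # vocal folds vibrating (True) or open (False)
    place: str                   # coarse place of constriction (illustrative label)
    constriction: Constriction   # partial or complete
    nasal: bool                  # velum lowered, so the sound is nasal

def broad_class(a):
    """Very coarse consonant class derived from the features (illustrative rule only)."""
    if a.nasal:
        return "nasal"
    if a.constriction is Constriction.COMPLETE:
        return "plosive"
    return "fricative"

EXAMPLES = {
    "m": Articulation(True, "lips", Constriction.COMPLETE, True),
    "b": Articulation(True, "lips", Constriction.COMPLETE, False),
    "p": Articulation(False, "lips", Constriction.COMPLETE, False),
    "z": Articulation(True, "front", Constriction.PARTIAL, False),
    "s": Articulation(False, "front", Constriction.PARTIAL, False),
}

for phone, features in EXAMPLES.items():
    voicing = "voiced" if features.voiced else "unvoiced"
    print(phone, voicing, broad_class(features))
```

Pairs such as "b"/"p" and "z"/"s" differ only in the vocal fold state, which is the voiced/unvoiced distinction used in the fricative and plosive slides.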
Elements of a Language
- In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  - 11 in Polynesian
  - 141 in the "click" language of Khoisan
- The rules of a language define which phones can be strung together, and how, to form words.
  - In Italian, consonants are not allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
- The articulatory properties are influenced by:
  - adjacent phonemes,
  - speaking rate,
  - emphasis in speaking, and
  - the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  - Example: "butter", "but", and "to", where the t in each word is somewhat different.
- Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
- Source: quasi-periodic.
  - Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
- In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  - Articulatory differences between speakers are one cause of allophonic variation:
    - the place and degree of constriction of the tongue hump, and
    - the cross-section and length of the vocal tract.
  - Therefore, the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram:
  - It is dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    - These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
    - The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract.
  - Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
- There are two broad classes of fricatives: voiced and unvoiced.
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
- System: the location of the constriction made by the tongue or lips determines which sound is produced:
  - the back, center, or front of the oral tract, as well as
  - the teeth or lips.
- Spectrogram: noise-like, with energy concentrated in the higher frequencies.
Example 3.4
- A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  $$u[n] = g[n] * p[n]$$
  where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
- Simplified model of the voiced-fricative output at the lips due to the periodic source:
  $$x_g[n] = h[n] * (g[n] * p[n])$$
  where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
- The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the modulated noise in turn excites the front oral cavity, which has impulse response $h_f[n]$:
  $$x_q[n] = h_f[n] * (q[n]\, u[n])$$
Example 3.4 (cont.)
- In the simplified model we assume that the results of the two airflow sources add:
  $$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$
- See Exercise 3.10 for special characteristics of x[n].
- Issues that have been ignored:
  - u[n] is modified by the oral cavity.
  - $x_q[n]$ can be influenced by the back cavity.
  - Sources of nonlinear effects (distributed sources due to traveling vortices).
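The simplified voiced-fricative model of Example 3.4 can be sketched as follows. This is an illustration under assumptions: the sampling rate, pitch period, glottal pulse shape, and the two damped resonances standing in for h[n] and $h_f[n]$ are hypothetical choices, and a seeded Gaussian generator plays the role of the white noise q[n].

```python
import math
import random

def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai != 0.0:
            for j, bj in enumerate(b):
                out[i + j] += ai * bj
    return out

def resonance(freq_hz, decay, length, fs):
    """Toy impulse response: one exponentially damped cosine (hypothetical shape)."""
    return [math.exp(-n / decay) * math.cos(2 * math.pi * freq_hz * n / fs)
            for n in range(length)]

FS, P, N = 8000, 80, 320          # sampling rate, pitch period, length (all assumed)
rng = random.Random(0)

p = [1.0 if n % P == 0 else 0.0 for n in range(N)]                    # impulse train
g = [0.5 * (1 - math.cos(2 * math.pi * n / 39)) for n in range(40)]   # one glottal cycle
u = convolve(g, p)[:N]                                                # u[n] = g[n] * p[n]
q = [rng.gauss(0.0, 1.0) for _ in range(N)]                           # white noise q[n]

h = resonance(500, 30.0, 100, FS)      # whole vocal tract (toy)
h_f = resonance(3000, 8.0, 40, FS)     # front oral cavity (toy)

x_g = convolve(h, u)[:N]                       # periodic component h[n] * u[n]
noise = [q[n] * u[n] for n in range(N)]        # noise modulated by the glottal flow
x_q = convolve(h_f, noise)[:N]                 # noise component h_f[n] * (q[n] u[n])
x = [a + b for a, b in zip(x_g, x_q)]          # the two components add
print(len(x))
```

Because the noise is multiplied by u[n], it is switched off whenever the modeled glottal flow is zero, so the noise component is pulsed at the pitch rate rather than being stationary.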
Elements of a Language: Fricatives
- Spectrogram:
  - Unvoiced fricatives are characterized by a "noisy" spectrum, while
  - voiced fricatives often show both noise and harmonics.
- Waveform:
  - An unvoiced fricative contains only noise.
  - A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
- Whisper forms a class of its own under the general category of consonants.
- The turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
- Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
- As with fricatives, plosives can be voiced or unvoiced.
- System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
- Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of the air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
- With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  - While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  - After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration.
  - There is a much shorter delay between the burst and the voicing of the vowel onset.
- Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
- A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
- Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$.
- The glottal flow velocity model for the periodic source component is given by
  $$u[n] = g[n] * p[n]$$
  where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
- Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
  - This implies that the vocal tract output cannot be obtained with the convolution operator.
  - The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
- In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond the constriction, denoted by $h_f[n, m]$.
  - h[n, m] and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
- The output can then be written using a generalization of the convolution operator:
  $$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
- We have assumed that the two outputs can be linearly combined.
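The generalized convolution in Example 3.5 can be evaluated directly once the two time-varying responses are specified. The sketch below uses hypothetical toy responses: a damped resonance whose frequency glides from 300 Hz to 700 Hz for h[n, m], and a fixed, quickly decaying 2500 Hz resonance for the front cavity. The sums are truncated to `M` past samples, which is an assumption made to keep the example finite.

```python
import math

FS, P, N, M = 8000, 80, 240, 60   # rate, pitch period, length, memory (all assumed)

G = [0.5 * (1 - math.cos(2 * math.pi * n / 31)) for n in range(32)]  # one glottal cycle

def u(n):
    """Periodic glottal flow u[n] = g[n] * p[n], taken as zero before n = 0."""
    return G[n % P] if n >= 0 and n % P < len(G) else 0.0

def delta(n):
    """Idealized burst source: a unit impulse at n = 0."""
    return 1.0 if n == 0 else 0.0

def h(n, m):
    """Toy vocal tract response at time n to a unit sample applied m samples earlier."""
    freq = 300.0 + 400.0 * n / (N - 1)      # resonance glides as the tract changes shape
    return math.exp(-m / 20.0) * math.cos(2 * math.pi * freq * m / FS)

def h_f(n, m):
    """Toy front-cavity response (held fixed over time here for simplicity)."""
    return math.exp(-m / 5.0) * math.cos(2 * math.pi * 2500.0 * m / FS)

def time_varying_output(resp, src, n):
    """Sum over m of resp(n, m) * src(n - m), truncated to 0 <= m < M."""
    return sum(resp(n, m) * src(n - m) for m in range(M))

x = [time_varying_output(h, u, n) + time_varying_output(h_f, delta, n)
     for n in range(N)]
print(x[0])
```

Because the burst source is a single impulse at n = 0, the burst term at time n reduces to $h_f[n, n]$. If h(n, m) did not depend on n, the first sum would reduce to an ordinary convolution.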
Elements of a Language: Transitional Speech Sounds
- Diphthongs:
  - Vowel-like in nature, with vibrating vocal folds.
  - Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  - Characterized by movement from one vowel target to another.
  - Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
Elements of a Language: Transitional Speech Sounds
- Semi-vowels: two categories of vowel-like sounds:
  - Glides (w as in "we" and y as in "you"), and
  - Liquids (r as in "read" and l as in "let").
- Glides, compared to diphthongs, have:
  - greater constriction of the oral tract during the transition, and
  - greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
- Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
- The difference compared to fricatives is that affricates have:
  - a fricative portion preceded by a complete constriction of the oral cavity,
  - formed at the same place as for the plosive.
- Examples:
  - tS as in "chew" (unvoiced)
  - J as in "just" (voiced)

Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached
  Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance
Coarticulation can occur on different temporal levels
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs "horseshoe", "sweep" vs "seep")
  Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different
  Meaning
  Stress
  Emotion

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation
Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events; this time variation corresponds to highly fluctuating spectral characteristics over time
  Example: the word "to"
  A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content
  In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability

Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced
  The window $w[n]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
  Example: the Hamming window placed at time $\tau$
  $$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$
  The window typically does not move one sample at a time; it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal
The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window position $\tau$
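A minimal numerical sketch of the windowing and transform steps above, assuming NumPy. The names hamming_window and stft, the frame interval hop, the DFT size n_fft, and the test sinusoid are illustrative choices, not taken from the slides. Each segment is transformed with its own time origin at the window start, which changes only the phase relative to the definition above.

```python
import numpy as np


def hamming_window(n_w):
    """Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n / (N_w - 1)), 0 <= n <= N_w - 1."""
    n = np.arange(n_w)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (n_w - 1))


def stft(x, n_w, hop, n_fft):
    """Short-time Fourier transform sampled on a uniform frequency grid.

    Row i holds the DFT of the windowed segment that starts at tau = i * hop.
    The time origin of each DFT is the segment start, so only the phase
    (not the magnitude) differs from the definition in the text.
    """
    w = hamming_window(n_w)
    starts = range(0, len(x) - n_w + 1, hop)
    frames = np.stack([w * x[s:s + n_w] for s in starts])
    return np.fft.rfft(frames, n=n_fft, axis=1)


# Hypothetical test signal: a sinusoid at 0.125 cycles per sample.
x = np.cos(2.0 * np.pi * 0.125 * np.arange(512))
X = stft(x, n_w=64, hop=32, n_fft=256)
peak_bins = np.argmax(np.abs(X), axis=1)  # strongest frequency bin in each frame
```

With these settings every frame peaks at the bin closest to the sinusoid frequency; a smaller hop gives a higher frame rate at the cost of more transforms.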

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
  $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal
  For each window position $\tau$ one could plot $S(\omega, \tau)$ as a function of frequency
  A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page
  This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms
    Narrowband: gives good spectral resolution, that is, a good view of the frequency content of sine waves with closely spaced frequencies
    Wideband: gives good temporal resolution, that is, a good view of the temporal content of impulses closely spaced in time

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that for voiced speech the waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$, driven by a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \tilde{h}[n])$$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$
From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\tilde{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency
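A short sketch of how an expression of this form follows from the model, assuming periodic excitation and a linear time-invariant combined response. It is one plausible route to the result of Example 3.2, not a reproduction of that example; the sums run over the harmonics $\omega_k$.

```latex
% DTFT of the impulse train p[n] = \sum_k \delta[n - kP]
P(\omega) = \frac{2\pi}{P} \sum_{k} \delta(\omega - \omega_k),
\qquad \omega_k = \frac{2\pi}{P} k

% Windowing multiplies in time, so it convolves in frequency
X(\omega, \tau)
  = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(\omega - \theta, \tau)\, \tilde{H}(\theta)\, P(\theta)\, d\theta
  = \frac{1}{P} \sum_{k} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau)

% The squared magnitude gives the spectrogram
S(\omega, \tau) = |X(\omega, \tau)|^2
  = \frac{1}{P^2} \left| \sum_{k} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2
```

Each harmonic therefore contributes a copy of the window transform, centered at $\omega_k$ and weighted by the combined glottal and vocal tract response at that harmonic.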

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the analysis window $w[n]$
Narrowband Spectrogram
  Uses a "long" window, with a duration of typically at least two pitch periods
  Under the conditions that
    The main lobes of the shifted window Fourier transforms do not overlap, and
    The corresponding transform side lobes are negligible
  the equation on the previous slide leads to the following approximation (Exercise 3.8)
  $$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
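The step from the exact expression to this approximation can be sketched by expanding the squared magnitude. The shorthand $a_k(\omega)$ is introduced here for illustration only and does not appear in the slides.

```latex
% Shorthand for the contribution of the k-th harmonic
a_k(\omega) = \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau)

% Squared magnitude of a sum: diagonal terms plus cross terms
\left| \sum_{k} a_k(\omega) \right|^2
  = \sum_{k} |a_k(\omega)|^2
  + \sum_{k} \sum_{l \neq k} a_k(\omega)\, a_l^{*}(\omega)

% If the main lobes do not overlap and the side lobes are negligible, then at
% any \omega at most one a_k(\omega) is non-negligible, so every cross term
% a_k a_l^* with l \neq k is approximately zero and only the first sum remains.
S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2
```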

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented)

Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window, with a duration of less than one pitch period (see Figure 3.14)
  Shortening the window widens its Fourier transform (recall the uncertainty principle)
  The wider window transforms centered on neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions
  From a temporal perspective, since the window is shorter than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
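The contrast between the two window lengths can be tried on a synthetic voiced sound. The sketch below assumes NumPy and uses made-up values throughout: an 8-kHz sampling rate, a 200-Hz pitch, and a single toy resonance at 1000 Hz. Only the 20-ms and 4-ms Hamming window durations come from the text; the helper names are illustrative.

```python
import numpy as np


def spectrogram(x, n_w, hop, n_fft):
    """S[i, k] = |X(omega_k, tau_i)|^2 using a Hamming window of n_w samples."""
    w = np.hamming(n_w)
    starts = range(0, len(x) - n_w + 1, hop)
    frames = np.stack([w * x[s:s + n_w] for s in starts])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2


fs = 8000                      # sampling rate in Hz (assumed)
P = 40                         # pitch period in samples: 5 ms, i.e. 200 Hz
n = np.arange(200)
h = 0.95 ** n * np.cos(2 * np.pi * 1000 * n / fs)  # toy 1000-Hz "formant"
p = np.zeros(4000)
p[::P] = 1.0                   # impulse train p[n]
x = np.convolve(p, h)[:len(p)]

n_fft = 800                    # 10 Hz per bin, so bin 100 is 1000 Hz
S_nb = spectrogram(x, n_w=160, hop=8, n_fft=n_fft)  # 20-ms window (narrowband)
S_wb = spectrogram(x, n_w=32, hop=8, n_fft=n_fft)   # 4-ms window (wideband)

mid = slice(50, 400)           # skip the start-up transient


def harmonic_ripple(S):
    """Time-averaged power at a harmonic (1000 Hz) over power between harmonics (1100 Hz)."""
    avg = S[mid].mean(axis=0)
    return avg[100] / avg[110]


def energy_fluctuation(S):
    """Largest over smallest frame energy, with S summed over frequency."""
    e = S[mid].sum(axis=1)
    return e.max() / e.min()
```

In this toy case the narrowband spectrogram has a deep dip between harmonics and nearly constant frame energy (horizontal striations), while the wideband spectrogram is smooth across frequency and its frame energy rises and falls within each pitch period (vertical striations).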

Figure 3.15

Figure 3.16

MATLAB figure: spectrogram, frequency [Hz] (0 to 8000) versus time [s] (0 to 3)

MATLAB figure: waveform (normalized magnitude) above its spectrogram, frequency [Hz] (0 to 8000) versus time [s] (0 to 3), for the utterance "She had your dark suit in greasy wash water all year", labeled with words and phones: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr)

(Both MATLAB figures appear twice in the slides.)

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract
Speech sounds can also be classified from the following perspectives
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
  2. The shape of the vocal tract (place and manner of articulation)
       The place of the tongue hump along the oral tract
       The degree of constriction of the hump
       The shape is also determined by a possible connection to the nasal passage by way of the velum
  3. The time-domain waveform, which gives the pressure change with time at the lips output
  4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language
Phoneme: a fundamental distinctive unit of a language
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme
  Different languages contain different phoneme sets
    Syllables contain one or more phonemes
    Words are formed from one or more syllables
    Phrases are concatenations of words
If the first two descriptors (the nature of the source and the shape of the vocal tract) are used to study speech sounds, the study is referred to as articulatory phonetics
If the last two descriptors (the time-domain waveform and the time-varying spectrum) are used, the study is referred to as acoustic phonetics

Elements of a Language
One broad classification of the sounds of English is in terms of
  Vowels
  Consonants
  Diphthongs
  Affricates
  Semi-vowels
This classification is illustrated in Figure 3.17

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features
Articulatory features (corresponding to the first two category descriptors) include
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language
In English the combinations of features give 40 phonemes; other languages can yield a smaller or larger number
  11 in Polynesian
  141 in the "click" language of Khoisan
Rules of a language define which phones can be strung together to form words, and how
  In Italian, consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents)
The articulatory properties are influenced by
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking
  The time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme
  Example: "butter", "but", and "to", where the t in each word is somewhat different
Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language

Elements of a Language: Vowels
Source: quasi-periodic
  Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch
System: each vowel phoneme corresponds to a different vocal tract configuration
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram)
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19)
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers
  Articulatory differences between speakers are one cause of allophonic variation
    The place and degree of constriction of the tongue hump
    The cross-section and length of the vocal tract
  Therefore the vocal tract formants vary with the speaker

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds
System
  The velum is lowered and the air flows mainly through the nasal cavity
  Because the oral tract is constricted, the sound is radiated at the nostrils
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20)
Spectrogram
  Is dominated by the low resonance of the large volume of the nasal cavity
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue
  These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently nasals have very low energy in the high-frequency range

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced
Source
  For unvoiced fricatives the vocal folds are relaxed and not vibrating
  For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction
  Noise is generated by turbulent airflow at some point of constriction along the oral tract
  The constriction is narrower than with vowels
System
  The location of the constriction, formed by the tongue at the back, center, or front of the oral tract, or at the teeth or lips, determines which sound is produced
Spectrogram
  Noise-like
  Energy is concentrated in the higher frequencies

Example 3.4
A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$
Simplified model of the voiced-fricative output at the lips due to the periodic source
$$x_g[n] = h[n] * (g[n] * p[n])$$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$
The noise source component is the turbulent airflow velocity at the constriction, denoted $q[n]$ and assumed to be white noise. The glottal flow $u[n]$ modulates (multiplies) this noise function $q[n]$, and the product excites the front oral cavity, which has impulse response $h_f[n]$
$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

Example 3.4 (cont.)
In the simplified model we assume that the results of the two airflow sources add
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$
See Exercise 3.10 for special characteristics of $x[n]$
Issues that have been ignored
  $u[n]$ is modified by the oral cavity
  $x_q[n]$ can be influenced by the back cavity
  Sources of nonlinear effects (distributed sources due to traveling vortices)
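The simplified voiced-fricative model of Example 3.4 can be sketched numerically as follows, assuming NumPy. The glottal pulse shape, the two impulse responses, the sampling rate, and the pitch period are all made-up illustrative values; only the structure (periodic source, noise modulated by the glottal flow, and the sum of the two outputs) follows the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

fs = 8000                        # sampling rate in Hz (assumed)
N = 1600                         # 0.2 s of signal
P = 80                           # pitch period in samples (100 Hz)

# Glottal flow over one cycle g[n]: a 30-sample raised-cosine pulse (toy shape).
g = np.hanning(30)
p = np.zeros(N)
p[::P] = 1.0                     # impulse train p[n]
u = np.convolve(p, g)[:N]        # u[n] = g[n] * p[n]

# Toy impulse responses: whole vocal tract h[n] and front cavity h_f[n].
k = np.arange(100)
h = 0.97 ** k * np.cos(2 * np.pi * 500 * k / fs)
kf = np.arange(10)
h_f = 0.7 ** kf * np.cos(2 * np.pi * 3000 * kf / fs)

q = rng.standard_normal(N)       # white noise q[n] at the constriction

x_g = np.convolve(h, u)[:N]          # x_g[n] = h[n] * u[n]
x_q = np.convolve(h_f, q * u)[:N]    # x_q[n] = h_f[n] * (q[n] u[n])
x = x_g + x_q                        # the two source outputs add
```

Because the noise is multiplied by the glottal flow, the noise component x_q appears in bursts synchronized with the glottal pulses and vanishes while the toy glottal flow is zero.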

Elements of a Language: Fricatives
Spectrogram
  Unvoiced fricatives are characterized by a "noisy" spectrum
  Voiced fricatives often show both noise and harmonics
Waveform
  An unvoiced fricative contains only noise
  A voiced fricative contains noise superimposed on a quasi-periodic signal

Elements of a Language: Whisper
  Forms a class of its own under the general category of consonants
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow
As with fricatives, plosives can be voiced or unvoiced
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24)
Sequence of events
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of the air pressure and generation of turbulence over a very short time duration
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst
Voiced plosives
  The vocal folds vibrate for the duration of all four steps
  While the oral tract is closed we hear a low-frequency vibration, due to propagation of the vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar"
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration
  There is a much shorter delay between the burst and the voicing of the vowel onset
Figure 3.26 compares a voiced/unvoiced plosive pair

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$
$h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$
Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator
$$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$
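The generalized convolution of Example 3.5 can be sketched directly as a double loop, assuming NumPy. The function name time_varying_filter, the gliding resonance, the front-cavity response, and all numeric values are hypothetical; the only element taken from the example is the form of the sum over the lag m.

```python
import numpy as np


def time_varying_filter(h, u):
    """x[n] = sum_m h[n, m] u[n - m] for n = 0..N-1.

    h is an (N, M) array: row n is the impulse response in effect at time n,
    column m is the lag. u is treated as zero before n = 0.
    """
    N, M = h.shape
    x = np.zeros(N)
    for n in range(N):
        for m in range(min(M, n + 1)):
            x[n] += h[n, m] * u[n - m]
    return x


fs = 8000                         # sampling rate in Hz (assumed)
N, M, P = 800, 60, 80
lag = np.arange(M)

# Toy time-varying responses: a resonance gliding from 1500 Hz to 700 Hz for
# the whole tract, and a fixed 3000-Hz resonance for the front cavity.
f_res = np.linspace(1500.0, 700.0, N)
h = 0.95 ** lag * np.cos(2 * np.pi * f_res[:, None] * lag / fs)
h_f = np.tile(0.8 ** lag * np.cos(2 * np.pi * 3000 * lag / fs), (N, 1))

p = np.zeros(N)
p[::P] = 1.0
u = np.convolve(p, np.hanning(30))[:N]   # periodic glottal flow u[n]
delta = np.zeros(N)
delta[0] = 1.0                           # burst idealized as an impulse at n = 0

x = time_varying_filter(h, u) + time_varying_filter(h_f, delta)
```

When every row of h is the same, the double loop reduces to ordinary convolution, which is a convenient sanity check; the burst term simply reads out the front-cavity response, since the impulse occurs at n = 0.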
April 22 2023 Veton Keumlpuska 54
Spectrographic Analysis of Speech Speech waveform consists of a sequence of
different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic
signal of the word ldquotordquo cannot capture this time-varying frequency content
In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability
April 22 2023 Veton Keumlpuska 55
Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding
(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to
avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum
Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1
Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal
wherex[n]= w[n]x[n]
represents the windowed speech segments as function of the window center at time
n
njenxX ][)(
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech The spectrogram is graphically displayed as
S() = |X()|2
S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal
For each window position one could plot S() A better and more compact representation of time-frequency
display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page
This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms
Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies
Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  It is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract; the constriction is narrower than with vowels.
System:
  The location of the constriction, formed by the tongue or lips, determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram: noise-like, with energy concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$
In this simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
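The additive model of Example 3.4 can be sketched directly in code. This is a minimal illustration under stated assumptions: the glottal pulse `g`, the impulse responses `h` and `h_f`, the pitch period, and the Gaussian noise generator are all hypothetical toy values, chosen only so that the structure x[n] = h[n]*u[n] + h_f[n]*(q[n]u[n]) is visible.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, h_f, period, length, seed=0):
    """Return (u, q*u, x_g, x_q, x) for the simplified voiced-fricative model."""
    rng = random.Random(seed)
    p = impulse_train(length, period)
    u = convolve(g, p)[:length]                        # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]   # white noise q[n]
    modulated = [q_n * u_n for q_n, u_n in zip(q, u)]  # q[n] u[n]
    x_g = convolve(h, u)[:length]                      # periodic component
    x_q = convolve(h_f, modulated)[:length]            # noise component
    x = [a + b for a, b in zip(x_g, x_q)]              # the two outputs add
    return u, modulated, x_g, x_q, x


# Toy values (assumptions): a 4-sample glottal pulse and short impulse responses.
g = [0.0, 0.5, 1.0, 0.5]
h = [1.0, 0.5, 0.25]
h_f = [1.0, -0.5]
u, modulated, x_g, x_q, x = voiced_fricative(g, h, h_f, period=10, length=40)
```

Because the noise is multiplied by u[n], it is gated by the glottal flow: wherever the glottal flow is zero, the noise source is zero as well.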
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short time
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the (time-invariant) convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n,m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n,m]$.
$h[n,m]$ and $h_f[n,m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n-m$.
Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
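The generalized convolution of Example 3.5 can be evaluated directly. The sketch below is a minimal illustration under stated assumptions: the responses are causal with finite memory, the time-varying vocal tract is a hypothetical one-pole response whose pole drifts during the transition, and the front-cavity response is held fixed. None of these choices come from the slides.

```python
def time_varying_output(h, h_f, u, length, memory):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) delta[n - m].

    h(n, m) and h_f(n, m) give the response at time n to a unit sample
    applied m samples earlier; only 0 <= m < memory is considered.
    """
    x = []
    for n in range(length):
        periodic = sum(h(n, m) * u[n - m]
                       for m in range(memory) if 0 <= n - m < len(u))
        # delta[n - m] is nonzero only for m == n, so the burst term is h_f(n, n).
        burst = h_f(n, n) if n < memory else 0.0
        x.append(periodic + burst)
    return x


def pole(n):
    """Hypothetical pole trajectory: drifts from 0.5 to 0.9 over 20 samples."""
    return 0.5 + 0.4 * min(n / 20.0, 1.0)


def h(n, m):
    """Time-varying vocal tract response (assumed one-pole form)."""
    return pole(n) ** m


def h_f(n, m):
    """Front-cavity response, held time-invariant here for simplicity."""
    return 0.6 ** m


u = [1.0 if n % 8 == 0 else 0.0 for n in range(40)]   # stand-in periodic source
x = time_varying_output(h, h_f, u, length=40, memory=16)
```

When `h(n, m)` does not depend on `n`, the first sum reduces to ordinary convolution, which is a convenient sanity check for the implementation.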
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Have a vowel-like nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Are characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
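The movement from one vowel target to another can be pictured as a smooth formant trajectory. The sketch below is only an illustration: it assumes the targets can be summarized by three formant frequencies, uses a raised-cosine blend as one possible "smooth" transition, and borrows assumed average formant values for an /a/-like start and an /i/-like end (roughly the glide in "hide"). The function name and values are hypothetical.

```python
import math


def formant_trajectory(start_formants, end_formants, num_frames):
    """Raised-cosine glide between two vowel targets (num_frames >= 2)."""
    track = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        blend = 0.5 - 0.5 * math.cos(math.pi * t)   # 0 at the start, 1 at the end
        track.append(tuple((1.0 - blend) * a + blend * b
                           for a, b in zip(start_formants, end_formants)))
    return track


# Assumed, illustrative vowel targets in Hz (not taken from the slides).
start_target = (730.0, 1090.0, 2440.0)
end_target = (270.0, 2290.0, 3010.0)
track = formant_trajectory(start_target, end_target, num_frames=11)
```

Each tuple in `track` is one frame's formant set; the first formant falls while the second rises, so no single steady configuration describes the sound.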
Elements of a Language: Transitional Speech Sounds
Semi-vowels comprise two categories of vowel-like sounds:
  Glides (/w/ as in "we" and /y/ as in "you")
  Liquids (/r/ as in "read" and /l/ as in "let")
Glides involve a greater constriction of the oral tract during the transition, and a greater speed of oral tract movement, compared to diphthongs.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced); /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window, $w[n,\tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window
$$w[n,\tau] = 0.54 - 0.46\cos\left[\frac{2\pi(n-\tau)}{N_w-1}\right], \quad 0 \le n-\tau \le N_w-1$$
The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The Fourier transform of the windowed speech is
$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\,e^{-j\omega n}$$
where $x[n,\tau] = w[n,\tau]\,x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
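The windowing and transform steps above can be sketched as a small short-time Fourier analysis. This is a minimal, unoptimized illustration using only the standard library; the function names, the 8 kHz sampling rate, the 1 kHz test tone, and the frame parameters are all assumptions. The transform is taken over segment-relative indices, which changes only the phase, not the magnitude, relative to a sum over absolute time.

```python
import cmath
import math


def hamming(num_samples):
    """w[n] = 0.54 - 0.46 cos(2 pi n / (Nw - 1)) for 0 <= n <= Nw - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def stft_frame(x, start, window, num_bins):
    """DFT of the windowed segment of x that begins at sample `start`."""
    segment = [window[n] * x[start + n] for n in range(len(window))]
    return [sum(s * cmath.exp(-2j * math.pi * k * n / num_bins)
                for n, s in enumerate(segment))
            for k in range(num_bins)]


def spectrogram(x, window_length, frame_interval, num_bins):
    """Squared STFT magnitude, one list of bins per window position."""
    window = hamming(window_length)
    frames = []
    for start in range(0, len(x) - window_length + 1, frame_interval):
        spectrum = stft_frame(x, start, window, num_bins)
        frames.append([abs(value) ** 2 for value in spectrum])
    return frames


fs = 8000                      # assumed sampling rate in Hz
tone_hz = 1000.0               # assumed test tone
x = [math.sin(2.0 * math.pi * tone_hz * n / fs) for n in range(800)]
S = spectrogram(x, window_length=160, frame_interval=80, num_bins=160)

# Strongest bin in the first frame, restricted to 0..fs/2.
peak_bin = max(range(81), key=lambda k: S[0][k])
peak_hz = peak_bin * fs / 160
```

Here the window advances by 80 samples (10 ms) per frame rather than by one sample, which is the frame-interval idea described above.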
Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega,\tau) = |X(\omega,\tau)|^2$$
$S(\omega,\tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position $\tau$, one could plot $S(\omega,\tau)$. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.
Spectrographic Analysis of Speech
Wide-band Spectrogram
Narrow-band Spectrogram
Spectrographic Analysis of Speech
Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$:
$$x[n,\tau] = w[n,\tau]\,\{(p[n] * g[n]) * h[n]\} = w[n,\tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2$$
where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n,\tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the equation in the previous slide yields the following approximation (Exercise 3.8):
$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k,\tau)|^2$$
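The step from the squared sum to the sum of squares can be made explicit. The following is a short sketch of the reasoning, not a reproduction of the exercise solution: expanding the squared magnitude gives diagonal terms plus cross terms, and the two stated conditions make every cross term negligible.

```latex
\left|\sum_{k} \hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2
  = \sum_{k} |\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k,\tau)|^2
  + \sum_{k}\sum_{l \neq k} \hat{H}(\omega_k)\,\hat{H}^{*}(\omega_l)\,
      W(\omega-\omega_k,\tau)\,W^{*}(\omega-\omega_l,\tau)

% If the main lobes centered at \omega_k and \omega_l (k \neq l) do not overlap
% and the side-lobes are negligible, then at every \omega at most one of the
% two window transforms is non-negligible, so
W(\omega-\omega_k,\tau)\,W^{*}(\omega-\omega_l,\tau) \approx 0 \quad (k \neq l)

% and only the diagonal terms survive:
S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k} |\hat{H}(\omega_k)|^2\,
  |W(\omega-\omega_k,\tau)|^2
```

In words, each harmonic contributes its own copy of the window transform, scaled by the spectral envelope at that harmonic, and the copies do not interact.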
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
  The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega,\tau) \approx \beta\,|\hat{H}(\omega)|^2\,E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
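The contrast between the two window lengths can be checked on a synthetic voiced signal. This is a minimal sketch under stated assumptions: an 8 kHz sampling rate, a 200 Hz pitch (40-sample period), and a toy decaying response standing in for the combined glottal and vocal tract response. With these values, the 20-ms window spans four pitch periods and the 4-ms window spans less than one.

```python
import cmath
import math


def hamming(num_samples):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def windowed_segment(x, start, window_length):
    """Hamming-weighted segment of x beginning at sample `start`."""
    window = hamming(window_length)
    return [window[n] * x[start + n] for n in range(window_length)]


def magnitude_at(segment, freq_hz, fs):
    """Fourier transform magnitude of the segment at a single frequency."""
    omega = 2.0 * math.pi * freq_hz / fs
    return abs(sum(s * cmath.exp(-1j * omega * n) for n, s in enumerate(segment)))


def energy(segment):
    return sum(s * s for s in segment)


fs = 8000
pitch_period = 40          # samples, i.e. a 200 Hz fundamental (assumed)
decay = 0.8                # toy response decay**n, truncated to one period
x = [decay ** (n % pitch_period) for n in range(400)]

# Narrowband: 20-ms window (160 samples) resolves individual harmonics.
narrow = windowed_segment(x, 0, 160)
narrow_ratio = magnitude_at(narrow, 200.0, fs) / magnitude_at(narrow, 300.0, fs)

# Wideband: 4-ms window (32 samples) smooths across harmonics in frequency...
wide_on_pulse = windowed_segment(x, 24, 32)     # centered near the pulse at n = 40
wide_ratio = magnitude_at(wide_on_pulse, 200.0, fs) / magnitude_at(wide_on_pulse, 300.0, fs)

# ...but its energy fluctuates strongly with window position (vertical striations).
wide_off_pulse = windowed_segment(x, 8, 32)     # window ends just before that pulse
energy_ratio = energy(wide_on_pulse) / energy(wide_off_pulse)
```

In this sketch `narrow_ratio` is large (a harmonic at 200 Hz versus the gap at 300 Hz), `wide_ratio` stays close to 1 (no harmonic structure), and `energy_ratio` is large (energy changes within a pitch period).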
Figure 3.15
Figure 3.16
MATLAB
Figure: spectrogram (panel title "SPECTROGRAM"; Time [s] from 0 to 3, Frequency [Hz] from 0 to 8000).
MATLAB
Figure: waveform (Normalized Magnitude vs. Time [s]) above its spectrogram (Frequency [Hz] vs. Time [s]) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels, with h# silence markers at the start and end:
  she: sh iy; had: hv ae dcl jh; your: ax-h; dark: dcl d aa r kcl k; suit: s ux tcl; in: ax-h n; greasy: gcl g r iy s ix; wash: w aa sh; water: w ao dx axr; all: ao l; year: y ih axr
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
       The place of the tongue hump along the oral tract
       The degree of constriction of the hump
       The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 in the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
April 22 2023 Veton Keumlpuska 56
Spectrographic Analysis of Speech The spectrogram is graphically displayed as
S() = |X()|2
S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal
For each window position one could plot S() A better and more compact representation of time-frequency
display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page
This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms
Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies
Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time
April 22 2023 Veton Keumlpuska 57
Spectrographic Analysis of Speech
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

Figure 3.15

Figure 3.16

MATLAB figure: spectrogram, frequency (0 to 8000 Hz) versus time (0 to 3 s).

MATLAB figure: waveform (normalized magnitude versus time in s) and spectrogram (frequency in Hz versus time in s) of the sentence "She had your dark suit in greasy wash water all year", annotated with word and phone labels:
  she: sh iy; had: hv ae dcl jh; your: ax-h; dark: dcl d aa r kcl k; suit: s ux tcl; in: ax-h n; greasy: gcl g r iy s ix; wash: w aa sh; water: w ao dx axr; all: ao l; year: y ih axr

Categorization of Speech Sounds
  A sound source can be created with either the vocal folds or a constriction in the vocal tract.
  Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
       the place of the tongue hump along the oral tract,
       the degree of constriction of the hump, and
       a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
  Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
  If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
  If the last two descriptors (waveform and spectrogram) are used, the study is referred to as acoustic phonetics.

Elements of a Language
  One broad classification of English speech sounds is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
  This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
  Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
  Articulatory features (corresponding to the first two category descriptors) include:
    Vocal fold state: vibrating or open
    Tongue position and height: front, central, or back along the palate
    Constriction: partial or complete
    Velum state: nasal sound or not a nasal sound

Elements of a Language
  In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
  The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
  A phoneme is not strictly defined by a precise adjustment of the articulators (consider dialects and accents).
  The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
  The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
  Motor theory of perception: uses articulatory features from the speech waveform, and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
  Source: quasi-periodic. Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
  System: each vowel phoneme corresponds to a different vocal tract configuration.
  Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
  In spite of the specific properties of different vowels, there is much variability in vowel characteristics among speakers.
    Articulatory differences among speakers are one cause of allophonic variations.
    The place and degree of constriction of the tongue hump and the cross-section and length of the vocal tract differ, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
  Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  System:
    The velum is lowered and the air flows mainly through the nasal cavity.
    Because the oral tract is constricted, the sound is radiated at the nostrils.
    Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  Spectrogram:
    It is dominated by the low resonance of the large volume of the nasal cavity.
    The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
    Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
  There are two broad classes of fricatives: voiced and unvoiced.
  Source:
    For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
    For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
    Noise is generated by turbulent airflow at some point of constriction along the oral tract.
    The constriction is narrower than with vowels.
  System: the location of the constriction made by the tongue (at the back, center, or front of the oral tract), or by the teeth or lips, determines which sound is produced.
  Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4
  A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  $$u[n] = g[n] * p[n]$$
  where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
  In a simplified model of the voiced fricative, the periodic contribution to the output at the lips is
  $$x_g[n] = h[n] * (g[n] * p[n])$$
  where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].
  The noise source component, a turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], and the modulated noise in turn excites the front oral cavity, which has impulse response h_f[n]:
  $$x_q[n] = h_f[n] * (q[n]\, u[n])$$
  In the simplified model we assume that the results of the two airflow sources add:
  $$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$
  See Exercise 3.10 for special characteristics of x[n].
  Issues that have been ignored:
    u[n] is modified by the oral cavity.
    x_q[n] can be influenced by the back cavity.
    Sources of non-linear effects (distributed sources due to traveling vortices).
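The additive model of Example 3.4 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the author's implementation: the glottal pulse `g`, the impulse responses `h` and `hf`, the pitch period, and the helper names are all hypothetical values chosen only to make the structure x[n] = h[n] * u[n] + h_f[n] * (q[n] u[n]) concrete.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, hf, period, length, seed=0):
    """Return (x, xg, xq, u) for the simplified two-source model."""
    rng = random.Random(seed)
    p = impulse_train(length, period)
    u = convolve(g, p)[:length]                        # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]   # white noise source q[n]
    xg = convolve(h, u)[:length]                       # periodic component
    modulated = [qn * un for qn, un in zip(q, u)]      # glottal flow modulates the noise
    xq = convolve(hf, modulated)[:length]              # noise component via front cavity
    x = [a + b for a, b in zip(xg, xq)]                # the two outputs add
    return x, xg, xq, u


# Hypothetical parameters, for illustration only.
g = [0.0, 0.5, 1.0, 0.5]      # one glottal flow cycle
h = [1.0, 0.6, 0.36]          # vocal tract impulse response
hf = [1.0, -0.5]              # front-cavity impulse response
x, xg, xq, u = voiced_fricative(g, h, hf, period=8, length=64)
print([round(v, 3) for v in x[:8]])
```

Because the noise is multiplied by u[n], the noise component vanishes wherever the glottal flow is zero, so the noise appears in bursts synchronized with the pitch period.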

Elements of a Language: Fricatives
  Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  Whisper forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
  Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
  System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
  With voiced plosives, the vocal folds vibrate for the duration of all four steps.
    During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
    After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
    There is a much shorter delay between the burst and the voicing of the vowel onset.
  Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
  A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by
  $$u[n] = g[n] * p[n]$$
  where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
  Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
  In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response h_f[n, m].
  h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
  The output can then be written using a generalization of the convolution operator:
  $$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$
  We have assumed that the two outputs can be linearly combined.
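The generalized convolution of Example 3.5 can be sketched directly as a double loop. The code below is a minimal illustration under stated assumptions, not the author's model: the responses `h` and `hf` are hypothetical (a one-pole response whose pole drifts, and a decaying front-cavity response), the system is taken as causal, and the sum over m is truncated at `max_lag` on the assumption that the responses are negligible beyond it.

```python
def time_varying_output(h, source, max_lag):
    """x[n] = sum over m of h(n, m) * source[n - m], for m = 0 .. max_lag (causal)."""
    out = []
    for n in range(len(source)):
        acc = 0.0
        for m in range(min(n, max_lag) + 1):
            acc += h(n, m) * source[n - m]
        out.append(acc)
    return out


def h(n, m):
    """Hypothetical vocal tract response at time n to a unit sample applied m samples earlier."""
    a = 0.5 + 0.3 * min(n, 40) / 40.0   # pole drifts from 0.5 to 0.8 (assumed transition)
    return a ** m


def hf(n, m):
    """Hypothetical front-cavity response, fading as the constriction opens."""
    return (0.95 ** n) * (-0.6) ** m


LENGTH, P, MAX_LAG = 64, 8, 32
g = [0.0, 0.5, 1.0, 0.5]                              # assumed glottal flow over one cycle
u = [g[n % P] if n % P < len(g) else 0.0 for n in range(LENGTH)]   # u[n] = g[n] * p[n]
burst = [1.0] + [0.0] * (LENGTH - 1)                  # idealized burst: impulse at n = 0

x_periodic = time_varying_output(h, u, MAX_LAG)       # first sum
x_burst = time_varying_output(hf, burst, MAX_LAG)     # second sum
x = [a + b for a, b in zip(x_periodic, x_burst)]      # outputs combined linearly
print([round(v, 3) for v in x[:8]])
```

For the impulse source the second sum collapses to a single term, h_f[n, n], and when `h(n, m)` does not depend on n the function reduces to ordinary convolution.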

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples (word, phoneme symbol): hide, Y; out, W; boy, O; new, JU.

Elements of a Language: Transitional Speech Sounds
Semi-Vowels
  Two categories of vowel-like sounds:
    Glides (w as in "we" and y as in "you")
    Liquids (r as in "read" and l as in "let")
  Glides, compared to diphthongs, have a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
  Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  Examples: tS as in "chew" (unvoiced); J as in "just" (voiced).

Coarticulation
  Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
    Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
    Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
  The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
    Intonation (change in pitch)
    Amplitude/energy (loudness)
    Timing (articulation rate or rhythm)
  These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
  Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n - kP]. The windowed waveform is
  $$x[n] = w[n]\,\{(p[n] * g[n]) * h[n]\} = w[n]\,(p[n] * \hat{h}[n])$$
  where the glottal waveform over one cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n].
  From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
  $$S(\omega) = \frac{1}{P^2} \left| \sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k) \right|^2$$
  where
  $$\hat{H}(\omega) = H(\omega)\, G(\omega), \qquad \omega_k = \frac{2\pi}{P} k,$$
  and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech
  The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window w[n].
  Narrowband Spectrogram
    Uses a "long" window with a duration of typically at least two pitch periods.
    Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide yields the following approximation (Exercise 3.8):
    $$S(\omega) \approx \frac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k)|^2$$
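The narrowband approximation predicts isolated peaks at the harmonics ω_k = 2πk/P, each shaped like the window transform. The sketch below checks this numerically on a synthetic signal. It is an illustration under stated assumptions, not the author's code: the periodic test signal, the pitch period of 16 samples, the 128-sample Hamming window (eight pitch periods), and the direct DFT routine `power_spectrum` are all hypothetical choices.

```python
import cmath
import math


def hamming(length):
    """Symmetric Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def power_spectrum(segment, nfft):
    """|DFT|^2 of a zero-padded segment, for bins 0 .. nfft // 2 (direct evaluation)."""
    spec = []
    for k in range(nfft // 2 + 1):
        acc = 0j
        for n, v in enumerate(segment):
            acc += v * cmath.exp(-2j * math.pi * k * n / nfft)
        spec.append(abs(acc) ** 2)
    return spec


P = 16            # assumed pitch period in samples
NFFT = 256        # DFT size: harmonics fall on multiples of NFFT // P = 16 bins
WIN = 8 * P       # "long" window covering eight pitch periods

x = [0.8 ** (n % P) for n in range(WIN)]              # synthetic periodic waveform
segment = [w * v for w, v in zip(hamming(WIN), x)]    # windowed waveform w[n] x[n]
S = power_spectrum(segment, NFFT)

harmonic_bins = [k * NFFT // P for k in range(1, P // 2)]
half_spacing = NFFT // (2 * P)                        # bin offset midway between harmonics
for b in harmonic_bins:
    print(f"bin {b:3d}: S = {S[b]:10.3f}   midway to next harmonic: {S[b + half_spacing]:.3e}")
```

With this window the main lobes centered on neighboring harmonics do not overlap, so the spectrum between harmonics drops to the side-lobe level and the harmonic lines are resolved.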
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept
introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]
h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m
The output then can be written using generalization of the convolution operator
We have assumed that two outputs can be linearly combined
mnmnhmnumnhnxm
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language Transitional Speech Sounds Diphthongs
Vowel like nature with vibrating vocal folds
Do not have a steady vocal tract configuration They are produced by
varying in time the vocal tract smoothly between two vowel configurations
Characterized by movement from one vowel target to another
hide Y out W boy O new JU
April 22 2023 Veton Keumlpuska 96
Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like
sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)
Glides Greater constriction of oral tract during the
transition and Greater speed of the oral tract movement
compared to diphthongs
April 22 2023 Veton Keumlpuska 97
Figure 328 ndash Liquids amp Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language Transitional Speech Sounds Affricates are the counterpart of
diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that
the affricates have A fricative portion preceded by a complete
constriction of the oral cavity Formed at the same place as for the plosive
Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced
April 22 2023 Veton Keumlpuska 99
Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a
target state or shape often the target is never reached Our speech anatomy cannot move to a desired position
instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and
graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level
Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo
Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes
April 22 2023 Veton Keumlpuska 100
Prosody The Melody of Speech Prosody of a language is defined by the
rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)
These rules are followed to convey different Meaning Stress and Emotion
April 22 2023 Veton Keumlpuska 101
Figure 329 - Prosody
April 22 2023 Veton Keumlpuska 102
Figure 330 ndash Global Coarticulation
April 22 2023 Veton Keumlpuska 58
Wide-band Spectrogram
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram

Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$. With * denoting convolution, the windowed waveform is

$$x[n] = w[n]\,\{(p[n] * g[n]) * h[n]\} = w[n]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k) \right|^2$$

where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the analysis window w[n].

Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
  the main lobes of the shifted window Fourier transforms are non-overlapping, and
  the corresponding transform side-lobes are negligible,
the spectrogram equation from the previous slide reduces to the following approximation (Exercise 3.8):

$$S(\omega) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \left| \hat{H}(\omega_k) \right|^2 \left| W(\omega - \omega_k) \right|^2$$
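
The step from the full spectrogram expression to the narrowband approximation can be made explicit. What follows is only an outline using the symbols already defined (S, P, the combined response Ĥ, the window transform W, and the harmonic frequencies ω_k); the shorthand A_k is introduced here for convenience, and this is not the worked solution to Exercise 3.8.

```latex
\begin{aligned}
S(\omega) &= \frac{1}{P^2}\left|\sum_{k} A_k(\omega)\right|^2,
\qquad A_k(\omega) = \hat{H}(\omega_k)\,W(\omega-\omega_k) \\
&= \frac{1}{P^2}\left[\sum_{k}\left|A_k(\omega)\right|^2
 + \sum_{k}\sum_{l\neq k} A_k(\omega)\,A_l^{*}(\omega)\right] \\
&\approx \frac{1}{P^2}\sum_{k}\left|\hat{H}(\omega_k)\right|^2
 \left|W(\omega-\omega_k)\right|^2
\end{aligned}
```

Each cross term contains the product W(ω - ω_k) W*(ω - ω_l) with k ≠ l. When the main lobes do not overlap and the side-lobes are negligible, at least one of the two factors is close to zero at every frequency, so the double sum drops out.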

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
A long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.
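
The contrast between a long and a short analysis window can be checked numerically. The sketch below is an illustration, not part of the original slides. It assumes a bare impulse train p[n] with pitch period P = 16 samples (so the combined response is a unit sample and the spectral envelope is flat), a 64-point DFT, and hand-picked frame positions. The helper names hamming and windowed_dft_magnitude are made up for this example.

```python
import cmath
import math


def hamming(length):
    """Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def windowed_dft_magnitude(signal, start, length, n_fft):
    """Magnitude of the n_fft-point DFT of one Hamming-windowed frame."""
    window = hamming(length)
    frame = [window[n] * signal[start + n] for n in range(length)]
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(length)))
            for k in range(n_fft)]


P = 16                                                    # assumed pitch period
p = [1.0 if n % P == 0 else 0.0 for n in range(8 * P)]    # impulse train p[n]

# Long window: four pitch periods. Harmonics fall on DFT bins 0, 4, 8, ...
narrow = windowed_dft_magnitude(p, start=0, length=4 * P, n_fft=64)
# Short window: half a pitch period, containing only the impulse at n = 16.
wide = windowed_dft_magnitude(p, start=12, length=8, n_fft=64)

print("long window:  harmonic bin 4 =", round(narrow[4], 3),
      "| midway bin 2 =", round(narrow[2], 3))
print("short window: max/min magnitude over all bins =",
      round(max(wide) / min(wide), 3))
```

With the long window the magnitude at a harmonic bin is far larger than midway between harmonics, which is the resolved harmonic line structure. With the short window the frame holds a single pulse, so the magnitude is the same in every bin and only the (here flat) envelope remains.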
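
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, n) \approx \beta \left| \hat{H}(\omega) \right|^2 E[n]$$

where the window position n is now shown explicitly, $\beta$ is a constant scale factor, and E[n] is the energy in the waveform under the sliding window:

$$E[n] = \sum_{m} x^2[m]$$

with the sum taken over the samples m that fall under the window positioned at time n.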
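
The energy term E[n] is what produces the pitch-rate fluctuation in a wideband spectrogram. As a rough sketch (not from the slides), the code below builds a toy periodic waveform, a decaying exponential restarted every P samples, and computes the energy under a short sliding window. The pitch period (20 samples), the decay factor (0.7), the window length (5 samples), and the function name sliding_energy are all assumptions chosen for illustration.

```python
def sliding_energy(x, window_length):
    """E[n] = sum of x[m]**2 over the window_length samples starting at n."""
    return [sum(v * v for v in x[n:n + window_length])
            for n in range(len(x) - window_length + 1)]


P = 20            # assumed pitch period in samples
decay = 0.7       # assumed per-sample decay of the toy response
num_samples = 6 * P

# Periodic waveform: a decaying exponential restarted (and summed) every P samples.
x = [sum(decay ** (n - k * P) for k in range(n // P + 1))
     for n in range(num_samples)]

# Window shorter than one pitch period, as in the wideband case.
E = sliding_energy(x, window_length=5)

# Local maxima of the window energy (interior points only).
peaks = [n for n in range(1, len(E) - 1) if E[n - 1] < E[n] >= E[n + 1]]
print("energy peaks at n =", peaks)
```

The energy peaks are spaced one pitch period apart, which corresponds to the vertical striations seen when the short window slides through high- and low-energy regions of each pitch cycle.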

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations seen in the narrowband spectrogram.
Vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
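
The two window durations can be turned into approximate frequency resolutions. The sketch below relies on the standard approximation that an N-sample Hamming window has a null-to-null main-lobe width of about 8π/N radians per sample, that is, about 4/T hertz for a window of T seconds. The function name is hypothetical and the figures are estimates, not values read from Figure 3.15.

```python
def hamming_mainlobe_width_hz(window_duration_s):
    """Approximate null-to-null main-lobe width of a Hamming window in Hz.

    An N-sample Hamming window has a main lobe about 8*pi/N radians per
    sample wide, which corresponds to 4 / T hertz for a duration of T seconds.
    """
    return 4.0 / window_duration_s


for name, duration_s in (("narrowband", 0.020), ("wideband", 0.004)):
    width_hz = hamming_mainlobe_width_hz(duration_s)
    # Main lobes centred on adjacent harmonics are disjoint only when the
    # harmonic spacing (the pitch) is at least one full main-lobe width.
    print(f"{name}: {duration_s * 1e3:.0f} ms window, main lobe about "
          f"{width_hz:.0f} Hz wide, lobes disjoint for pitch >= {width_hz:.0f} Hz")
```

By this estimate the 4-ms window has a main lobe of about 1000 Hz, far wider than the harmonic spacing at any ordinary speaking pitch, so neighboring lobes overlap heavily. The 20-ms window has a main lobe of about 200 Hz, which is comparable to the harmonic spacing.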

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram; time axis 0 to 3 s, frequency axis 0 to 8000 Hz]

MATLAB
[Figure: normalized-magnitude waveform (time axis 0 to 3 s) above its spectrogram (frequency axis 0 to 8000 Hz) for the utterance "She had your dark suit in greasy wash water all year", with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr)]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
       the place of the tongue hump along the oral tract,
       the degree of constriction of the hump, and
       the possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
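
One way to make this feature list concrete is as a small record type. The sketch below is illustrative only: the class and field names are invented for this example, tongue height is left out for brevity, and the sample entry is a hypothetical nasal-like combination (vibrating folds, complete oral constriction, lowered velum) with an assumed front tongue position. It is not a feature table taken from the source.

```python
from dataclasses import dataclass
from enum import Enum


class VocalFolds(Enum):
    VIBRATING = "vibrating"
    OPEN = "open"


class TonguePosition(Enum):
    FRONT = "front"
    CENTRAL = "central"
    BACK = "back"


class Constriction(Enum):
    PARTIAL = "partial"
    COMPLETE = "complete"


@dataclass(frozen=True)
class ArticulatoryFeatures:
    """Coarse articulatory description of one speech sound."""
    vocal_folds: VocalFolds
    tongue_position: TonguePosition
    constriction: Constriction
    nasal: bool          # True when the velum is lowered


# Hypothetical entry, for illustration only.
example = ArticulatoryFeatures(
    vocal_folds=VocalFolds.VIBRATING,
    tongue_position=TonguePosition.FRONT,
    constriction=Constriction.COMPLETE,
    nasal=True,
)
print(example)
```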

Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are generally not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
Source: quasi-periodic. Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variations:
  the place and degree of constriction of the tongue hump, and
  the cross-section and length of the vocal tract
differ from speaker to speaker, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System:
  The location of the constriction formed by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram:
  Noise-like.
  Energy is concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. With * denoting convolution, the periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

The simplified model assumes that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  u[n] is modified by the oral cavity.
  x_q[n] can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
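
The additive model of Example 3.4 can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not a realistic synthesizer: the short sequences used for g[n], h[n], and h_f[n] are invented, the pitch period is 8 samples, and q[n] is drawn as white Gaussian noise from a seeded generator. The function names convolve and voiced_fricative are made up for this example.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def voiced_fricative(g, h, h_f, pitch_period, num_samples,
                     noise_gain=1.0, seed=0):
    """x[n] = h*u + h_f*(q u), with u = g*p and q white Gaussian noise."""
    rng = random.Random(seed)
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    u = convolve(g, p)[:num_samples]                     # periodic glottal flow
    q = [noise_gain * rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    modulated = [qn * un for qn, un in zip(q, u)]         # u[n] modulates q[n]
    x_g = convolve(h, u)[:num_samples]                    # periodic component
    x_q = convolve(h_f, modulated)[:num_samples]          # noise component
    return [a + b for a, b in zip(x_g, x_q)]


# Toy sequences chosen only for illustration (not measured responses).
g = [0.0, 0.5, 1.0, 0.5]     # glottal flow over one cycle
h = [1.0, 0.6, 0.2]          # vocal tract impulse response
h_f = [1.0, -0.5]            # front-cavity impulse response

x = voiced_fricative(g, h, h_f, pitch_period=8, num_samples=32)
print([round(v, 3) for v in x[:8]])
```

Because the noise is multiplied by u[n], it appears only while the glottal flow is non-zero, so the noise bursts are synchronized with the pitch period. Setting noise_gain to zero leaves only the periodic component h[n] * u[n].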

Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. With * denoting convolution, the glottal flow velocity model for the periodic source component is

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, owing to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the ordinary convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction, with impulse response $h_f[n, m]$. Here h[n, m] and $h_f[n, m]$ represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

where we have assumed that the two outputs can be linearly combined.
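
The generalized convolution of Example 3.5 can be sketched as follows. This is a minimal illustration under assumptions, not the textbook's implementation: the two responses are passed as functions of (n, m), the sums are truncated to a finite number of lags, and the toy responses (a single decaying term whose decay drifts over the first 16 samples, and a fixed front-cavity response) are invented for the example.

```python
def time_varying_output(h, h_f, u, num_samples, max_lag):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) delta[n - m].

    h(n, m) and h_f(n, m) are callables giving the response at time n to a
    unit sample applied m samples earlier; lags 0..max_lag - 1 are summed.
    """
    x = []
    for n in range(num_samples):
        periodic = sum(h(n, m) * u[n - m]
                       for m in range(max_lag) if 0 <= n - m < len(u))
        # delta[n - m] is 1 only for m == n, so the burst term is h_f(n, n).
        burst = h_f(n, n) if n < max_lag else 0.0
        x.append(periodic + burst)
    return x


P, N = 8, 32                          # assumed pitch period and signal length
g = [0.0, 0.5, 1.0, 0.5]              # toy glottal flow over one cycle
u = [0.0] * N                         # u[n] = g[n] * p[n], built directly
for start in range(0, N, P):
    for i, gv in enumerate(g):
        if start + i < N:
            u[start + i] += gv


def h(n, m):
    """Toy vocal tract response whose decay drifts as the tract opens."""
    pole = 0.5 + 0.3 * min(n, 16) / 16.0
    return pole ** m


def h_f(n, m):
    """Toy front-cavity response (kept time-invariant for simplicity)."""
    return (-0.6) ** m


x = time_varying_output(h, h_f, u, N, max_lag=12)
print([round(v, 3) for v in x[:8]])
```

When h(n, m) does not depend on n, the first sum reduces to the ordinary convolution of the impulse response with u[n], which is a convenient sanity check on the generalized form.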

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (/w/ as in "we" and /y/ as in "you")
  Liquids (/r/ as in "read" and /l/ as in "let")
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).
April 22 2023 Veton Keumlpuska 59
Narrow-band Spectrogram
April 22 2023 Veton Keumlpuska 60
Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the
output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]
x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]
Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as
frequency fundametal theis2 and 2 whereand
)()()(~where
)()(~1)(2
2
Pk
P
GHH
WHP
S
k
kk
k
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept
introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]
h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m
The output then can be written using generalization of the convolution operator
We have assumed that two outputs can be linearly combined
mnmnhmnumnhnxm
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language Transitional Speech Sounds Diphthongs
Vowel like nature with vibrating vocal folds
Do not have a steady vocal tract configuration They are produced by
varying in time the vocal tract smoothly between two vowel configurations
Characterized by movement from one vowel target to another
hide Y out W boy O new JU
April 22 2023 Veton Keumlpuska 96
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (w as in “we” and y as in “you”)
  Liquids (r as in “read” and l as in “let”)
Glides, compared to diphthongs, have
  Greater constriction of the oral tract during the transition
  Greater speed of the oral tract movement
Figure 3.28: Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
  tS as in “chew” (unvoiced)
  J as in “just” (voiced)
Coarticulation
Vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time (“horse” vs. “horseshoe”, “sweep” vs. “seep”).
  Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29: Prosody
Figure 3.30: Global Coarticulation
Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:
$x[n] = w[n]\,\{(p[n] * g[n]) * h[n]\} = w[n]\,(p[n] * \hat{h}[n])$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
$S(\omega) = \frac{1}{P^2} \left| \sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k) \right|^2$
where $\hat{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency.
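The harmonic-sum expression can be checked numerically on a toy signal. The sketch below compares the squared magnitude of the directly computed transform of the windowed periodic waveform with the harmonic-sum form. The pitch period, window length, and the sequence `h_hat` are made-up values, and the check assumes the combined response is no longer than one pitch period so that successive copies do not overlap.

```python
import cmath
import math

P = 8                                              # pitch period in samples (toy value)
h_hat = [1.0, 0.6, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0]   # toy g*h, padded to length P
NW = 40                                            # window length: five pitch periods
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (NW - 1)) for n in range(NW)]  # Hamming

def dtft(seq, omega):
    """Discrete-time Fourier transform of a finite sequence starting at n = 0."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(seq))

def direct_spectrum(omega):
    """Transform of x[n] = w[n] * (periodic repetition of h_hat)."""
    x = [w[n] * h_hat[n % P] for n in range(NW)]
    return dtft(x, omega)

def harmonic_sum(omega):
    """(1/P) * sum_k H_hat(w_k) W(omega - w_k), with w_k = 2*pi*k/P, k = 0..P-1."""
    total = 0.0
    for k in range(P):
        w_k = 2 * math.pi * k / P
        total += dtft(h_hat, w_k) * dtft(w, omega - w_k)
    return total / P

omega = 0.9                                        # arbitrary test frequency in rad/sample
S_direct = abs(direct_spectrum(omega)) ** 2
S_model = abs(harmonic_sum(omega)) ** 2
```

In discrete time the sum over k covers one period of the spectrum (k = 0 to P-1), because the window transform W is itself periodic in frequency.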
Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the analysis window w[n].
Narrowband Spectrogram
  Uses a “long” window with a duration of typically at least two pitch periods.
  Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):
  $S(\omega) \approx \frac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k)|^2$
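The step from the squared magnitude of a sum to a sum of squared magnitudes can be spelled out as follows. This is a sketch of the reasoning under the two stated conditions, not a full solution of the exercise; the shorthand $a_k(\omega)$ is introduced here only for illustration.

```latex
\begin{aligned}
a_k(\omega) &= \hat{H}(\omega_k)\, W(\omega - \omega_k) \\
\Bigl|\sum_k a_k(\omega)\Bigr|^2
  &= \sum_k |a_k(\omega)|^2 + \sum_{k \neq l} a_k(\omega)\, a_l^{*}(\omega) \\
a_k(\omega)\, a_l^{*}(\omega) &\approx 0 \quad (k \neq l)
  \quad \text{when the shifted window transforms do not overlap} \\
S(\omega) &\approx \frac{1}{P^2} \sum_k |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k)|^2
\end{aligned}
```

The cross terms vanish because, at any frequency, at most one shifted main lobe is non-negligible, so every product of two different terms contains a factor close to zero.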
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
  The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms centered on neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
  From the temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence ĥ[n].
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$S(\omega, n) \approx \beta\, |\hat{H}(\omega)|^2\, E[n]$
where β is a constant scale factor and E[n] is the energy in the waveform under the sliding window at time n:
$E[n] = \sum_{m} |x[m]|^2$, with the sum taken over the windowed segment at time n.
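To illustrate why E[n] fluctuates, the sketch below slides a short window over a toy periodic waveform and computes the energy under it. The decaying response, pitch period, and window length are made-up values, and weighting the samples by a Hamming window before squaring is one possible reading of "energy under the sliding window".

```python
import math

P = 40                                   # pitch period in samples (toy value)
h_hat = [0.8 ** n for n in range(P)]     # toy decaying response, one copy per period
L = 10                                   # short window: shorter than one pitch period
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)) for n in range(L)]  # Hamming

def x(n):
    """Steady-state voiced waveform: h_hat repeated every P samples."""
    return h_hat[n % P]

def energy(n):
    """Energy of the waveform under the window whose last sample sits at time n."""
    return sum((w[m] * x(n - m)) ** 2 for m in range(L))

# Evaluate over three pitch periods, starting once the window is fully inside the signal.
E = [energy(n) for n in range(L, L + 3 * P)]

fluctuation = max(E) / min(E)            # large: the window alternately sees pulse onsets and tails
```

Because the waveform is periodic, `E` repeats every `P` samples, which is the source of the vertical striations at the pitch rate in a wideband spectrogram.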
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the harmonic horizontal striations of the narrowband spectrogram.
  Vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
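The contrast between the two window lengths can be reproduced on a synthetic voiced signal. The sketch below uses the 20-ms and 4-ms Hamming windows mentioned above; the 8 kHz sampling rate, the 200 Hz pitch, and the short response `h_hat` are assumptions chosen so that the short window is shorter than one pitch period.

```python
import cmath
import math

FS = 8000                     # sampling rate in Hz (assumed)
P = 40                        # pitch period in samples, i.e. a 200 Hz fundamental
h_hat = [1.0, 0.5, 0.25]      # toy combined glottal / vocal tract response

def voiced(n):
    """Periodic repetition of h_hat every P samples."""
    i = n % P
    return h_hat[i] if i < len(h_hat) else 0.0

def hamming(length):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def stft_mag(start, win_ms, freq_hz):
    """Magnitude of the short-time spectrum of one frame at one frequency."""
    length = int(FS * win_ms / 1000)
    w = hamming(length)
    omega = 2 * math.pi * freq_hz / FS
    acc = sum(w[n] * voiced(start + n) * cmath.exp(-1j * omega * n) for n in range(length))
    return abs(acc)

def harmonic_contrast(start, win_ms):
    """Ratio of the spectrum on a harmonic (600 Hz) to midway between harmonics (700 Hz)."""
    return stft_mag(start, win_ms, 600.0) / stft_mag(start, win_ms, 700.0)

narrow = harmonic_contrast(0, 20)    # 160 samples: four pitch periods
wide = harmonic_contrast(24, 4)      # 32 samples: less than one pitch period
```

With the long window the spectrum drops sharply between harmonics, so `narrow` is large and the harmonic lines are resolved. With the short window `wide` stays close to 1, which corresponds to the smooth spectral envelope seen in the wideband spectrogram.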
Figure 3.15
Figure 3.16
MATLAB figures: spectrogram (Frequency [Hz], 0 to 8000, versus Time [s], 0 to 3) and normalized-magnitude waveform of the utterance “She had your dark suit in greasy wash water all year”, annotated with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr).
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by the possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the “click” language of Khoisan.
Rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are normally not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by:
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking
  The time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: “butter”, “but”, and “to”, where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, vary with the speaker; therefore the vocal tract formants vary as well.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System:
  The location of the constriction, made by the tongue at the back, center, or front of the oral tract, or at the teeth or lips, determines which sound is produced.
Spectrogram:
  Noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$u[n] = g[n] * p[n]$
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
$x_g[n] = h[n] * (g[n] * p[n])$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component is the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
$x_q[n] = h_f[n] * (q[n]\, u[n])$
We assume in the simplified model that the results of the two airflow sources add:
$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  x_q[n] can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
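The model of Example 3.4 can be sketched in a few lines of Python. The pulse shape `g`, the impulse responses `h` and `h_f`, the pitch period, and the random seed are illustrative assumptions; only the structure, a periodic branch plus glottal-flow-modulated noise through the front cavity, follows the example.

```python
import random

def convolve(a, b, length):
    """Linear convolution of two sequences, truncated to `length` samples."""
    out = [0.0] * length
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < length:
                out[i + j] += ai * bj
    return out

N, P = 160, 40                             # signal length and pitch period (assumed)
g = [0.2, 0.6, 1.0, 0.6, 0.2]              # toy glottal flow over one cycle
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]
u = convolve(g, p, N)                      # u[n] = g[n] * p[n]

h = [1.0, 0.7, 0.4, 0.2]                   # toy vocal tract impulse response
h_f = [1.0, -0.5]                          # toy front-cavity impulse response

rng = random.Random(0)                     # fixed seed so the sketch is repeatable
q = [rng.gauss(0.0, 1.0) for _ in range(N)]    # white noise at the constriction

modulated = [qn * un for qn, un in zip(q, u)]  # glottal flow modulates the noise
x_g = convolve(h, u, N)                    # periodic component at the lips
x_q = convolve(h_f, modulated, N)          # noise component at the lips
x = [a + b for a, b in zip(x_g, x_q)]      # the two airflow sources add
```

Because the noise is multiplied by u[n], it is present only while the glottis is open, so the noise in x[n] arrives in bursts at the pitch rate rather than continuously.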
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a “noisy” spectrum.
  Voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24: Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (Waveform)
Elements of a Language: Plosives (Spectrogram)
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept
introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]
h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m
The output then can be written using generalization of the convolution operator
We have assumed that two outputs can be linearly combined
mnmnhmnumnhnxm
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language Transitional Speech Sounds Diphthongs
Vowel like nature with vibrating vocal folds
Do not have a steady vocal tract configuration They are produced by
varying in time the vocal tract smoothly between two vowel configurations
Characterized by movement from one vowel target to another
hide Y out W boy O new JU
April 22 2023 Veton Keumlpuska 96
Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like
sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)
Glides Greater constriction of oral tract during the
transition and Greater speed of the oral tract movement
compared to diphthongs
April 22 2023 Veton Keumlpuska 97
Figure 328 ndash Liquids amp Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language Transitional Speech Sounds Affricates are the counterpart of
diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that
the affricates have A fricative portion preceded by a complete
constriction of the oral cavity Formed at the same place as for the plosive
Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced
April 22 2023 Veton Keumlpuska 99
Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a
target state or shape often the target is never reached Our speech anatomy cannot move to a desired position
instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and
graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level
Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo
Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes
April 22 2023 Veton Keumlpuska 100
Prosody The Melody of Speech Prosody of a language is defined by the
rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)
These rules are followed to convey different Meaning Stress and Emotion
April 22 2023 Veton Keumlpuska 101
Figure 329 - Prosody
April 22 2023 Veton Keumlpuska 102
Figure 330 ndash Global Coarticulation
April 22 2023 Veton Keumlpuska 61
Spectrographic Analysis of Speech Difference of narrowband and wideband
spectrogram is in the length of the (analysis) window w[n]
Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at
least two pitch periods Under the conditions that
The main lobes of shifted window Fourier transforms are non-overlapping and that
Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)
k
kk WHP
S 22
2 )()(~1)(
April 22 2023 Veton Keumlpuska 62
Spectrographic Analysis of Speech Narrowband Spectrogram (cont)
Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram
Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)
April 22 2023 Veton Keumlpuska 63
Spectrographic Analysis of SpeechWideband Spectrogram
Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform
(recall the uncertainty principle) Widening of Fourier transform will cause neighboring
harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions
From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]
April 22 2023 Veton Keumlpuska 64
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)
Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window
][)(~)(2
EHS k
n
nxE 2][][
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the (time-invariant) convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m].
h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n-m]$$
Here we have assumed that the two outputs can be linearly combined.
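The time-varying superposition sum of Example 3.5 can be sketched numerically. The following is a minimal illustration, not the textbook's implementation: the helper names (`time_varying_output`, `gliding_resonance`), the truncation of the responses at `max_lag` samples, the resonance frequencies, and the use of a bare impulse train for u[n] (that is, g[n] = δ[n]) are all assumptions made for this sketch.

```python
import math

def time_varying_output(h, hf, u, max_lag):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) delta[n-m].

    h(n, m) and hf(n, m) are callables giving the response at time n to a
    unit sample applied m samples earlier. Responses are assumed to be
    negligible beyond max_lag samples.
    """
    x = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m] for m in range(min(n, max_lag) + 1))
        # delta[n-m] is nonzero only for m = n, so the burst sum is hf(n, n).
        burst = hf(n, n) if n <= max_lag else 0.0
        x.append(periodic + burst)
    return x

def gliding_resonance(start_hz, end_hz, glide_samples, fs=8000.0, decay=0.97):
    """Hypothetical time-varying response: a decaying cosine whose frequency
    moves from start_hz to end_hz over glide_samples samples of output time."""
    def response(n, m):
        frac = min(n / glide_samples, 1.0)
        freq = start_hz + frac * (end_hz - start_hz)
        return (decay ** m) * math.cos(2.0 * math.pi * freq * m / fs)
    return response

P = 80                                                  # pitch period in samples
u = [1.0 if n % P == 0 else 0.0 for n in range(400)]    # idealized periodic source
h = gliding_resonance(300.0, 700.0, glide_samples=200)  # vocal tract in transition
hf = gliding_resonance(2500.0, 2500.0, glide_samples=1) # front cavity, fixed here
x = time_varying_output(h, hf, u, max_lag=120)
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is a convenient sanity check on the sketch.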
Elements of a Language: Transitional Speech Sounds
Diphthongs
- Vowel-like in nature, with vibrating vocal folds.
- Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
- Characterized by movement from one vowel target to another.
- Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
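The idea of moving from one vowel target to another can be sketched as a formant trajectory. This is only an illustration under stated assumptions: the straight-line interpolation, the function name `formant_track`, and the target values (rounded average adult-male F1/F2 values commonly quoted for /a/ and /i/) are choices made here, and real diphthongs follow smoother paths and often do not reach the second target.

```python
def formant_track(start, end, n_frames):
    """Linearly interpolate (F1, F2) in Hz from a start target to an end target."""
    track = []
    for i in range(n_frames):
        t = i / (n_frames - 1)          # 0.0 at the first frame, 1.0 at the last
        track.append(tuple(a + t * (b - a) for a, b in zip(start, end)))
    return track

# Illustrative targets for the diphthong in "hide": from an /a/-like
# configuration toward an /i/-like configuration (assumed values).
A_TARGET = (730.0, 1090.0)
I_TARGET = (270.0, 2290.0)

track = formant_track(A_TARGET, I_TARGET, n_frames=5)
for f1, f2 in track:
    print(f"F1 = {f1:6.1f} Hz, F2 = {f2:6.1f} Hz")
```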
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
Two categories of vowel-like sounds: glides (w as in "we" and y as in "you") and liquids (r as in "read" and l as in "let").
Glides, compared to diphthongs, have greater constriction of the oral tract during the transition and greater speed of oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).
Coarticulation
Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
- Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
- Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
- Intonation (change in pitch)
- Amplitude/energy (loudness)
- Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved", appearing as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |H̃(ω)| due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence h̃[n].
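The effect of window length on harmonic structure can be checked on a toy signal. The sketch below is an assumption-laden illustration rather than a spectrogram implementation: it uses an idealized impulse train (100-Hz pitch at an assumed 8-kHz sampling rate) instead of speech, evaluates a single Hamming-windowed spectral slice, and compares the magnitude on a harmonic (300 Hz) with the magnitude halfway between harmonics (350 Hz). The names `windowed_magnitude` and `harmonic_contrast` are hypothetical.

```python
import cmath
import math

def hamming(N):
    """Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def windowed_magnitude(x, start, N, freq_hz, fs):
    """Magnitude of the Fourier transform of x[start:start+N] times a Hamming window."""
    w = hamming(N)
    acc = 0j
    for n in range(N):
        acc += w[n] * x[start + n] * cmath.exp(-2j * math.pi * freq_hz * n / fs)
    return abs(acc)

FS = 8000.0
PITCH_PERIOD = 80                      # samples, i.e. 100-Hz pitch
x = [1.0 if n % PITCH_PERIOD == 0 else 0.0 for n in range(800)]

def harmonic_contrast(N, start):
    """Ratio of the magnitude on the 3rd harmonic to that between harmonics."""
    on = windowed_magnitude(x, start, N, 300.0, FS)
    off = windowed_magnitude(x, start, N, 350.0, FS)
    return on / off

narrow = harmonic_contrast(N=160, start=40)   # 20-ms window, two pulses inside
wide = harmonic_contrast(N=32, start=70)      # 4-ms window, one pulse inside
print(f"20-ms window contrast: {narrow:.1f}")
print(f"4-ms window contrast:  {wide:.3f}")
```

With the long window the slice shows a strong peak-to-valley contrast at the harmonic spacing; with the short window only one pulse is seen, so the slice is flat in frequency for this idealized source.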
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$
where |H̃(ω)| is the spectral envelope, β is a constant scale factor, and E[τ] is the energy in the waveform under the sliding window at time τ:
$$E[\tau] = \sum_{n} |x[n, \tau]|^2$$
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
The wideband spectrogram:
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
- The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
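The origin of the vertical striations can be illustrated by computing the energy under a sliding window. This is a rough sketch under assumptions chosen here: a rectangular window instead of a Hamming window, a synthetic "voiced" signal made of a decaying 500-Hz resonance restarted every pitch period, an 8-kHz sampling rate, and the hypothetical helper names `sliding_energy` and `fluctuation`.

```python
import math

def sliding_energy(x, N, hop):
    """Energy of x under a length-N rectangular window at positions 0, hop, 2*hop, ..."""
    return [sum(v * v for v in x[t:t + N]) for t in range(0, len(x) - N + 1, hop)]

FS = 8000.0
P = 80   # pitch period in samples (100-Hz pitch)

# Synthetic voiced sound: a decaying resonance re-excited at every pitch period.
x = [(0.95 ** (n % P)) * math.cos(2.0 * math.pi * 500.0 * (n % P) / FS)
     for n in range(800)]

short_win = sliding_energy(x, N=32, hop=8)    # 4 ms, shorter than a pitch period
long_win = sliding_energy(x, N=160, hop=8)    # 20 ms, exactly two pitch periods

def fluctuation(energy):
    """Ratio of the largest to the smallest window energy."""
    return max(energy) / min(energy)

print(f"short-window energy fluctuation: {fluctuation(short_win):.1f}")
print(f"long-window energy fluctuation:  {fluctuation(long_win):.3f}")
```

The short-window energy rises and falls once per pitch period, which is what appears as vertical striations; the long window averages over whole periods and its energy stays essentially constant.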
Figure 3.15
Figure 3.16
MATLAB figures: waveform (normalized magnitude vs. time [s], 0 to 3 s) and spectrogram (frequency [Hz], 0 to 8000 Hz, vs. time [s]) of the utterance "She had your dark suit in greasy wash water all year", with word and phone labels:
h# | she: sh iy | had: hv ae dcl jh | your: ax-h | dark: dcl d aa r kcl k | suit: s ux tcl | in: ax-h n | greasy: gcl g r iy s ix | wash: w aa sh | water: w ao dx axr | all: ao l | year: y ih axr | h#
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.
If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two perspectives (waveform and spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
- Vocal fold state: vibrating or open
- Tongue position and height: front, central, or back along the palate
- Constriction: partial or complete
- Velum state: nasal sound or not a nasal sound
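The feature list above maps naturally onto a small record type. The sketch below is one possible encoding, assumed for illustration only: the class and field names are hypothetical, tongue height is omitted, and the three example entries use a coarse textbook description of the velar consonants /k/, /g/, and /ng/.

```python
from dataclasses import dataclass, fields
from enum import Enum

class VocalFolds(Enum):
    VIBRATING = "vibrating"
    OPEN = "open"

class TonguePosition(Enum):
    FRONT = "front"
    CENTRAL = "central"
    BACK = "back"

class Constriction(Enum):
    PARTIAL = "partial"
    COMPLETE = "complete"

@dataclass(frozen=True)
class ArticulatoryFeatures:
    vocal_folds: VocalFolds
    tongue_position: TonguePosition
    constriction: Constriction
    nasal: bool                      # True when the velum is lowered

def differing_features(a, b):
    """Names of the features on which two phonemes differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]

# Coarse, illustrative feature assignments (assumptions for this sketch).
K = ArticulatoryFeatures(VocalFolds.OPEN, TonguePosition.BACK, Constriction.COMPLETE, False)
G = ArticulatoryFeatures(VocalFolds.VIBRATING, TonguePosition.BACK, Constriction.COMPLETE, False)
NG = ArticulatoryFeatures(VocalFolds.VIBRATING, TonguePosition.BACK, Constriction.COMPLETE, True)

print(differing_features(K, G))    # voicing distinguishes /k/ from /g/
print(differing_features(G, NG))   # velum state distinguishes /g/ from /ng/
```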
Elements of a Language
In English, the combinations of features give about 40 phonemes. Other languages can yield a smaller or larger number:
- 11 in Polynesian
- 141 in the "click" language of Khoisan
Rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not normally allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, there is much variability in vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variation:
- the place and degree of constriction of the tongue hump, and
- the cross-section and length of the vocal tract.
Therefore the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
- The velum is lowered and the air flows mainly through the nasal cavity.
- Because the oral tract is constricted, the sound is radiated at the nostrils.
- Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
- Is dominated by the low resonance of the large volume of the nasal cavity.
- The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
- These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
- The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
- Vocal folds are relaxed and not vibrating for unvoiced fricatives.
- For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
- Noise is generated by turbulent airflow at some point of constriction along the oral tract.
- The constriction is narrower than with vowels.
System: the location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram: noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
A simplified model of the periodic component of the output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component of the turbulent airflow velocity at the constriction is denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
$$x_q[n] = h_f[n] * (q[n]\, u[n])$$
In the simplified model, we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
- u[n] is modified by the oral cavity.
- x_q[n] can be influenced by the back cavity.
- Sources of nonlinear effects (distributed sources due to traveling vortices).
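The two-source model of Example 3.4 can be turned into a short synthesis sketch. Everything specific below is an assumption made for illustration: the raised-cosine glottal pulse with a closed phase, the decaying-cosine impulse responses for h[n] and h_f[n], the 8-kHz sampling rate, the Gaussian noise generator, and the function names `convolve` and `voiced_fricative`.

```python
import math
import random

def convolve(a, b):
    """Direct-form linear convolution of two lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        if av == 0.0:
            continue
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def voiced_fricative(g, P, n_periods, h, hf, seed=0):
    """Return (u, xg, xq, x) with x = h*u + hf*(q u), truncated to P*n_periods samples."""
    rng = random.Random(seed)
    length = P * n_periods
    p = [1.0 if n % P == 0 else 0.0 for n in range(length)]  # impulse train
    u = convolve(g, p)[:length]                              # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]         # white noise source
    modulated = [qn * un for qn, un in zip(q, u)]            # noise modulated by u[n]
    xg = convolve(h, u)[:length]
    xq = convolve(hf, modulated)[:length]
    x = [a + b for a, b in zip(xg, xq)]
    return u, xg, xq, x

FS = 8000.0
P = 80
# Assumed glottal pulse: open for half of the period, closed (zero) for the rest.
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 40)) for n in range(40)]
# Assumed responses: low resonance for the full tract, high one for the front cavity.
h = [(0.96 ** n) * math.cos(2.0 * math.pi * 500.0 * n / FS) for n in range(100)]
hf = [(0.90 ** n) * math.cos(2.0 * math.pi * 3000.0 * n / FS) for n in range(50)]

u, xg, xq, x = voiced_fricative(g, P, n_periods=5, h=h, hf=hf)
```

Because the noise is multiplied by u[n], the noise excitation vanishes whenever the glottal flow is zero, so the turbulent component is pulsed at the pitch rate in this simplified model.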
Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
April 22 2023 Veton Keumlpuska 99
Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a
target state or shape often the target is never reached Our speech anatomy cannot move to a desired position
instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and
graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level
Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo
Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes
April 22 2023 Veton Keumlpuska 100
Prosody The Melody of Speech Prosody of a language is defined by the
rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)
These rules are followed to convey different Meaning Stress and Emotion
April 22 2023 Veton Keumlpuska 101
Figure 329 - Prosody
April 22 2023 Veton Keumlpuska 102
Figure 330 ndash Global Coarticulation
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
\[ S(\omega, \tau) \approx \beta \, |H(\omega)|^2 \, E[\tau] \]
where \beta is a constant scale factor and E[\tau] is the energy in the waveform under the sliding window positioned at time \tau:
\[ E[\tau] = \sum_{n} |x[n, \tau]|^2 \]
Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
The wideband spectrogram:
  Shows the formants of the vocal tract in frequency.
  Gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
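The effect of window length on the energy under the sliding window can be checked numerically. The following is a minimal sketch, not taken from the slides: the test signal (a decaying pulse repeated every 10 ms), the 8 kHz sampling rate, and all function names are assumptions chosen for illustration. It computes the short-time energy under a 4-ms and a 20-ms Hamming window and compares how strongly each fluctuates as the window slides.

```python
import math

def hamming(length):
    """Hamming window of the given length in samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (length - 1)) for k in range(length)]

def short_time_energy(x, window):
    """Energy of the windowed segment at every window position (hop of one sample)."""
    n_win = len(window)
    return [
        sum((window[k] * x[start + k]) ** 2 for k in range(n_win))
        for start in range(len(x) - n_win + 1)
    ]

def pulse_train(n_samples, period, decay=0.9):
    """Hypothetical voiced-like signal: a decaying pulse restarted every `period` samples."""
    return [decay ** (n % period) for n in range(n_samples)]

fs = 8000                      # assumed sampling rate in Hz
period = 80                    # assumed pitch period: 10 ms at 8 kHz
x = pulse_train(1600, period)  # 200 ms of signal

wide_len = fs * 4 // 1000      # 4-ms window  -> 32 samples
narrow_len = fs * 20 // 1000   # 20-ms window -> 160 samples
e_wide = short_time_energy(x, hamming(wide_len))
e_narrow = short_time_energy(x, hamming(narrow_len))

# Ratio of largest to smallest energy as the window slides along the signal
ripple_wide = max(e_wide) / min(e_wide)
ripple_narrow = max(e_narrow) / min(e_narrow)
print(f"4-ms window  max/min energy: {ripple_wide:.1f}")
print(f"20-ms window max/min energy: {ripple_narrow:.2f}")
```

With these assumed values the 4-ms window is shorter than one pitch period, so its energy swings widely within each period, whereas the 20-ms window always spans two periods and its energy stays comparatively flat.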
Figure 3.15
Figure 3.16
[Figure (MATLAB): spectrogram; axes Time [s] (0 to 3) and Frequency [Hz] (0 to 8000).]
[Figure (MATLAB): waveform (Normalized Magnitude vs. Time [s]) and spectrogram (Frequency [Hz] vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: h, sh iy (she), hv ae dcl jh (had), ax-h (your), dcl d aa r kcl k (dark), s ux tcl (suit), ax-h n (in), gcl g r iy s ix (greasy), w aa sh (wash), w ao dx axr (water), ao l (all), y ih axr (year), h.]
[The same two figures appear a second time on the following two slides.]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
  The place of the tongue hump along the oral tract.
  The degree of constriction at the hump.
  The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.
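The three source categories in item 1 can be sketched as idealized discrete-time signals. The example below is illustrative only: the function names, lengths, and pitch period are assumptions, and real speech sources are never perfectly periodic, noisy, or impulsive.

```python
import random

def periodic_source(n_samples, period):
    """Idealized periodic source: a unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def noisy_source(n_samples, seed=0):
    """Idealized noisy source: zero-mean Gaussian white noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def impulsive_source(n_samples, burst_at):
    """Idealized impulsive source: a single unit impulse at index `burst_at`."""
    return [1.0 if n == burst_at else 0.0 for n in range(n_samples)]

periodic = periodic_source(400, period=80)       # e.g. a vowel-like excitation
noisy = noisy_source(400)                        # e.g. a fricative-like excitation
impulsive = impulsive_source(400, burst_at=100)  # e.g. a plosive-like burst

print("periodic pulses: ", sum(1 for v in periodic if v != 0.0))
print("impulsive pulses:", sum(1 for v in impulsive if v != 0.0))
print("noise sample mean:", sum(noisy) / len(noisy))
```

A combination source, as in the fourth option of item 1, would simply be a sum or product of these sequences.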
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two perspectives (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two (the waveform and the spectrogram) are used, it is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of:
  Vowels
  Consonants
  Diphthongs
  Affricates
  Semi-vowels
This classification is illustrated in Figure 3.17 on the next slide.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
Rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking
  The time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: in "butter", "but", and "to", the t in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source
  Quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System
  Each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram
  The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform
  Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  Articulatory differences among speakers are one cause of allophonic variation:
    The place and degree of constriction of the tongue hump.
    The cross-section and length of the vocal tract.
  Consequently, the vocal tract formants vary with the speaker.
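How vocal tract length alone shifts the formants can be illustrated with a much simpler model than the slides use. The sketch below assumes a lossless uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). The tube model, the speed of sound of 350 m/s, and the two tract lengths are assumptions for illustration, not values from the slides.

```python
def uniform_tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of a lossless uniform tube closed at one end and open at the other."""
    return [
        (2 * k - 1) * speed_of_sound / (4.0 * length_m)
        for k in range(1, n_formants + 1)
    ]

# Two hypothetical speakers that differ only in vocal tract length
for label, length in [("longer tract, 0.175 m", 0.175), ("shorter tract, 0.140 m", 0.140)]:
    formants = uniform_tube_formants(length)
    print(label, [round(f) for f in formants])
```

Under this idealization every resonance scales inversely with tract length, so the shorter tract has uniformly higher formants. A real vocal tract is not uniform, so this only shows the direction of the effect.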
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source
  Quasi-periodic airflow puffs from the vibrating vocal folds.
System
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System
  The location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram
  Noise-like.
  Energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
\[ u[n] = g[n] * p[n] \]
where * denotes convolution and
  g[n] is the glottal flow over one cycle
  p[n] is an impulse train with pitch period P
A simplified model of the voiced fricative output at the lips due to the periodic source is
\[ x_g[n] = h[n] * (g[n] * p[n]) \]
  h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n]
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response h_f[n]:
\[ x_q[n] = h_f[n] * (q[n] \, u[n]) \]
Example 3.4 (cont.)
In the simplified model we assume that the results of the two airflow sources add:
\[ x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] \, u[n]) \]
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  x_q[n] can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
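The additive model of Example 3.4 can be sketched in a few lines of code. Everything numerical below is an assumption for illustration: the raised-cosine glottal pulse, the damped-cosine impulse responses standing in for h[n] and h_f[n], the 8 kHz sampling rate, and the helper names are not from the slides. The sketch builds u[n] as g[n] convolved with p[n], modulates white noise q[n] by u[n], filters both paths, and adds them.

```python
import math
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def damped_cosine(freq_hz, fs, length, radius=0.95):
    """Hypothetical stand-in for a cavity impulse response: an exponentially damped cosine."""
    w = 2 * math.pi * freq_hz / fs
    return [radius ** n * math.cos(w * n) for n in range(length)]

fs = 8000          # assumed sampling rate in Hz
n_samples = 800
period = 80        # assumed pitch period P in samples

# u[n] = g[n] * p[n]: one glottal cycle convolved with an impulse train
g = [0.5 * (1 - math.cos(2 * math.pi * k / 40)) for k in range(40)]  # assumed pulse shape
p = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
u = convolve(g, p)[:n_samples]

h = damped_cosine(500, fs, 100)     # assumed vocal tract response h[n]
h_f = damped_cosine(3000, fs, 50)   # assumed front-cavity response h_f[n]

rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]   # white noise q[n]

x_g = convolve(h, u)[:n_samples]                                     # h[n] * u[n]
x_q = convolve(h_f, [qn * un for qn, un in zip(q, u)])[:n_samples]   # h_f[n] * (q[n] u[n])
x = [a + b for a, b in zip(x_g, x_q)]                                # x[n] = x_g[n] + x_q[n]

print("samples generated:", len(x))
```

Because q[n] is multiplied by u[n], the noise excitation is switched off whenever the glottal flow is zero, so the noise in this sketch arrives in pitch-synchronous bursts.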
Elements of a Language: Fricatives
Spectrogram
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Whisper forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System
  The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (Waveform)
Elements of a Language: Plosives (Spectrogram)
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source, and a periodic source can also be present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse \delta[n]. The glottal flow velocity model for the periodic source component is given by
\[ u[n] = g[n] * p[n] \]
  g[n] is the glottal flow over one cycle
  p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n, m].
h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
\[ x[n] = \sum_{m=-\infty}^{\infty} h[n, m] \, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \, \delta[n - m] \]
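The generalized convolution of Example 3.5 can be sketched directly as a double loop. The specific responses below are hypothetical: h(n, m) is a damped cosine whose resonance glides from 300 Hz to 700 Hz, h_f(n, m) is a fixed damped cosine kept constant in n for simplicity, the responses are truncated at max_lag samples, and the glottal pulse shape and 8 kHz sampling rate are assumed.

```python
import math

def time_varying_filter(h, u, max_lag):
    """Return y[n] = sum over m = 0..max_lag of h(n, m) * u[n - m] for a causal input u.

    h(n, m) is the response at time n to a unit sample applied m samples earlier.
    """
    y = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n, max_lag) + 1):
            acc += h(n, m) * u[n - m]
        y.append(acc)
    return y

fs = 8000          # assumed sampling rate in Hz
n_samples = 400
period = 80        # assumed pitch period P in samples
max_lag = 100      # responses are truncated after this many samples

def h(n, m):
    """Hypothetical vocal tract response: resonance glides from 300 Hz to 700 Hz over 200 samples."""
    freq = 300.0 + 400.0 * min(n, 200) / 200.0
    return 0.95 ** m * math.cos(2 * math.pi * freq * m / fs)

def h_f(n, m):
    """Hypothetical front-cavity response: a 2500 Hz damped cosine, held fixed in n for simplicity."""
    return 0.8 ** m * math.cos(2 * math.pi * 2500.0 * m / fs)

g = [0.5 * (1 - math.cos(2 * math.pi * k / 40)) for k in range(40)]   # assumed glottal pulse
u = [g[n % period] if n % period < len(g) else 0.0 for n in range(n_samples)]
burst = [1.0 if n == 0 else 0.0 for n in range(n_samples)]            # delta[n]

x_periodic = time_varying_filter(h, u, max_lag)
x_burst = time_varying_filter(h_f, burst, max_lag)
x = [a + b for a, b in zip(x_periodic, x_burst)]                      # linear combination

print("first samples of x[n]:", [round(v, 3) for v in x[:5]])
```

Since the burst is an impulse at n = 0, its term reduces to h_f(n, n), which the sketch reproduces. When h does not depend on n, the double loop reduces to ordinary convolution.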
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels
  Two categories of vowel-like sounds:
    Glides (w as in "we" and y as in "you")
    Liquids (r as in "read" and l as in "let")
  Glides, compared to diphthongs, have:
    Greater constriction of the oral tract during the transition.
    Greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates
  Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  The difference compared to fricatives is that affricates have:
    A fricative portion preceded by a complete constriction of the oral cavity.
    The constriction formed at the same place as for the plosive.
  Examples:
    tS as in "chew" (unvoiced)
    J as in "just" (voiced)
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
April 22 2023 Veton Keumlpuska 65
Spectrographic Analysis of SpeechWideband Spectrogram (cont)
Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather
than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding
through fluctuating energy regions of the speech waveform
Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
April 22 2023 Veton Keumlpuska 66
Figure 315
April 22 2023 Veton Keumlpuska 67
Figure 316
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept
introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]
h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m
The output then can be written using generalization of the convolution operator
We have assumed that two outputs can be linearly combined
mnmnhmnumnhnxm
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language Transitional Speech Sounds Diphthongs
Vowel like nature with vibrating vocal folds
Do not have a steady vocal tract configuration They are produced by
varying in time the vocal tract smoothly between two vowel configurations
Characterized by movement from one vowel target to another
hide Y out W boy O new JU
April 22 2023 Veton Keumlpuska 96
Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like
sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)
Glides Greater constriction of oral tract during the
transition and Greater speed of the oral tract movement
compared to diphthongs
April 22 2023 Veton Keumlpuska 97
Figure 328 ndash Liquids amp Glides
April 22 2023 Veton Keumlpuska 98
Elements of a Language Transitional Speech Sounds Affricates are the counterpart of
diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that
the affricates have A fricative portion preceded by a complete
constriction of the oral cavity Formed at the same place as for the plosive
Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced
April 22 2023 Veton Keumlpuska 99
Coarticulation
- Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached
  - Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present
  - Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going
- Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance
- Coarticulation can occur on different temporal levels:
  - Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep")
  - Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes

Prosody: The Melody of Speech
- The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  - Intonation (change in pitch)
  - Amplitude/energy (loudness)
  - Timing (articulation rate or rhythm)
- These rules are followed to convey different meaning, stress, and emotion
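
Two of the prosodic cues listed above, energy and pitch, can be tracked frame by frame. The following is a minimal sketch and is not taken from the slides: the function names `short_time_energy` and `autocorr_pitch`, the 60-400 Hz search range, the 8000 Hz sampling rate, and the synthetic 100 Hz test tone are all illustrative assumptions, and autocorrelation peak-picking is only one of several common pitch estimators.

```python
import math


def short_time_energy(x, frame_len, hop):
    """Mean squared amplitude of each frame (a simple loudness proxy)."""
    return [
        sum(v * v for v in x[start:start + frame_len]) / frame_len
        for start in range(0, len(x) - frame_len + 1, hop)
    ]


def autocorr_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate pitch in Hz from the autocorrelation peak within [f_min, f_max]."""
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        val = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag


fs = 8000  # sampling rate in Hz (assumed)
tone = [math.sin(2 * math.pi * 100.0 * n / fs) for n in range(400)]  # 50 ms of a 100 Hz tone
energy = short_time_energy(tone, frame_len=80, hop=80)
pitch_hz = autocorr_pitch(tone, fs)
```

On real speech the same two functions would be applied to successive frames, giving an energy contour and a pitch contour whose changes over several phonemes carry the intonation and stress described above.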

Figure 3.29: Prosody

Figure 3.30: Global Coarticulation

Figure 3.15

Figure 3.16

[Figure (MATLAB): spectrogram of the utterance "She had your dark suit in greasy wash water all year". Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

[Figure (MATLAB): waveform (normalized magnitude, -1.5 to 2) above the spectrogram (Time [s], 0 to 3; Frequency [Hz], 0 to 8000) of the same utterance, with word and phone labels:
h | she: sh iy | had: hv ae dcl jh | your: ax-h | dark: dcl d aa r kcl k | suit: s ux tcl | in: ax-h n | greasy: gcl g r iy s ix | wash: w aa sh | water: w ao dx axr | all: ao l | year: y ih axr | h]

Categorization of Speech Sounds
- A sound source can be created with either the vocal folds or a constriction in the vocal tract
- Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
  2. The shape of the vocal tract (place and manner of articulation):
     - The place of the tongue hump along the oral tract
     - The degree of constriction of the hump
     - The shape is also determined by a possible connection to the nasal passage by way of the velum
  3. The time-domain waveform, which gives the pressure change with time at the lips output
  4. The time-varying spectral characteristics revealed through the spectrogram
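
The fourth perspective, the time-varying spectrum, is what a spectrogram displays: the magnitude of a short-time Fourier transform computed over successive windowed frames. Below is a minimal sketch in plain Python, assuming a Hann window and a direct DFT chosen for clarity rather than speed. The function name `stft_magnitude`, the frame length, the hop, and the 1000 Hz test tone are illustrative choices, not values from the slides.

```python
import cmath
import math


def stft_magnitude(x, frame_len, hop):
    """Return one magnitude spectrum (bins 0..frame_len/2) per windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


fs = 8000  # sampling rate in Hz (assumed)
tone = [math.sin(2 * math.pi * 1000.0 * n / fs) for n in range(256)]
spec = stft_magnitude(tone, frame_len=64, hop=32)

# Strongest bin in each frame, converted to Hz (bin spacing is fs / frame_len).
peak_bins = [frame.index(max(frame)) for frame in spec]
peak_hz = [k * fs / 64 for k in peak_bins]
```

Each row of `spec` corresponds to one time slice of the spectrogram. A short frame gives fine time resolution and coarse frequency resolution, and a long frame gives the reverse.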

Elements of a Language
- Phoneme: a fundamental distinctive unit of a language
  - To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme
- Different languages contain different phoneme sets
  - Syllables contain one or more phonemes
  - Words are formed from one or more syllables
  - Phrases are concatenations of words
- If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics
- If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics

Elements of a Language
- One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels
- This classification is illustrated in Figure 3.17

Figure 3.17

Elements of a Language
- Phonemes arise from a combination of vocal fold and vocal tract articulatory features
- Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open
  - Tongue position and height: front, central, or back along the palate
  - Constriction: partial or complete
  - Velum state: nasal sound or not a nasal sound

Elements of a Language
- In English, the combinations of features give 40 phonemes
- Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan
- The rules of a language define which phones can be strung together, and how, to form words (in Italian, consonants are not allowed at the end of words)
- A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents)
- The articulatory properties are influenced by:
  - Adjacent phonemes
  - Speaking rate
  - Emphasis in speaking
  - The time-varying nature of the articulators
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different
- Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

Elements of a Language: Vowels
- Source: quasi-periodic
  - Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch
- System: each vowel phoneme corresponds to a different vocal tract configuration
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram)
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19)
- In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers
  - Articulatory differences between speakers are one cause of allophonic variations:
    - The place and degree of constriction of the tongue hump
    - The cross-section and length of the vocal tract
  - Therefore the vocal tract formants vary with the speaker
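
The dependence of formants on vocal tract length can be made concrete with a standard idealization that the slides do not state: a uniform lossless tube, closed at the glottis and open at the lips, resonates at odd multiples of c/(4L). The sketch below assumes c = 350 m/s and two illustrative tract lengths (17.5 cm and 14 cm). Real vocal tracts are not uniform, so these numbers only indicate the trend; the function name `tube_formants` is hypothetical.

```python
def tube_formants(length_m, n_formants=3, c=350.0):
    """Resonances (Hz) of a uniform lossless tube, closed at one end and open at the other."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n_formants + 1)]


longer_tract = tube_formants(0.175)   # roughly 500, 1500, 2500 Hz
shorter_tract = tube_formants(0.14)   # roughly 625, 1875, 3125 Hz
```

Under this idealization every resonance scales inversely with length, so a shorter tract shifts all formants upward by the same ratio.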

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
- Source: quasi-periodic airflow puffs from the vibrating vocal folds
- System:
  - The velum is lowered and the air flows mainly through the nasal cavity
  - Because the oral tract is constricted, the sound is radiated at the nostrils
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20)
- Spectrogram:
  - Dominated by the low resonance of the large volume of the nasal cavity
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue
  - These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract
  - Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range
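
The effect of an anti-resonance can be illustrated with a two-zero filter, whose frequency response dips near the frequency of its zeros. This is a toy sketch, not a nasal tract model: the zero frequency (1000 Hz), zero radius (0.95), sampling rate (8000 Hz), and the function name `zero_pair_response` are assumptions chosen only to show the spectral notch.

```python
import cmath
import math


def zero_pair_response(freq_hz, zero_hz, radius, fs):
    """|H| at freq_hz for H(z) = 1 - 2 r cos(theta) z^-1 + r^2 z^-2."""
    theta = 2 * math.pi * zero_hz / fs   # angle of the conjugate zero pair
    w = 2 * math.pi * freq_hz / fs       # evaluation frequency in radians/sample
    z_inv = cmath.exp(-1j * w)
    return abs(1 - 2 * radius * math.cos(theta) * z_inv + radius ** 2 * z_inv ** 2)


fs = 8000
at_notch = zero_pair_response(1000.0, zero_hz=1000.0, radius=0.95, fs=fs)
at_dc = zero_pair_response(0.0, zero_hz=1000.0, radius=0.95, fs=fs)
at_nyquist = zero_pair_response(4000.0, zero_hz=1000.0, radius=0.95, fs=fs)
```

The response is smallest at the zero frequency, which is the sense in which an anti-resonance "absorbs" energy from the spectrum around it.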

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
- There are two broad classes of fricatives: voiced and unvoiced
- Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract
  - The constriction is narrower than with vowels
- System: the location of the constriction determines which sound is produced; the tongue can constrict at the back, center, or front of the oral tract, and constrictions can also be made at the teeth or lips
- Spectrogram: noise-like, with energy concentrated in the higher frequencies

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

In a simplified model, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of the vocal tract, assumed linear and time-invariant, driven by the periodic signal u[n].

The noise source component is the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

We assume in this simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
- u[n] is modified by the oral cavity
- x_q[n] can be influenced by the back cavity
- Sources of non-linear effects (distributed sources due to traveling vortices)
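
The additive model of Example 3.4 can be simulated directly, with h[n] and h_f[n] applied by convolution and q[n]u[n] taken as a sample-by-sample product. The sketch below is a hedged illustration only: the raised-cosine glottal pulse, the single damped-cosine impulse responses, the sampling rate, and every numeric value are assumptions invented for the example, not parameters from the slides.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def damped_cosine(freq_hz, decay, fs, length):
    """Toy impulse response: one exponentially damped resonance."""
    return [math.exp(-decay * n / fs) * math.cos(2 * math.pi * freq_hz * n / fs)
            for n in range(length)]


fs = 8000   # sampling rate in Hz (assumed)
N = 800     # number of samples to synthesize
P = 80      # pitch period in samples (100 Hz at this fs)

g = [0.5 * (1 - math.cos(2 * math.pi * n / 40)) for n in range(40)]  # one glottal pulse (assumed shape)
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]                   # impulse train
u = convolve(g, p)[:N]                                               # u[n] = g[n] * p[n]

rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(N)]                          # white noise source
gated_noise = [qn * un for qn, un in zip(q, u)]                      # q[n] u[n]: noise modulated by glottal flow

h = damped_cosine(500.0, 300.0, fs, 200)      # whole vocal tract (toy)
h_f = damped_cosine(3000.0, 800.0, fs, 100)   # front cavity (toy)

x_g = convolve(h, u)[:N]                      # periodic component
x_q = convolve(h_f, gated_noise)[:N]          # noise component
x = [a + b for a, b in zip(x_g, x_q)]         # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by u[n] before filtering, it is switched off whenever the glottal flow is zero, which is why the noise in a voiced fricative appears in bursts synchronized with the pitch period.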

Elements of a Language: Fricatives
- Spectrogram:
  - Unvoiced fricatives are characterized by a "noisy" spectrum
  - Voiced fricatives often show both noise and harmonics
- Waveform:
  - An unvoiced fricative contains only noise
  - A voiced fricative contains noise superimposed on a quasi-periodic signal

Elements of a Language: Whisper
- Forms a class of its own under the general category of consonants
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
- Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow
- As with fricatives, plosives can be voiced or unvoiced
- System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24)
- Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short time duration
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst
- With voiced plosives, the vocal folds vibrate for the duration of all 4 steps
  - While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar"
  - After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration
  - There is a much shorter delay between the burst and the voicing of the vowel onset
- Figure 3.26 compares a voiced/unvoiced plosive pair

Elements of a Language: Plosives, Waveform [figure]

Elements of a Language: Plosives, Spectrogram [figure]

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
- A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel
- Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]
- The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

  where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution
- Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel
  - This implies that the vocal tract output cannot be obtained by the convolution operator
  - The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2
- In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response h_f[n, m]
- h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m
- The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:

$$x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$$
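
The generalized convolution of Example 3.5 can be evaluated with a double loop over n and m. The sketch below is a minimal illustration under stated assumptions: the impulse responses are toy one-pole decays whose pole drifts linearly over the utterance, the glottal source is reduced to a bare impulse train (that is, g[n] is taken as a unit sample), and all names and numeric values (`time_varying_filter`, `pole`, N, P, M) are hypothetical.

```python
def time_varying_filter(h, u, max_lag):
    """x[n] = sum over m of h(n, m) * u[n - m], for m = 0 .. max_lag - 1.

    h(n, m) is the response at time n to a unit sample applied m samples earlier.
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(max_lag, n + 1)):
            acc += h(n, m) * u[n - m]
        x.append(acc)
    return x


N = 200   # number of output samples (assumed)
P = 50    # pitch period in samples (assumed)
M = 40    # longest lag kept in each impulse response (assumed)


def pole(n, start, end):
    """Pole radius drifting linearly from start to end over the N samples."""
    return start + (end - start) * n / (N - 1)


def h(n, m):
    # Toy vocal tract: a one-pole decay whose pole moves as the tract opens into the vowel.
    return pole(n, 0.5, 0.9) ** m


def h_f(n, m):
    # Toy front cavity: a faster decay that also changes with time.
    return pole(n, 0.3, 0.6) ** m


u = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # periodic source (impulse train only)
burst = [1.0] + [0.0] * (N - 1)                       # delta[n]: burst at n = 0

periodic_part = time_varying_filter(h, u, M)
burst_part = time_varying_filter(h_f, burst, M)
x = [a + b for a, b in zip(periodic_part, burst_part)]
```

When h(n, m) does not depend on n, the inner sum reduces to ordinary convolution, which is a convenient sanity check. With the impulse burst, the second sum collapses to h_f[n, n], the front cavity's response at time n to the sample applied at time 0.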
MATLAB
April 22 2023 Veton Keumlpuska 68
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 69
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  - 11 in Polynesian
  - 141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  - In Italian, consonants are normally not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  - adjacent phonemes,
  - speaking rate,
  - emphasis in speaking, and
  - the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  - Example: in "butter", "but", and "to", the t in each word is somewhat different.
Motor theory of perception - uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  - Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  - Articulatory differences among speakers are one cause of allophonic variations:
      - the place and degree of constriction of the tongue hump, and
      - the cross-section and length of the vocal tract.
  - Consequently, the vocal tract formants vary with the speaker.
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  - Dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  - These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  - Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  - The constriction is narrower than with vowels.
System: the location of the constriction, formed by the tongue or lips, determines which sound is produced:
  - back, center, or front of the oral tract, as well as
  - the teeth or lips.
Spectrogram: noise-like, with energy concentrated at higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
    u[n] = g[n] * p[n]
where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.
In a simplified model, the periodic contribution to the voiced fricative output at the lips is
    x_g[n] = h[n] * (g[n] * p[n])
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates q[n], and the product excites the front oral cavity, which has impulse response h_f[n]:
    x_q[n] = h_f[n] * (q[n] u[n])
In the simplified model we assume that the results of the two airflow sources add:
    x[n] = x_g[n] + x_q[n]
         = h[n] * u[n] + h_f[n] * (q[n] u[n])
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  - u[n] is modified by the oral cavity,
  - x_q[n] can be influenced by the back cavity, and
  - sources of nonlinear effects (distributed sources due to traveling vortices).
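The additive model of Example 3.4 can be tried numerically. The sketch below uses only the Python standard library. The function names (convolve, impulse_train, voiced_fricative) and all numeric values for g, h, h_f, and the pitch period are illustrative assumptions, not measured responses.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, hf, period, length, seed=0):
    """Return (u, x_g, x_q, x) for the simplified additive model."""
    rng = random.Random(seed)
    p = impulse_train(length, period)
    u = convolve(g, p)[:length]                        # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]   # white noise q[n]
    xg = convolve(h, u)[:length]                       # x_g[n] = h[n] * u[n]
    qu = [qn * un for qn, un in zip(q, u)]             # noise modulated by glottal flow
    xq = convolve(hf, qu)[:length]                     # x_q[n] = h_f[n] * (q[n] u[n])
    x = [a + b for a, b in zip(xg, xq)]                # x[n] = x_g[n] + x_q[n]
    return u, xg, xq, x


# Toy (assumed) responses: a short glottal pulse, a decaying vocal tract
# response, and a two-tap front-cavity response.
g = [0.0, 0.5, 1.0, 0.5]
h = [1.0, 0.6, 0.36]
hf = [1.0, -0.9]
u, xg, xq, x = voiced_fricative(g, h, hf, period=8, length=32)
print([round(v, 3) for v in x[:8]])
```

Because q[n] is multiplied by u[n], the noise component in this toy setup is nonzero only around the samples where the glottal pulse is nonzero, which is the modulation effect the model describes.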
Elements of a Language: Fricatives
Spectrogram:
  - Unvoiced fricatives are characterized by a "noisy" spectrum, while
  - voiced fricatives often show both noise and harmonics.
Waveform:
  - An unvoiced fricative contains only noise.
  - A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  - Forms a class of its own under the general category of consonants.
  - Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  - While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  - After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  - The delay between the burst and the voicing of the vowel onset is much shorter.
Figure 3.26 compares a voiced/unvoiced plosive pair.
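The contrast in burst-to-voicing delay described above can be expressed as a simple timing measurement. This is a rough sketch only: the function names are hypothetical, the burst and voicing sample indices are assumed to be already known (for example from hand labels), and the 25 ms decision threshold is an assumed value chosen to sit below the 40-50 ms delay quoted for unvoiced plosives, not a figure from the slides.

```python
def voice_onset_time_ms(burst_sample, voicing_sample, fs_hz):
    """Delay between the burst and the onset of vowel voicing, in milliseconds."""
    return 1000.0 * (voicing_sample - burst_sample) / fs_hz


def classify_plosive(vot_ms, threshold_ms=25.0):
    """Label a plosive from its burst-to-voicing delay (threshold is an assumption)."""
    return "unvoiced" if vot_ms >= threshold_ms else "voiced"


fs = 16000  # assumed sampling rate in Hz

# Voicing starts 720 samples after the burst.
long_delay = voice_onset_time_ms(1600, 2320, fs)
# Voicing starts 160 samples after the burst.
short_delay = voice_onset_time_ms(1600, 1760, fs)

print(long_delay, classify_plosive(long_delay))
print(short_delay, classify_plosive(short_delay))
```

A single threshold ignores the voice bar and aspiration cues mentioned above, so it should be read as an illustration of the timing difference rather than a usable classifier.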
Elements of a Language: Plosives - Waveform (figure)
Elements of a Language: Plosives - Spectrogram (figure)
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse \delta[n]. The glottal flow velocity model for the periodic source component is
    u[n] = g[n] * p[n]
where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, owing to the changing vocal tract shape during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained by ordinary convolution; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m].
  - h[n, m] and h_f[n, m] represent the responses at time n to a unit sample applied m samples earlier, at time n - m.
Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:
    x[n] = \sum_{m} h[n, m] u[n - m] + \sum_{m} h_f[n, m] \delta[n - m]
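The generalized convolution of Example 3.5 can be evaluated directly from its definition. The sketch below is a minimal illustration, assuming causal responses and an input u[n] that is zero for n < 0. The function time_varying_output and the toy responses h and hf are hypothetical choices, not models of a real vocal tract.

```python
def time_varying_output(h, hf, u, length):
    """Evaluate x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) delta[n - m].

    h(n, m) and hf(n, m) give the response at time n to a unit sample
    applied m samples earlier, at time n - m.
    """
    x = []
    for n in range(length):
        # Causal input starting at n = 0, so only m = 0..n contributes.
        periodic = sum(h(n, m) * u[n - m] for m in range(n + 1))
        # delta[n - m] is nonzero only for m = n, so the burst term is hf(n, n).
        burst = hf(n, n)
        x.append(periodic + burst)
    return x


def h(n, m):
    """Toy vocal tract: decay rate glides from 0.5 to 0.8 over 20 samples (assumed)."""
    a = 0.5 + 0.3 * min(n, 20) / 20.0
    return a ** m


def hf(n, m):
    """Toy front-cavity response, here not actually varying with n (assumed)."""
    return (-0.6) ** m


# One glottal pulse repeated every 8 samples (illustrative values).
u = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0] * 4
x = time_varying_output(h, hf, u, len(u))
print([round(v, 3) for v in x[:8]])
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which gives a quick sanity check on the implementation.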
Elements of a Language: Transitional Speech Sounds
Diphthongs
  - Vowel-like in nature, with vibrating vocal folds.
  - Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  - Characterized by movement from one vowel target to another.
  - Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  - Glides (w as in "we" and y as in "you")
  - Liquids (r as in "read" and l as in "let")
Glides, compared to diphthongs, have
  - greater constriction of the oral tract during the transition, and
  - greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
  - tS as in "chew" (unvoiced)
  - J as in "just" (voiced)
Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  - Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  - Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  - Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  - Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  - intonation (change in pitch),
  - amplitude/energy (loudness), and
  - timing (articulation rate or rhythm).
These rules are followed to convey differences in meaning, stress, and emotion.
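Two of the prosodic quantities above, energy and pitch, are commonly tracked frame by frame. The sketch below shows one simple way to do this, assuming a short-time mean-square energy measure and an autocorrelation peak search for the pitch period. Both functions and the test signals are illustrative assumptions, not the method used to produce Figure 3.29.

```python
def short_time_energy(x, frame_len, hop):
    """Mean-square energy of successive frames (a rough loudness contour)."""
    return [
        sum(s * s for s in x[i:i + frame_len]) / frame_len
        for i in range(0, len(x) - frame_len + 1, hop)
    ]


def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) that maximizes the frame's autocorrelation."""
    def autocorr(lag):
        return sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
    return max(range(min_lag, max_lag + 1), key=autocorr)


# Toy signals: a silent stretch followed by a constant-amplitude stretch,
# and an impulse train with a period of 8 samples.
step = [0.0] * 4 + [1.0] * 4
train = [1.0 if n % 8 == 0 else 0.0 for n in range(64)]

print(short_time_energy(step, frame_len=4, hop=4))
print(pitch_period(train, min_lag=2, max_lag=20))
```

Tracking these values over an utterance gives energy and pitch-period contours; dividing the sampling rate by the pitch period converts it to a fundamental frequency in Hz.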
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
MATLAB figure: waveform (normalized magnitude versus time, 0-3 s) and spectrogram (frequency 0-8000 Hz versus time, 0-3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels.
MATLAB figure: spectrogram (frequency 0-8000 Hz versus time, 0-3 s).
MATLAB figure: waveform (normalized magnitude versus time, 0-3 s) and spectrogram (frequency 0-8000 Hz versus time, 0-3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels.
MATLAB
April 22 2023 Veton Keumlpuska 70
Freq
uenc
y [H
z]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
1000
2000
3000
4000
5000
6000
7000
8000
MATLAB
April 22 2023 Veton Keumlpuska 71
0 05 1 15 2 25 3
-15
-1
-05
0
05
1
15
2
Nor
mal
ized
Mag
nitu
deFr
eque
ncy
[Hz]
Time [s]
SPECTROGRAM
0 05 1 15 2 25 30
2000
4000
6000
8000
She had your dark suit in greasy wash water all year
h
sh
iy
she
hv
ae
dcl
jh
had
ax-h
your
dcl
d
aa
rkc
lk
dark
s
ux
tcl
suit
ax-h
n
in
gcl
g
r
iy
s
ix
greasy
w
aa
sh
wash
w
ao
dx
axr
water
ao
l
all
y
ih
axr
year
h
April 22 2023 Veton Keumlpuska 72
Categorization of Speech Sounds Sound source can be created with either the
vocal folds or constriction in the vocal tract
Classification of speech sounds can be also be done from the following perspectives
1 The nature of the source Periodic Noisy Impulsive or Combination of the three
2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal
passage by way of velum3 The time-domain waveform which gives the pressure change with
time at the lips output4 The time-varying spectral characteristics revealed through the
spectrogram
April 22 2023 Veton Keumlpuska 73
Elements of a Language Phoneme ndash a fundamental distinctive unit of a
language To emphasize the distinction between the concept of a
phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme
Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words
If first two factors are used to study speech sounds then this is referred to as articulatory phonetics
If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics
April 22 2023 Veton Keumlpuska 74
Elements of a Language One broad classification for English language is done in
terms of Vowels Consonants Diphthongs Affricates and Semi-vowels
In the next slide this classification is illustrated in Figure 317
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System:
  The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives (Waveform)
Elements of a Language: Plosives (Spectrogram)
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
  $u[n] = g[n] * p[n]$
where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n,m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n,m]$.
$h[n,m]$ and $h_f[n,m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n-m$.
The output can then be written using a generalization of the convolution operator:
  $x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$
where we have assumed that the two outputs can be linearly combined.
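The generalized convolution sum of Example 3.5 can be evaluated directly. The sketch below is a minimal illustration under stated assumptions: the signals are causal and start at n = 0, the responses are stored as finite arrays h[n, m] of shape (N, M), and the function name time_varying_output is hypothetical.

```python
import numpy as np

def time_varying_output(h, h_f, u):
    """x[n] = sum_m h[n,m] u[n-m] + sum_m h_f[n,m] delta[n-m].

    h[n, m] and h_f[n, m] hold the response at time n to a unit sample
    applied m samples earlier; both are assumed to have shape (N, M).
    """
    n_samples, n_taps = h.shape
    delta = np.zeros(n_samples)
    delta[0] = 1.0  # idealized burst at n = 0
    x = np.zeros(n_samples)
    for n in range(n_samples):
        # causal signals: only terms with n - m >= 0 contribute
        for m in range(min(n_taps, n + 1)):
            x[n] += h[n, m] * u[n - m] + h_f[n, m] * delta[n - m]
    return x

# Sanity check: if h[n, m] does not depend on n, the sum reduces to
# ordinary convolution.
rng = np.random.default_rng(0)
u = rng.standard_normal(50)
h0 = np.array([1.0, 0.5, 0.25])
h = np.tile(h0, (50, 1))
x = time_varying_output(h, np.zeros_like(h), u)
```

Since the burst is an impulse at n = 0, its sum collapses to the single term h_f[n, n], that is, the front-cavity response at time n to the sample applied n samples earlier.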
Elements of a Language: Transitional Speech Sounds
Diphthongs:
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (/w/ as in "we" and /y/ as in "you")
  Liquids (/r/ as in "read" and /l/ as in "let")
Glides, compared to diphthongs, have:
  Greater constriction of the oral tract during the transition
  Greater speed of the oral tract movement
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
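Two of the prosodic quantities listed above, pitch and energy, can be tracked frame by frame. The following is a rough sketch, not a production pitch tracker: it assumes a clean, voiced input, uses a plain autocorrelation peak pick, and the function name, frame sizes, and pitch search range (60-400 Hz) are illustrative assumptions.

```python
import numpy as np

def frame_pitch_and_energy(x, fs, frame_len=640, hop=320,
                           f_min=60.0, f_max=400.0):
    """Per-frame pitch (autocorrelation peak) and energy (sum of squares)."""
    lag_min = int(fs / f_max)  # shortest pitch period searched, in samples
    lag_max = int(fs / f_min)  # longest pitch period searched, in samples
    pitches, energies = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        frame = frame - frame.mean()
        energies.append(float(np.sum(frame ** 2)))
        # autocorrelation at non-negative lags
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
        pitches.append(fs / lag)
    return np.array(pitches), np.array(energies)

# Synthetic test signal: pitch and loudness both rise halfway through.
fs = 16000
t = np.arange(fs // 2) / fs
x = np.concatenate([np.sin(2 * np.pi * 100 * t),
                    2.0 * np.sin(2 * np.pi * 200 * t)])
pitch, energy = frame_pitch_and_energy(x, fs)
```

On real speech, unvoiced frames have no meaningful autocorrelation peak, so a practical tracker would add a voicing decision before reporting a pitch value.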
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
[Figure: MATLAB plot of the utterance "She had your dark suit in greasy wash water all year". Panels: waveform (normalized magnitude vs. time [s], 0 to 3 s) and spectrogram (frequency [Hz], 0 to 8000 Hz, vs. time [s]), annotated with phone-level and word-level labels.]
Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Classification of speech sounds can also be done from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.
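The fourth perspective, the spectrogram, is a sequence of short-time spectra. A minimal sketch using NumPy is given below; the Hann window, the frame length of 256 samples, the hop of 128 samples, and the 16 kHz sampling rate are assumptions made for the illustration, not values taken from the slides.

```python
import numpy as np

def spectrogram(x, fs, frame_len=256, hop=128):
    """Short-time power spectrum: rows are time frames, columns are frequencies."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)             # Hz
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs   # frame centers, s
    return freqs, times, power

# A steady 1 kHz tone gives a single horizontal ridge at 1 kHz.
fs = 16000
t = np.arange(fs) / fs
freqs, times, power = spectrogram(np.sin(2 * np.pi * 1000 * t), fs)
peak_freqs = freqs[np.argmax(power, axis=1)]
```

A shorter frame gives finer time resolution and coarser frequency resolution; a longer frame does the opposite.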
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term "phone" to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.
Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features yield about 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are generally not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
Source:
  Quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System:
  Each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram:
  The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform:
  Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ between speakers; therefore the vocal tract formants will vary with the speaker.
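The dependence of formants on vocal tract length can be illustrated with a standard idealization that is not part of these slides: a uniform lossless tube, closed at the glottis and open at the lips, which resonates at odd multiples of the quarter-wavelength frequency, F_k = (2k - 1) c / (4 L). The speed of sound c = 350 m/s and the two tract lengths below are assumed values; a real vocal tract is not uniform, so this is only a first-order sketch.

```python
def uniform_tube_formants(length_m, n_formants=3, c=350.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other."""
    return [(2 * k - 1) * c / (4.0 * length_m)
            for k in range(1, n_formants + 1)]

# Two hypothetical speakers with different vocal tract lengths.
longer_tract = uniform_tube_formants(0.175)   # roughly 500, 1500, 2500 Hz
shorter_tract = uniform_tube_formants(0.14)   # every resonance is higher
```

Under this idealization, a shorter tract scales all resonances up by the same factor, which is one reason the same vowel has different formant frequencies for different speakers.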
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source:
  Quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract.
  Consequently, nasals have very little energy in the high-frequency range.
Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.
If the first two descriptors are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors are used to study speech sounds, the study is referred to as acoustic phonetics.
Elements of a Language
One broad classification of the sounds of English is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17.
Figure 3.17
Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound
Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
Rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking
  Time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.
Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variation:
    The place and degree of constriction of the tongue hump
    The cross-section and length of the vocal tract
  Therefore the vocal tract formants vary with the speaker.
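As a rough illustration of why formants shift with vocal tract length, the sketch below treats the tract as a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). The tube model, the 35000 cm/s speed of sound, and the two tract lengths are assumptions made for illustration only; they are not taken from the slides.

```python
# Assumed approximate speed of sound in the vocal tract, in cm/s.
SPEED_OF_SOUND_CM_PER_S = 35000.0


def uniform_tube_formants(length_cm, count=3):
    """Resonances (Hz) of a uniform lossless tube closed at one end, open at the other.

    The k-th resonance is (2k - 1) * c / (4 * L).
    """
    return [(2 * k - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
            for k in range(1, count + 1)]


# Hypothetical tract lengths for two different speakers.
longer_tract = uniform_tube_formants(17.5)
shorter_tract = uniform_tube_formants(14.0)

print("17.5 cm tube:", longer_tract)
print("14.0 cm tube:", shorter_tract)
```

In this idealized model, shortening the tube from 17.5 cm to 14 cm raises every resonance by the same factor of 1.25. Real vowels depart from the uniform tube because of the tongue constriction, so this shows only the length effect.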
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System:
  The location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram:
  Noise-like.
  Energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  $u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
  $x_g[n] = h[n] * (g[n] * p[n])$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the modulated noise in turn excites the front oral cavity, which has impulse response $h_f[n]$:
  $x_q[n] = h_f[n] * (q[n]\,u[n])$
In the simplified model we assume that the results of the two airflow sources add:
  $x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$
See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
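The simplified voiced-fricative model of Example 3.4 can be sketched numerically as follows. This is a minimal illustration, not the course's implementation: the glottal pulse g, the responses h and h_f, the pitch period, and the helper names (convolve, impulse_train, voiced_fricative) are all hypothetical toy choices.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples, starting at n = 0."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, h_f, period, length, seed=0):
    """Return (u, x_g, x_q, x) for the simplified additive model."""
    rng = random.Random(seed)
    p = impulse_train(length, period)
    u = convolve(g, p)[:length]                       # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]  # white-noise source q[n]
    x_g = convolve(h, u)[:length]                     # x_g[n] = h[n] * u[n]
    modulated = [q_n * u_n for q_n, u_n in zip(q, u)]  # q[n] u[n], a product
    x_q = convolve(h_f, modulated)[:length]           # x_q[n] = h_f[n] * (q[n] u[n])
    x = [a + b for a, b in zip(x_g, x_q)]             # the two components add
    return u, x_g, x_q, x


# Hypothetical toy values, chosen only to keep the example small.
g = [0.0, 0.5, 1.0, 0.5]   # one glottal pulse, shorter than the pitch period
h = [1.0, 0.6, 0.36]       # decaying vocal tract impulse response
h_f = [1.0, -0.9]          # front-cavity impulse response
u, x_g, x_q, x = voiced_fricative(g, h, h_f, period=8, length=32)
print([round(v, 3) for v in x[:8]])
```

In this toy run the noise component x_q is nonzero only around the samples where the glottal flow u is nonzero, which is the modulation of q[n] by u[n] described in the example.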
Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Whisper forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the onset of voicing of the vowel.
Figure 3.26 compares a voiced/unvoiced plosive pair.
Elements of a Language: Plosives - Waveform
Elements of a Language: Plosives - Spectrogram
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
  $u[n] = g[n] * p[n]$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
$h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
  $x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$
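The generalized convolution of Example 3.5 can be evaluated directly for short sequences. The sketch below is illustrative only: it assumes causal responses (zero for m < 0) and a source that is zero before n = 0, and the particular forms of h and h_f, like the function name time_varying_output, are hypothetical.

```python
def time_varying_output(h, h_f, u):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) delta[n - m].

    h(n, m) and h_f(n, m) give the response at time n to a unit sample
    applied m samples earlier. Causality is assumed, so m runs from 0 to n.
    """
    x = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m] for m in range(n + 1))
        burst = h_f(n, n)  # delta[n - m] keeps only the term m = n
        x.append(periodic + burst)
    return x


def h(n, m):
    """Hypothetical tract response whose decay drifts from 0.5 to 0.8 over 20 samples."""
    a = 0.5 + 0.3 * min(n, 20) / 20.0
    return a ** m


def h_f(n, m):
    """Hypothetical short front-cavity response (here it does not depend on n)."""
    return (-0.5) ** m if m < 4 else 0.0


# Simplified periodic source: an impulse train with pitch period 8.
u = [1.0 if n % 8 == 0 else 0.0 for n in range(24)]
x = time_varying_output(h, h_f, u)
print([round(v, 3) for v in x[:10]])

# Sanity check: with a response that ignores n, the sum reduces to ordinary convolution.
lti = time_varying_output(lambda n, m: 0.5 ** m, lambda n, m: 0.0, [1.0, 0.0, 0.0, 0.0])
print(lti)
```

When h does not depend on n the first sum is an ordinary convolution, which is why the second call returns the plain impulse response.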
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-vowels
  Two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").
  Glides, compared to diphthongs, involve a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced), /J/ as in "just" (voiced).
Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
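The idea that an articulator seeks a target but may never reach it can be illustrated with a toy first-order model, in which each time step closes a fixed fraction of the remaining distance to the current target. This model, the function name track_targets, and all numeric values are assumptions made for illustration. The sketch captures only the carry-over from past positions, not the anticipation of future sounds.

```python
def track_targets(targets, durations, rate=0.2, start=0.0):
    """Move one articulator position toward each target in turn.

    Every step closes a fixed fraction `rate` of the remaining distance,
    so the position cannot jump to a target instantaneously.
    """
    position = start
    trajectory = []
    for target, steps in zip(targets, durations):
        for _ in range(steps):
            position += rate * (target - position)
            trajectory.append(position)
    return trajectory


# The same two targets (1.0 then 0.0), spoken slowly and quickly.
slow = track_targets([1.0, 0.0], [20, 20])
fast = track_targets([1.0, 0.0], [5, 5])

peak_slow = max(slow)
peak_fast = max(fast)
print("closest approach to target 1.0, slow:", round(peak_slow, 4))
print("closest approach to target 1.0, fast:", round(peak_fast, 4))
```

With less time per target, the position turns back before reaching the first target, so a faster speaking rate produces more undershoot in this toy model.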
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
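Of the three prosodic cues, the energy contour is the simplest to measure. The sketch below computes a short-time energy contour on a synthetic signal; the frame length, hop size, sampling rate, and test signal are illustrative assumptions, and short-time energy is only a rough stand-in for perceived loudness.

```python
import math


def short_time_energy(signal, frame_length, hop):
    """Mean squared amplitude of each frame, stepping by `hop` samples."""
    energies = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frame = signal[start:start + frame_length]
        energies.append(sum(s * s for s in frame) / frame_length)
    return energies


# Synthetic example: a 100 Hz sinusoid sampled at 8 kHz whose amplitude
# doubles halfway through, as a crude stand-in for added emphasis.
fs = 8000
signal = [(0.5 if n < 800 else 1.0) * math.sin(2 * math.pi * 100 * n / fs)
          for n in range(1600)]

# 20 ms frames (two full periods of the sinusoid) with a 10 ms hop.
energy = short_time_energy(signal, frame_length=160, hop=80)
print([round(e, 3) for e in energy])
```

The contour steps from about 0.125 to about 0.5 where the amplitude doubles, with one intermediate value for the frame that straddles the change.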
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
April 22 2023 Veton Keumlpuska 75
Figure 317
April 22 2023 Veton Keumlpuska 76
Elements of a Language Phonemes arise from a combination of vocal fold and vocal
tract articulatory features Articulatory features (corresponding to the first 2 category
descriptors) include Vocal fold state
Vibrating or Open
Tongue position and height Front Central Back along the palate
Constriction Partial Complete
Velum state Nasal sound Not a nasal sound
April 22 2023 Veton Keumlpuska 77
Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number
11 in Polynesian 141 in the ldquoclickrdquo language of Khosian
Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)
The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators
The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different
Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
  A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
  Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n].
  The glottal flow velocity model for the periodic source component is given by
    u[n] = g[n] * p[n]
  where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.
  Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary (time-invariant) convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
  In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response hf[n, m].
  h[n, m] and hf[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
  Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:
    $$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
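The generalized convolution of Example 3.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sums are truncated to a causal, finite number of lags (`memory`), and the responses `h_demo` and `hf_demo`, as well as the impulse-train stand-in for the glottal flow, are hypothetical choices that do not come from the slides.

```python
def time_varying_filter(u, h, memory):
    """Return x[n] = sum over m of h(n, m) * u[n - m], for 0 <= m < memory.

    u      : list of input samples, treated as zero for n < 0
    h      : callable h(n, m), the response at time n to a unit sample
             applied m samples earlier (at time n - m)
    memory : number of lags m to include (assumed causal, finite truncation)
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n + 1, memory)):
            acc += h(n, m) * u[n - m]
        x.append(acc)
    return x


def voiced_plosive(u, h, hf, memory):
    """Add the glottal-flow branch and the burst branch (burst = delta[n])."""
    burst = [1.0] + [0.0] * (len(u) - 1)
    xg = time_varying_filter(u, h, memory)       # periodic source through h[n, m]
    xb = time_varying_filter(burst, hf, memory)  # burst through hf[n, m]
    return [a + b for a, b in zip(xg, xb)]


# Hypothetical responses, chosen only for illustration.
def h_demo(n, m):
    a = 0.5 + 0.4 * min(n, 50) / 50.0  # decay rate drifts with time n
    return a ** m


def hf_demo(n, m):
    return (-0.6) ** m                 # fixed front-cavity decay


# Stand-in for u[n] = g[n] * p[n]: a bare impulse train with period 10.
u_demo = [1.0 if n % 10 == 0 else 0.0 for n in range(60)]
x_demo = voiced_plosive(u_demo, h_demo, hf_demo, memory=20)
```

When `h(n, m)` does not depend on `n`, `time_varying_filter` reduces to ordinary convolution, which is a convenient sanity check on the generalization.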
Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
  Two categories of vowel-like sounds:
    Glides (w as in "we" and y as in "you")
    Liquids (r as in "read" and l as in "let")
  Glides, compared to diphthongs, have:
    Greater constriction of the oral tract during the transition
    Greater speed of the oral tract movement
Figure 3.28: Liquids and Glides
Elements of a Language: Transitional Speech Sounds
Affricates
  Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  Examples:
    tS as in "chew" (unvoiced)
    J as in "just" (voiced)
Coarticulation
  Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
  Coarticulation can occur on different temporal levels:
    Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
    Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
  The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
    Intonation (change in pitch)
    Amplitude/Energy (loudness)
    Timing (articulation rate or rhythm)
  These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29: Prosody
Figure 3.30: Global Coarticulation
Elements of a Language
  Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
  Articulatory features (corresponding to the first 2 category descriptors) include:
    Vocal fold state: vibrating or open
    Tongue position and height: front, central, or back along the palate
    Constriction: partial or complete
    Velum state: nasal sound or not a nasal sound
Elements of a Language
  In English, the combinations of features are such as to give 40 phonemes.
  Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
  The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
  A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
  The articulatory properties are influenced by:
    Adjacent phonemes
    Speaking rate
    Emphasis in speaking
    The time-varying nature of the articulators
  The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
  Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.
Elements of a Language: Vowels
  Source: quasi-periodic.
    Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
  System: each vowel phoneme corresponds to a different vocal tract configuration.
  Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
  In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
    Articulatory differences between speakers are one cause of allophonic variations.
    The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ across speakers; therefore the vocal tract formants will vary with speaker.
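To make the dependence of formants on vocal tract length concrete, here is a minimal sketch based on a standard idealization that is not part of these slides: a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at F_k = (2k - 1) c / (4 L). The function name, the speed of sound (350 m/s), and the two tract lengths are assumptions chosen only for illustration; real vowels use non-uniform tract shapes.

```python
def uniform_tube_formants(length_m, count=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform lossless tube, closed at one end and open at the other.

    length_m       : tube (vocal tract) length in metres
    count          : number of resonances to return
    speed_of_sound : assumed speed of sound in m/s
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, count + 1)]


# Two hypothetical tract lengths: the shorter tube has uniformly higher resonances.
longer = uniform_tube_formants(0.175)   # close to 500, 1500, 2500 Hz
shorter = uniform_tube_formants(0.145)  # same odd-multiple pattern, shifted upward
```

Under this idealization every resonance scales as 1/L, which is one simple way to see why the same vowel target yields different formant frequencies for speakers with different vocal tract lengths.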
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
  Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  System:
    The velum is lowered and the air flows mainly through the nasal cavity.
    Because the oral tract is constricted, the sound is radiated at the nostrils.
    Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  Spectrogram:
    Dominated by the low resonance of the large volume of the nasal cavity.
    The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
    Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.
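The effect of an anti-resonance can be illustrated with a toy pole-zero model: a resonance is represented by a conjugate pole pair and an anti-resonance by a conjugate zero pair of a discrete-time transfer function. The sketch below is an assumption-laden illustration, not the nasal tract model of the slides; the frequencies (300 Hz resonance, 1000 Hz anti-resonance), the radius 0.97, and the 8 kHz sampling rate are hypothetical values picked only to make the dip visible.

```python
import cmath
import math


def conjugate_pair(freq_hz, radius, fs_hz):
    """Return a complex-conjugate root pair at the given frequency and radius."""
    root = cmath.rect(radius, 2.0 * math.pi * freq_hz / fs_hz)
    return [root, root.conjugate()]


def magnitude_response(zeros, poles, freq_hz, fs_hz):
    """|H| at freq_hz for H(z) = prod(1 - q/z over zeros) / prod(1 - p/z over poles)."""
    z = cmath.exp(1j * 2.0 * math.pi * freq_hz / fs_hz)
    num = 1.0
    for q in zeros:
        num *= abs(1.0 - q / z)
    den = 1.0
    for p in poles:
        den *= abs(1.0 - p / z)
    return num / den


fs = 8000.0
nasal_poles = conjugate_pair(300.0, 0.97, fs)   # hypothetical low nasal resonance
oral_zeros = conjugate_pair(1000.0, 0.97, fs)   # hypothetical side-branch anti-resonance

at_resonance = magnitude_response(oral_zeros, nasal_poles, 300.0, fs)
at_antiresonance = magnitude_response(oral_zeros, nasal_poles, 1000.0, fs)
```

Evaluating the response with and without `oral_zeros` at 1000 Hz shows the zero pair pulling the magnitude down near its own frequency, which is the sense in which the side branch "absorbs" energy.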
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
  There are two broad classes of fricatives: voiced and unvoiced.
  Source:
    For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
    For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
    Noise is generated by turbulent airflow at some point of constriction along the oral tract.
    The constriction is narrower than with vowels.
  System:
    The location of the constriction determines which sound is produced: at the back, center, or front of the oral tract, or at the teeth or lips.
  Spectrogram:
    Noise-like.
    Energy is concentrated in the higher frequencies.
Example 3.4
  A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
    u[n] = g[n] * p[n]
  where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.
  Simplified model of the voiced fricative output at the lips due to the periodic source:
    xg[n] = h[n] * (g[n] * p[n])
  where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
  The noise source component, the turbulent airflow velocity at the constriction, is denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response hf[n]:
    xq[n] = hf[n] * (q[n] u[n])
Example 3.4 (continued)
  We assume in the simplified model that the results of the two airflow sources add:
    x[n] = xg[n] + xq[n]
         = h[n] * u[n] + hf[n] * (q[n] u[n])
  Here * denotes convolution and q[n] u[n] is a sample-by-sample product.
  See Exercise 3.10 for special characteristics of x[n].
  Issues that have been ignored:
    u[n] is modified by the oral cavity.
    xq[n] can be influenced by the back cavity.
    Sources of non-linear effects (distributed sources due to traveling vortices).
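The simplified voiced-fricative model of Example 3.4 can be sketched numerically as follows. This is an illustration under assumptions: the sequences passed for g[n], h[n], and hf[n] are placeholders supplied by the caller, the noise q[n] is drawn as unit-variance Gaussian white noise from a seeded generator, and all outputs are truncated to a fixed length. The function names are hypothetical.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, period, n_periods, h, hf, seed=0):
    """Return (u, x) with x[n] = h[n] * u[n] + hf[n] * (q[n] u[n]).

    g      : glottal flow over one cycle
    period : pitch period P in samples
    h, hf  : impulse responses of the vocal tract and of the front cavity
    """
    length = period * n_periods
    p = [1.0 if n % period == 0 else 0.0 for n in range(length)]
    u = convolve(g, p)[:length]                          # u[n] = g[n] * p[n]
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]     # assumed white noise q[n]
    modulated = [qn * un for qn, un in zip(q, u)]        # product q[n] u[n]
    xg = convolve(h, u)[:length]                         # periodic branch
    xq = convolve(hf, modulated)[:length]                # noise branch
    return u, [a + b for a, b in zip(xg, xq)]
```

Because the noise is multiplied by u[n] before filtering, the noise branch is silent wherever the glottal flow is zero, so the noise appears in bursts synchronized with the pitch period.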
Elements of a Language: Fricatives
  Spectrogram:
    Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  Waveform:
    An unvoiced fricative contains only noise.
    A voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
  Whisper forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24: Fricatives
Figure 3.23
Elements of a Language: Plosives
  Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
  As with fricatives, plosives can be voiced or unvoiced.
  System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
April 22 2023 Veton Keumlpuska 78
Elements of a Language Vowels Vowels
Source quasi-periodic Pitch (not important to categorize a sound in English however in
Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)
System Each vowel phoneme corresponds to a different vocal tract
configuration Spectrogram
The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)
Waveform Certain vowels properties are also seen in the speech waveform within a
pitch period (see Figure 319 in the slide after next)
In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic
variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker
April 22 2023 Veton Keumlpuska 79
Figure 318
April 22 2023 Veton Keumlpuska 80
Figure 319
April 22 2023 Veton Keumlpuska 81
Elements of a Language Nasals Nasals
Source Quasi-periodic airflow puffs from the vibrating vocal folds
System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the
nostrils Nasal consonants are distinguished by the place along the oral tract at
which the tongue makes a constriction (Figure 320) Spectrogram
Is dominated by the low resonance of the large volume of the nasal cavity
Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the
vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the
nasal tract Consequently nasals have very low energy in high-frequency range
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by
u[n] = g[n] * p[n]
where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the ordinary (time-invariant) convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n,m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response h_f[n,m].
h[n,m] and h_f[n,m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n-m.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
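The generalized convolution of Example 3.5 can be sketched numerically as follows. This is a minimal illustration, not the textbook's implementation: the function name time_varying_filter, the toy impulse responses, the gliding resonance frequency, and all numeric values are assumptions chosen only to make the sum over m concrete. The impulse responses are truncated to M taps and signals are taken as zero before n = 0.

```python
import numpy as np

def time_varying_filter(h, x_in):
    """Compute y[n] = sum_m h[n, m] * x_in[n - m] for h of shape (N, M)."""
    N, M = h.shape
    y = np.zeros(N)
    for n in range(N):
        m = np.arange(min(M, n + 1))        # only taps with n - m >= 0
        y[n] = np.dot(h[n, m], x_in[n - m])
    return y

# Hypothetical values, for illustration only.
N, M, P = 400, 60, 80
nn = np.arange(N)[:, None]                  # output time index n
mm = np.arange(M)[None, :]                  # lag index m

# Toy vocal-tract response whose resonance glides toward the vowel target.
omega = 0.6 - 0.3 * np.minimum(nn / 200.0, 1.0)
h = 0.95 ** mm * np.cos(omega * mm)
# Toy front-cavity response, held fixed over time in this sketch.
hf = 0.6 ** mm * np.cos(2.0 * mm) * np.ones((N, 1))

p = np.zeros(N)
p[::P] = 1.0                                # impulse train with pitch period P
u = np.convolve(np.hanning(30), p)[:N]      # u[n] = g[n] * p[n]
delta = np.zeros(N)
delta[0] = 1.0                              # burst idealized as an impulse at n = 0

# Linear combination of the periodic and burst contributions.
x = time_varying_filter(h, u) + time_varying_filter(hf, delta)
```

When every row of h is the same, the function reduces to ordinary convolution, which is a convenient sanity check on the indexing.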
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU)
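The idea of moving from one vowel target to another can be pictured as a trajectory of formant frequencies. The sketch below uses plain linear interpolation, which is an assumption made for simplicity (real articulator motion is smooth but not necessarily linear). The function name formant_glide and the (F1, F2) target values are hypothetical and chosen only for illustration.

```python
import numpy as np

def formant_glide(start_hz, end_hz, n_frames):
    """Linearly interpolate formant frequencies between two vowel targets."""
    start = np.asarray(start_hz, dtype=float)
    end = np.asarray(end_hz, dtype=float)
    t = np.linspace(0.0, 1.0, n_frames)[:, None]   # 0 at first target, 1 at second
    return (1.0 - t) * start + t * end

# Hypothetical (F1, F2) targets in Hz, illustrative only.
track = formant_glide([700.0, 1200.0], [300.0, 2200.0], n_frames=11)
```

Each row of track holds the formant frequencies for one analysis frame; the first and last rows are the two vowel targets.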
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
Two categories of vowel-like sounds:
Glides (w as in "we" and y as in "you")
Liquids (r as in "read" and l as in "let")
Glides, compared to diphthongs, have:
Greater constriction of the oral tract during the transition
Greater speed of the oral tract movement
Figure 3.28: Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced), J as in "just" (voiced)
Coarticulation
Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/Energy (loudness)
Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
Figure 3.29: Prosody
Figure 3.30: Global Coarticulation
Figure 3.18
Figure 3.19
Elements of a Language: Nasals
Source: Quasi-periodic airflow puffs from the vibrating vocal folds.
System: The velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
Is dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
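As a rough illustration of how a side-branch anti-resonance shapes the spectrum, the sketch below evaluates a toy transfer function with one resonance (a pole pair) and one anti-resonance (a zero pair). The 250 Hz and 1000 Hz locations, the radius 0.97, the 8 kHz sampling rate, and the function name pole_zero_response are all assumptions made for illustration; they are not measured values for any nasal.

```python
import numpy as np

def pole_zero_response(f_pole, f_zero, r, fs, n_freqs=512):
    """Magnitude response of one pole pair (resonance) and one zero pair (anti-resonance)."""
    freqs = np.linspace(0.0, fs / 2, n_freqs)
    z = np.exp(1j * 2 * np.pi * freqs / fs)      # points on the unit circle
    wp = 2 * np.pi * f_pole / fs
    wz = 2 * np.pi * f_zero / fs
    # A conjugate pair at radius r and angle w gives 1 - 2 r cos(w) z^-1 + r^2 z^-2.
    num = 1 - 2 * r * np.cos(wz) / z + r**2 / z**2
    den = 1 - 2 * r * np.cos(wp) / z + r**2 / z**2
    return freqs, np.abs(num / den)

# Hypothetical values: low nasal resonance with a higher oral-branch anti-resonance.
freqs, mag = pole_zero_response(f_pole=250.0, f_zero=1000.0, r=0.97, fs=8000.0)
peak_hz = freqs[np.argmax(mag)]                  # close to the resonance
dip_hz = freqs[np.argmin(mag)]                   # close to the anti-resonance
```

The response peaks near the pole frequency and shows a deep notch near the zero frequency, which is the kind of spectral dip the side-branch description above predicts.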
Figure 3.20
Figure 3.21
Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
Vocal folds are relaxed and not vibrating for unvoiced fricatives.
For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System: The location of the constriction (by the tongue at the back, center, or front of the oral tract, as well as at the teeth or lips) determines which sound is produced.
Spectrogram: Noise-like; energy is concentrated in the higher frequencies.
Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
u[n] = g[n] * p[n]
where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is
x_g[n] = h[n] * (g[n] * p[n])
where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:
x_q[n] = h_f[n] * (q[n] u[n])
We assume in this simplified model that the results of the two airflow sources add:
x[n] = x_g[n] + x_q[n]
     = h[n] * u[n] + h_f[n] * (q[n] u[n])
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
u[n] is modified by the oral cavity.
x_q[n] can be influenced by the back cavity.
Sources of non-linear effects (distributed sources due to traveling vortices).
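The simplified voiced-fricative model of Example 3.4 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name voiced_fricative, the Hann-shaped glottal pulse, the toy impulse responses h and hf, and all numeric values are hypothetical and chosen to make the structure of the model visible.

```python
import numpy as np

def voiced_fricative(g, P, h, hf, N, rng):
    """Return (u, xg, xq, x) for the simplified voiced-fricative model."""
    p = np.zeros(N)
    p[::P] = 1.0                          # impulse train with pitch period P
    u = np.convolve(g, p)[:N]             # u[n] = g[n] * p[n]
    q = rng.standard_normal(N)            # white-noise source q[n]
    xg = np.convolve(h, u)[:N]            # periodic component h[n] * u[n]
    xq = np.convolve(hf, q * u)[:N]       # noise modulated by u[n], shaped by hf[n]
    return u, xg, xq, xg + xq

# Hypothetical values, for illustration only.
N, P = 400, 80
g = np.hanning(30)                                        # one glottal pulse, shorter than P
h = 0.95 ** np.arange(60) * np.cos(0.3 * np.arange(60))   # toy vocal-tract response
hf = 0.6 ** np.arange(20) * np.cos(2.0 * np.arange(20))   # toy front-cavity response
u, xg, xq, x = voiced_fricative(g, P, h, hf, N, np.random.default_rng(0))
```

Because q[n] is multiplied by u[n] before filtering, the noise is only injected while the glottis is open, so the noise component is pitch-synchronous rather than stationary.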
Elements of a Language: Fricatives
Spectrogram: Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: An unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24: Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: Constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
April 22 2023 Veton Keumlpuska 82
Figure 320
April 22 2023 Veton Keumlpuska 83
Figure 321
April 22 2023 Veton Keumlpuska 84
Elements of a Language Fricatives There are two broad classes of fricatives
Voiced and Unvoiced
Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the
constriction Noise is generated by turbulent airflow at some point of constriction
along the oral tract Constriction is narrower than with vowels
System The location of the constriction by the tongue lips determines which
sound is produced Back Center or Front of the oral tract as well as The teeth or lips
Spectrogram Noise like Energy is concentrated in higher frequencies
April 22 2023 Veton Keumlpuska 85
Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal
flow component can be expressed as
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Voiced fricative simplified model of the output at the lips
xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]
Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]
xq[n] = hf[n](q[n]u[n])
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow
As with fricatives, plosives can be voiced or unvoiced
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24)
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst
With voiced plosives, the vocal folds vibrate for the duration of all 4 steps
During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar"
After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration
There is a much shorter delay between the burst and the voicing of the vowel onset
Figure 3.26 compares a voiced/unvoiced plosive pair
Elements of a Language: Plosives, Waveform
Elements of a Language: Plosives, Spectrogram
Elements of a Language: Plosives
Example 3.5 A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel
Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse \delta[n]
The glottal flow velocity model for the periodic source component is given by
u[n] = g[n] * p[n]
where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel
This implies that the vocal tract output cannot be obtained by the ordinary convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response h_f[n, m]
h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m
The output can then be written using a generalization of the convolution operator:
x[n] = \sum_{m} h[n, m] u[n - m] + \sum_{m} h_f[n, m] \delta[n - m]
We have assumed that the two outputs can be linearly combined
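The generalized convolution of Example 3.5, in which the output at time n is the sum over m of h[n, m] times the source at time n - m, can be sketched as follows. This is an illustration under stated assumptions, not the textbook's code: the helper names `time_varying_response` and `moving_resonator`, the resonance trajectories, and the "frozen" two-pole response used for each row of h[n, m] are all hypothetical choices.

```python
import numpy as np


def time_varying_response(h_nm, source):
    """Compute x[n] = sum_m h_nm[n, m] * source[n - m] for causal signals.

    h_nm has shape (N, M): row n holds the response at time n to a unit
    sample applied m samples earlier. source has length N.
    """
    n_samples, n_taps = h_nm.shape
    x = np.zeros(n_samples)
    for n in range(n_samples):
        m_max = min(n_taps, n + 1)      # keep only terms with n - m >= 0
        m = np.arange(m_max)
        x[n] = np.dot(h_nm[n, :m_max], source[n - m])
    return x


def moving_resonator(f_start, f_end, bandwidth_hz, fs, n_samples, n_taps):
    """Hypothetical h[n, m]: a two-pole resonance that glides from f_start to f_end."""
    theta = 2.0 * np.pi * np.linspace(f_start, f_end, n_samples) / fs
    r = np.exp(-np.pi * bandwidth_hz / fs)
    m = np.arange(n_taps)
    return (r ** m)[None, :] * np.sin((m[None, :] + 1) * theta[:, None]) / np.sin(theta[:, None])


fs, N, M, P = 8000, 800, 200, 80        # all values assumed for illustration

# Periodic source u[n] = g[n] * p[n] and burst source delta[n]
p = np.zeros(N)
p[::P] = 1.0
g = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(40) / 40))
u = np.convolve(g, p)[:N]
delta = np.zeros(N)
delta[0] = 1.0

h_nm = moving_resonator(300.0, 700.0, 100.0, fs, N, M)      # vocal tract
hf_nm = moving_resonator(2500.0, 3000.0, 400.0, fs, N, M)   # front cavity

# The two outputs are assumed to combine linearly
x = time_varying_response(h_nm, u) + time_varying_response(hf_nm, delta)
```

Two sanity checks follow from the definition: when every row of h[n, m] is the same, the result reduces to ordinary convolution, and the burst term picks out the samples h_f[n, n] because the impulse is nonzero only at time 0.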
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like nature, with vibrating vocal folds
Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations
Characterized by movement from one vowel target to another
Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new"
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
Two categories of vowel-like sounds: glides (w as in "we" and y as in "you") and liquids (r as in "read" and l as in "let")
Glides, compared to diphthongs, have
Greater constriction of the oral tract during the transition
Greater speed of the oral tract movement
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive
Examples: tS as in "chew" (unvoiced), J as in "just" (voiced)
Coarticulation
Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance
Coarticulation can occur on different temporal levels:
Local - articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs "horseshoe", "sweep" vs "seep"
Global - articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/Energy (loudness)
Timing (articulation rate or rhythm)
These rules are followed to convey differences in meaning, stress, and emotion
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
April 22 2023 Veton Keumlpuska 86
Example 34 We assume in simplified model that the results of the
two airflow sources add x[n] = xg[n] + xq[n]
= h[n]u[n] + hf[n](q[n]u[n])
See Exercise 310 for special characteristics of x[n]
Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due
to traveling vortices)
April 22 2023 Veton Keumlpuska 87
Elements of a Language Fricatives Spectrogram
Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while
Voiced fricatives often show both noise and harmonics
Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on
quasi-periodic signal
Elements of a Language Whisper Whisper
Forms a class of its own under general category of Consonants
Turbulent flow is produced at the glottis rather than at the vocal tract constriction
April 22 2023 Veton Keumlpuska 88
April 22 2023 Veton Keumlpuska 89
Figure 324 - Fricatives
April 22 2023 Veton Keumlpuska 90
Figure 323
April 22 2023 Veton Keumlpuska 91
Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however
brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced
System Constriction can occur at
Front Center or Back of the oral tract (Figure 324)
Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst
With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no
aspiration There is much shorter delay between the burst and the voicing of the vowel onset
Figure 326 compares voicedunvoiced plosive pair
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept
introduced in Chapter 2
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]
h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m
The output then can be written using generalization of the convolution operator
We have assumed that two outputs can be linearly combined
mnmnhmnumnhnxm
fm
April 22 2023 Veton Keumlpuska 95
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU)
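The idea of gliding between two vowel targets can be sketched as a linear interpolation of formant frequencies over time. The two-formant targets below are rough, illustrative values for an "a"-like and an "i"-like vowel (as in the diphthong of "hide"). The linear trajectory and the function name are assumptions made for this sketch, since real articulator movement is not linear.

```python
def interpolate_formants(start, end, steps):
    """Linearly glide each formant from `start` to `end` over `steps` frames.

    `steps` must be at least 2 so that both targets are included.
    """
    track = []
    for k in range(steps):
        t = k / (steps - 1)  # 0.0 at the first target, 1.0 at the second
        track.append(tuple((1 - t) * s + t * e for s, e in zip(start, end)))
    return track


# Rough illustrative (F1, F2) targets in Hz; not taken from the slides.
a_like = (730.0, 1090.0)
i_like = (270.0, 2290.0)
track = interpolate_formants(a_like, i_like, steps=5)
```

Each entry of `track` is one frame's (F1, F2) pair, so F1 falls while F2 rises as the configuration moves from the first vowel target toward the second.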
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
Two categories of vowel-like sounds: Glides (w as in "we" and y as in "you") and Liquids (r as in "read" and l as in "let").
Glides, compared to diphthongs, have greater constriction of the oral tract during the transition and greater speed of the oral tract movement.
Figure 3.28 - Liquids & Glides
Elements of a Language: Transitional Speech Sounds
Affricates
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced), J as in "just" (voiced)
Coarticulation
Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape, but often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
Local - the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: "horse" vs. "horseshoe", "sweep" vs. "seep".
Global - articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/Energy (loudness)
Timing (articulation rate or rhythm)
These rules are followed to convey differences in meaning, stress, and emotion.
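Of the three prosodic cues, the amplitude/energy contour is the simplest to compute. The sketch below measures the mean squared amplitude per frame. The frame length, hop size, and test signal are arbitrary illustrative choices, and energy is only a rough proxy for perceived loudness.

```python
def short_time_energy(x, frame_len, hop):
    """Mean squared amplitude of each full frame of `x`."""
    return [
        sum(s * s for s in x[i:i + frame_len]) / frame_len
        for i in range(0, len(x) - frame_len + 1, hop)
    ]


# Toy signal with three amplitude levels, one per frame.
signal = [0.5] * 4 + [1.0] * 4 + [2.0] * 4
energy = short_time_energy(signal, frame_len=4, hop=4)  # [0.25, 1.0, 4.0]
```

On real speech, the frames would typically overlap and span a few tens of milliseconds, so that the contour follows changes extending over more than one phoneme.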
Figure 3.29 - Prosody
Figure 3.30 - Global Coarticulation
Elements of a Language: Fricatives
Spectrogram: Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform: An unvoiced fricative contains only noise, while a voiced fricative contains noise superimposed on a quasi-periodic signal.
Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.
Figure 3.24 - Fricatives
Figure 3.23
Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
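The four-step sequence of events for a plosive can be laid out as a source (excitation) timeline. The sketch below is only an illustration for the unvoiced case. The sampling rate, the segment durations other than the 40-50 ms vowel-onset delay, the noise levels, and the function name are all assumptions, and no vocal tract filtering is applied.

```python
import random


def unvoiced_plosive_source(fs=16000, closure_ms=50, burst_ms=5,
                            vot_ms=45, vowel_ms=100, pitch_hz=100, seed=0):
    """Concatenate closure, burst, aspiration, and vowel excitation segments."""
    rng = random.Random(seed)

    def n_samples(ms):
        return fs * ms // 1000

    # Step 1: complete closure, modeled as silence while pressure builds up.
    closure = [0.0] * n_samples(closure_ms)
    # Step 2: release, modeled as a very short burst of turbulence noise.
    burst = [rng.uniform(-1.0, 1.0) for _ in range(n_samples(burst_ms))]
    # Step 3: aspiration, modeled as weaker noise until voicing starts.
    aspiration = [0.3 * rng.uniform(-1.0, 1.0)
                  for _ in range(n_samples(vot_ms - burst_ms))]
    # Step 4: vowel onset, modeled as a periodic impulse train, vot_ms after the burst.
    period = fs // pitch_hz
    vowel = [1.0 if n % period == 0 else 0.0
             for n in range(n_samples(vowel_ms))]
    return closure + burst + aspiration + vowel


source = unvoiced_plosive_source()
vowel_onset = 16000 * (50 + 45) // 1000   # sample index where voicing begins
```

With the default values, the first voicing pulse occurs 45 ms after the start of the burst, inside the 40-50 ms range quoted in step 4.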
April 22 2023 Veton Keumlpuska 92
Elements of a Language Plosives Waveform
April 22 2023 Veton Keumlpuska 93
Elements of a Language Plosives Spectrogram
April 22 2023 Veton Keumlpuska 94
Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive
Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by
u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the ordinary (time-invariant) convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n,m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n,m].
h[n,m] and h_f[n,m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n-m.
The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
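The generalized convolution in this example can be sketched numerically as follows. This is only a toy under stated assumptions: the responses `h_tract` and `h_front` are hypothetical one-pole decays chosen for illustration, the systems are taken to be causal, and the periodic source is reduced to a bare impulse train.

```python
def tv_filter(u, h, length):
    """Return x[n] = sum_m h(n, m) * u[n - m] for n = 0..length-1.

    u is treated as zero outside 0..len(u)-1, and only m >= 0 is summed
    (causal system). h(n, m) is the response at time n to a unit sample
    applied m samples earlier.
    """
    x = []
    for n in range(length):
        acc = 0.0
        for m in range(n + 1):
            k = n - m
            if k < len(u):
                acc += h(n, m) * u[k]
        x.append(acc)
    return x


def h_tract(n, m):
    # Hypothetical vocal tract: decay factor drifts with time n.
    a = min(0.9, 0.5 + 0.02 * n)
    return a ** m


def h_front(n, m):
    # Hypothetical front cavity: fixed, fast decay.
    return 0.3 ** m


N = 24
u = [1.0 if n % 8 == 0 else 0.0 for n in range(N)]  # stand-in periodic source
burst = [1.0]                                        # delta[n] at n = 0

x_voiced = tv_filter(u, h_tract, N)
x_burst = tv_filter(burst, h_front, N)
x = [a + b for a, b in zip(x_voiced, x_burst)]       # linear combination
```

When h(n, m) does not depend on n, `tv_filter` reduces to ordinary convolution, which is a convenient sanity check on the indexing.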
April 22 2023 Veton Këpuska 95
Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like nature with vibrating vocal folds.
Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: Y as in “hide”, W as in “out”, O as in “boy”, JU as in “new”.
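The idea of moving from one vowel target to another can be pictured as a formant track that glides between two configurations. The sketch below uses plain linear interpolation, which is a simplification, and the (F1, F2) targets are rough placeholder values in Hz chosen for illustration, not measurements from the slides.

```python
def glide(start, end, steps):
    """Linearly interpolate each formant from start to end over `steps` frames."""
    if steps < 2:
        raise ValueError("need at least two frames")
    track = []
    for i in range(steps):
        t = i / (steps - 1)          # 0.0 at the first frame, 1.0 at the last
        track.append(tuple((1 - t) * s + t * e for s, e in zip(start, end)))
    return track


# Placeholder (F1, F2) targets in Hz: an open back vowel and a close front vowel.
vowel_start = (700.0, 1100.0)
vowel_end = (300.0, 2300.0)

track = glide(vowel_start, vowel_end, 5)
for frame in track:
    print(frame)
```

No frame of the track sits at a steady configuration, which mirrors the description of a diphthong as continuous movement between targets.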
April 22 2023 Veton Këpuska 96
Elements of a Language: Transitional Speech Sounds
Semi-Vowels
Two categories of vowel-like sounds:
Glides (w as in “we” and y as in “you”)
Liquids (r as in “read” and l as in “let”)
Compared to diphthongs, glides have:
Greater constriction of the oral tract during the transition
Greater speed of the oral tract movement
April 22 2023 Veton Këpuska 97
Figure 3.28 – Liquids & Glides
April 22 2023 Veton Këpuska 98
Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in “chew” (unvoiced), J as in “just” (voiced).
April 22 2023 Veton Këpuska 99
Coarticulation
Vocal fold/vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
Local – articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time (“horse” vs. “horseshoe”, “sweep” vs. “seep”).
Global – articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
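The remark that a target is often never reached can be illustrated with a toy model in which an articulator closes a fixed fraction of its remaining distance to the current target at every step. This first-order lag, the `rate` value, and the function name `track_targets` are assumptions for illustration only; the model captures the influence of past positions but not anticipation of future targets.

```python
def track_targets(targets, durations, rate=0.3, start=0.0):
    """Move toward each target in turn, covering `rate` of the gap per step."""
    pos = start
    path = []
    for target, steps in zip(targets, durations):
        for _ in range(steps):
            pos += rate * (target - pos)   # cannot jump to the target instantly
            path.append(pos)
    return path


# Same two targets (1.0 then 0.0), spoken slowly and quickly.
slow = track_targets([1.0, 0.0], [20, 20])
fast = track_targets([1.0, 0.0], [3, 3])

peak_slow = max(slow)   # close to the target of 1.0
peak_fast = max(fast)   # falls well short: the target is never reached
print(round(peak_slow, 3), round(peak_fast, 3))
```

With only three steps per target, the articulator turns back before arriving, so the realized position depends on the neighboring targets as well as the current one.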
April 22 2023 Veton Këpuska 100
Prosody: The Melody of Speech
Prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/Energy (loudness)
Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.
April 22 2023 Veton Këpuska 101
Figure 3.29 - Prosody
April 22 2023 Veton Këpuska 102
Figure 3.30 – Global Coarticulation