Speech Processing

Production and Classification of Speech Sounds

April 22, 2023, Veton Këpuska

Introduction

Simplified view of speech production (see Figure 3.1 on the next slide):
- Lungs: act as a power supply and provide airflow to the larynx stage.
- Larynx: modulates the airflow and provides either a periodic, puff-like airflow or a noisy airflow to the vocal tract.
- Vocal tract: gives the modulated airflow its "color" (spectrally shaping the source) with the oral, nasal, and pharynx cavities.

Figure 3.1

Sound sources can also be generated by constrictions and boundaries made within the vocal tract itself, yielding a periodic, noisy, or impulsive airflow source.

Note that the speech production mechanism does not generate a perfectly periodic, impulsive, or noisy source.

Three general categories of source for speech sounds:
1. Periodic
2. Noisy
3. Impulsive

Each is illustrated in the word "shop": "sh" is noisy, "o" is periodic, and "p" is impulsive.

Example of "shop" (figure with three labeled regions: noise-like signal, periodic source, impulse source)

Distinguishable speech sounds are determined not only by the source but also by the vocal tract configuration, and by the combination of both.

Speech sound classes are referred to as phonemes. Phonemics is the discipline that studies phoneme realizations (e.g., in a language).
- Each phoneme class provides a certain meaning in a word.
- Within a phoneme class there exist many sound variations that provide the same meaning (allophones). The study of these sound variations is called phonetics.

Phonemes are the basic building blocks of a language. They are concatenated, more or less as discrete elements, into words according to certain phonemic and grammatical rules.

This chapter covers:
- a description of the speech production mechanism,
- the resulting variety of phonetic sound patterns,
- how these sounds differ among speakers.

Anatomy and Physiology of Speech Production

The anatomy of speech production is shown in Figure 3.2.

Lungs
- Inhale and exhale air.
- Are connected to the vocal tract through the trachea ("windpipe"), a pipe about 12 cm long and 1.5-2 cm in diameter, and the epiglottis.
- During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production. The duration of exhalation becomes roughly equal to the length of the sentence or phrase, and lung air pressure during this time is maintained at a constant level slightly above atmospheric pressure.

Larynx
- A complicated system of cartilages, flesh, muscles, and ligaments.
- Its primary function, in the context of speech production, is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
- The vocal folds are about 15 mm long in men and about 13 mm long in women.

Three primary states of the vocal folds:
- Breathing: the arytenoid cartilages are held outward.
- Voiced: the arytenoid cartilages are held close together.
- Unvoiced: the arytenoid cartilages are held outward or partially closed.

The complex motion of the vocal folds is illustrated in Figure 3.4 and is captured by the nonlinear two-mass model of Flanagan et al. (Figure 3.5).

Dictionary: arytenoid (ar·y·te·noid). Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, ladle. Date: circa 1751. 1: relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. 2: relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Also used as a noun.

Flanagan et al. model (Figure 3.5)

If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6:
- Closed phase: the folds are closed and no flow occurs.
- Open phase: the folds are open and the flow increases up to a maximum.
- Return phase: the time interval from the maximum airflow until glottal closure.

The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

The glottal airflow is referred to as the glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
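The reciprocal relation between pitch period and pitch can be made concrete with a small sketch. The function name, the sampling rate, and the 80-sample period below are illustrative assumptions, not values taken from the slides.

```python
def pitch_hz(pitch_period_samples: int, sample_rate_hz: float) -> float:
    """Return the pitch (fundamental frequency) for a pitch period given in samples."""
    if pitch_period_samples <= 0:
        raise ValueError("pitch period must be positive")
    period_seconds = pitch_period_samples / sample_rate_hz
    return 1.0 / period_seconds  # pitch is the reciprocal of the pitch period


# Hypothetical values: an 80-sample pitch period at a 10 kHz sampling rate
# lasts 8 ms, which corresponds to a pitch of about 125 Hz.
print(pitch_hz(80, 10_000.0))
```

A shorter pitch period gives a higher pitch, which is the relation the later examples rely on when they space harmonics at multiples of the fundamental frequency.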

Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n], \qquad p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP],$$

where $g[n]$ is the glottal flow waveform over a single cycle, $p[n]$ is an impulse train with spacing $P$, and $*$ denotes convolution.

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n]).$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, $\circledast$ denotes convolution in frequency, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency, or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.

Figure 3.7: Effect of the harmonics of the glottal waveform on the spectrum

Example 3.1 (continued)

- A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = \frac{2\pi}{P} k$.
- The first harmonic is also the fundamental frequency.
- At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
- The magnitude of the spectral shaping function, i.e., of the glottal flow, $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
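The dependence of harmonic spacing on pitch period can be checked numerically. The sketch below is only an illustration under stated assumptions: the raised-cosine pulse shape for $g[n]$, the 240-sample segment, and the pitch periods of 40 and 20 samples are all hypothetical choices, and a plain DFT stands in for the Fourier transform of the windowed segment.

```python
import cmath
import math


def windowed_spectrum(P, n_samples=240):
    """Magnitude DFT of a Hamming-windowed periodic pulse train u[n] = g[n] * p[n]."""
    half = P // 2
    # One cycle g[n]: a raised-cosine pulse lasting half a period (assumed shape).
    g = [0.5 - 0.5 * math.cos(2 * math.pi * n / half) for n in range(half)]
    # Convolving g[n] with an impulse train of spacing P repeats g every P samples.
    u = [g[n % P] if n % P < half else 0.0 for n in range(n_samples)]
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
         for n in range(n_samples)]
    seg = [wn * un for wn, un in zip(w, u)]
    return [
        abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples)))
        for k in range(n_samples // 2)
    ]


def first_harmonic_bin(P, n_samples=240):
    """DFT bin of the strongest spectral peak above the main lobe at omega = 0."""
    mag = windowed_spectrum(P, n_samples)
    guard = 4  # skip the window main lobe centered at omega = 0
    return max(range(guard, len(mag)), key=lambda k: mag[k])


for P in (40, 20):
    print(P, first_harmonic_bin(P))
```

With a 240-point DFT the first harmonic is expected at bin 240/P, so halving the pitch period should double the bin index and therefore the spacing between harmonics.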

The Fourier transform of a periodic glottal waveform is characterized by harmonics.
- Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
- The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:
- Pitch "jitter": the pitch period can vary "randomly" over successive periods.
- Amplitude "shimmer": the amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods.

These variations are (perhaps) due to:
- time-varying characteristics of the vocal tract and vocal folds,
- nonlinear behavior in the speech anatomy, or
- an underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is influenced by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
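Jitter and shimmer can be quantified in several ways; the slides do not give a formula. The sketch below assumes one common convention, the mean absolute difference between consecutive cycles divided by the mean value, and applies it to hypothetical pitch periods and peak amplitudes.

```python
def local_variation(values):
    """Mean absolute cycle-to-cycle difference, relative to the mean value.

    Applied to pitch periods this is one measure of jitter; applied to
    per-cycle peak amplitudes it is one measure of shimmer (assumed convention).
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical measurements for four consecutive glottal cycles.
pitch_periods_ms = [8.0, 8.2, 7.9, 8.1]
peak_amplitudes = [1.00, 0.95, 1.04, 0.98]

print("jitter :", local_variation(pitch_periods_ms))
print("shimmer:", local_variation(peak_amplitudes))
print("monotone source:", local_variation([8.0] * 5))  # no variation at all
```

A perfectly periodic source gives zero for both measures, which corresponds to the machine-like sound described above.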

States of the vocal folds:
- Breathing
- Voicing
- Unvoicing: turbulence at the vocal folds produces aspiration. Example: the "h" in "he" (whispered sounds).

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.

Other forms of atypical vocal fold movement:
- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Examples of atypical voice types (figure)

Vocal Tract

- Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
- The oral tract takes on many different lengths and cross-sections by moving the articulators: tongue, teeth, lips, and jaw.
- The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
- The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

Spectral Shaping

Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
- The frequency of the formant is $\omega_0$.
- The bandwidth depends on the distance of the pole from the unit circle ($r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
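The relation between a pole and a formant can be explored with a single conjugate pole pair. In this sketch the pole radius, the pole angle, and the 10 kHz sampling rate are hypothetical, and the bandwidth formula $B \approx -\ln(r_0) f_s / \pi$ is a standard textbook approximation that does not appear in the slides.

```python
import cmath
import math


def resonator_gain(r0, omega0, omega):
    """|H(e^{j omega})| for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)), c = r0 e^{j omega0}."""
    c = cmath.rect(r0, omega0)
    z_inv = cmath.exp(-1j * omega)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))


def peak_frequency(r0, omega0, n_grid=2000):
    """Frequency in [0, pi] at which the gain is largest, found on a uniform grid."""
    grid = [math.pi * i / n_grid for i in range(n_grid + 1)]
    return max(grid, key=lambda om: resonator_gain(r0, omega0, om))


def approx_bandwidth_hz(r0, fs_hz):
    """Approximate 3 dB bandwidth of the resonance (assumed approximation)."""
    return -math.log(r0) * fs_hz / math.pi


omega0 = 0.5  # hypothetical pole angle in radians per sample
for r0 in (0.90, 0.95, 0.98):
    print(r0, peak_frequency(r0, omega0), approx_bandwidth_hz(r0, 10_000.0))
```

The spectral peak sits close to, but not exactly at, the pole angle, and the resonance narrows as the pole approaches the unit circle, consistent with the word "approximately" in the text above.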

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, and so on.

In general, formant frequencies decrease as the vocal tract length increases:
- Male speakers tend to have lower formants than female speakers.
- Female speakers tend to have lower formants than children.

Under the assumptions that the vocal tract is linear and time-invariant, and that the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input with the vocal tract impulse response.
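The length dependence of the formants can be illustrated with a very crude model that is not part of the slides: a uniform, lossless tube closed at the glottis and open at the lips, whose resonances are the odd quarter-wavelength frequencies $F_k = (2k - 1)c / (4L)$. The speed of sound of 340 m/s and the 14 cm length are assumptions; only the 17 cm length comes from the text.

```python
def tube_formants_hz(length_m, n_formants=3, speed_of_sound=340.0):
    """Resonances of a uniform tube closed at one end and open at the other."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, n_formants + 1)]


# 17 cm is the average adult male length quoted above; 14 cm is a
# hypothetical shorter tract used only for comparison.
for length_m in (0.17, 0.14):
    print(length_m, [round(f) for f in tube_formants_hz(length_m)])
```

Under these assumptions the 17 cm tube resonates near 500, 1500, and 2500 Hz, and the shorter tube has every resonance shifted upward, which matches the direction of the trend stated above. Real vocal tracts are not uniform, so actual formants deviate from these values.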

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n],$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n]).$$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}.$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained:

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\circledast$ denotes convolution in frequency, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency, or pitch.

Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega) G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only the glottal contribution).

Figure 3.11

Example 3.2 (continued)

The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
- the nature of the glottal flow waveform over a cycle (e.g., a gradual or an abrupt closing), and
- the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega) G(\omega)|$ only sparsely. This is especially the case for high-pitched speech.
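The sparse sampling of the envelope by the harmonics can be sketched with a single-formant envelope. Everything numeric here is an assumption made for illustration: an 8 kHz sampling rate, one resonance at 500 Hz with pole radius 0.97, a flat glottal contribution, and pitches of 100 Hz and 350 Hz.

```python
import cmath
import math

FS_HZ = 8000.0       # assumed sampling rate
FORMANT_HZ = 500.0   # assumed single formant
R0 = 0.97            # assumed pole radius


def envelope(freq_hz):
    """|H| of a two-pole resonator at FORMANT_HZ (glottal contribution taken as flat)."""
    c = cmath.rect(R0, 2 * math.pi * FORMANT_HZ / FS_HZ)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / FS_HZ)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))


def strongest_harmonic(f0_hz):
    """Harmonic of f0 with the largest envelope value, a naive formant estimate."""
    n_harmonics = int((FS_HZ / 2) // f0_hz)
    harmonics = [k * f0_hz for k in range(1, n_harmonics + 1)]
    return max(harmonics, key=envelope)


for f0 in (100.0, 350.0):
    estimate = strongest_harmonic(f0)
    print(f0, estimate, abs(estimate - FORMANT_HZ))
```

At the low pitch a harmonic falls on the formant, so the strongest harmonic locates it well. At the high pitch the nearest harmonics straddle the formant and the estimate is off by a large fraction of the pitch, which is the difficulty described above.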

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
- A formant corresponds to a vocal tract pole (resonant frequency).
- Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

The nasal and oral components of the vocal tract are coupled by the velum.

When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and they are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "nose" (figure)

Spectral Shaping: "mouse" (figure)

Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

Two dominant effects characterize nasalization:
- broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage, and
- introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
- build-up of pressure behind the palate, followed by
- abrupt release of the pressure.

Plosives: "drop" (figure with the voice onset time, VOT, and the aspiration marked)

Fricatives

Another sound source is created when the tongue is very close to the palate without completely impeding the flow; the constriction generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
- There is no harmonic structure with these types of inputs.
- The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise was idealized by assuming a flat spectrum. In reality these sources themselves have a non-flat spectral shape.

Fricatives: "NASA" (figure)

There is another class of source that is generated within the vocal tract; it is, however, less understood than the noisy and impulsive sources at oral tract constrictions.
- This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source.

There is a variety of unvoiced sounds:
- Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
- Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
- Whispers: sounds for which a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, fricatives and plosives are not exclusively unvoiced. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
- Fricatives: "zebra" vs. "sheba"
- Plosives: "bin" vs. "pin"

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
- Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.

The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: the Hamming window,

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1.$$

The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n},$$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window time $\tau$.
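The windowing and transform steps above can be sketched in a few lines. This is a minimal illustration rather than an efficient implementation: the frequency axis is sampled at $\omega_k = 2\pi k / N$, the window is taken to start at $\tau$, and the test signal, window length, and frame interval are hypothetical.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2 pi n / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft(x, n_w, hop, n_freq):
    """X[frame][k] = sum_n w[n - tau] x[n] exp(-j 2 pi k n / n_freq), tau = frame * hop."""
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        frames.append([
            sum(w[m] * x[tau + m] * cmath.exp(-2j * math.pi * k * (tau + m) / n_freq)
                for m in range(n_w))
            for k in range(n_freq // 2 + 1)
        ])
    return frames


def spectrogram(x, n_w, hop, n_freq):
    """Squared STFT magnitude for every frame and frequency sample."""
    return [[abs(v) ** 2 for v in frame] for frame in stft(x, n_w, hop, n_freq)]


# Hypothetical test signal: a cosine whose frequency falls on DFT bin 8 of 64.
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(192)]
S = spectrogram(x, n_w=64, hop=32, n_freq=64)
peak_bins = [max(range(len(frame)), key=lambda k: frame[k]) for frame in S]
print(len(S), peak_bins)
```

The `hop` argument plays the role of the frame interval: the window advances 32 samples per frame here, not one sample at a time.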

The spectrogram is displayed graphically as

$$S(\omega, \tau) = |X(\omega, \tau)|^2.$$

$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.

For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
- Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
- Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Wide-band Spectrogram (figure)

Narrow-band Spectrogram (figure)

Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n]),$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2,$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
- Uses a "long" window with a duration of typically at least two pitch periods.
- Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the spectrogram $S(\omega, \tau) = \frac{1}{P^2} \left| \sum_k \hat{H}(\omega_k) W(\omega - \omega_k, \tau) \right|^2$ is approximated by (Exercise 3.8)

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2.$$

- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Wideband Spectrogram
- Defined by a "short" window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window is shorter than a pitch period, it "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Wideband Spectrogram (continued)

For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau],$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2.$$
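The role of the energy term $E[\tau]$ can be seen on a toy signal. The sketch assumes a decaying exponential repeated every pitch period as a stand-in for the periodically occurring sequence $\hat{h}[n]$; the 80-sample pitch period, the decay factor, and the 32-sample Hamming window are hypothetical.

```python
import math


def sliding_energy(x, n_w, hop=1):
    """E[tau] = sum_n |w[n - tau] x[n]|^2 for a Hamming window of length n_w."""
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]
    return [sum((w[m] * x[tau + m]) ** 2 for m in range(n_w))
            for tau in range(0, len(x) - n_w + 1, hop)]


P = 80                                        # assumed pitch period in samples
x = [0.9 ** (n % P) for n in range(4 * P)]    # decaying response restarted every period
E = sliding_energy(x, n_w=32)                 # window shorter than one pitch period

one_period = E[:P]
print(max(one_period) / min(one_period))      # energy fluctuates strongly within a period
print(abs(E[0] - E[P]))                       # and repeats every pitch period
```

Because the window is shorter than the pitch period, the energy it captures rises and falls once per period, which is the source of the vertical striations in a wideband spectrogram.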

Wideband Spectrogram (continued)
- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
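The two window durations quoted above can be compared against a pitch value. The 16 kHz sampling rate and the 200 Hz pitch are assumptions, as is the use of the usual approximation that a Hamming window of $N_w$ samples has a null-to-null main-lobe width of $8\pi / N_w$ radians, i.e. $4 f_s / N_w$ Hz.

```python
def window_report(duration_ms, pitch_hz, fs_hz=16000.0):
    """Summarize how a Hamming window of the given duration relates to the pitch."""
    n_w = round(duration_ms * 1e-3 * fs_hz)           # window length in samples
    pitch_periods = duration_ms * 1e-3 * pitch_hz     # pitch periods covered
    main_lobe_hz = 4.0 * fs_hz / n_w                  # approximate null-to-null width
    return {
        "samples": n_w,
        "pitch_periods": pitch_periods,
        "main_lobe_hz": main_lobe_hz,
        # Neighboring harmonics are pitch_hz apart; their main lobes do not
        # overlap only if the lobe is no wider than that spacing.
        "resolves_harmonics": main_lobe_hz <= pitch_hz,
    }


for duration_ms in (20.0, 4.0):
    print(duration_ms, window_report(duration_ms, pitch_hz=200.0))
```

Under these assumptions the 20-ms window spans about four pitch periods and keeps the harmonic main lobes apart, while the 4-ms window spans less than one pitch period and its main lobe covers several harmonics at once.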

Figure 3.15

Figure 3.16

MATLAB figures (shown twice in the slides): SPECTROGRAM, frequency [Hz] from 0 to 8000 versus time [s] from 0 to 3, and the normalized-magnitude waveform of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
- To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
- Different languages contain different phoneme sets.
- Syllables contain one or more phonemes.
- Words are formed from one or more syllables.
- Phrases are concatenations of words.

If the first two descriptors of speech sounds (the nature of the source and the shape of the vocal tract) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors (the time-domain waveform and the time-varying spectral characteristics) are used, the study is referred to as acoustic phonetics.

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17.

Figure 3.17

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
- Vocal fold state: vibrating or open.
- Tongue position and height: front, central, or back along the palate.
- Constriction: partial or complete.
- Velum state: nasal sound or not a nasal sound.

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
- 11 in Polynesian
- 141 in the "click" language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not normally allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the "t" in each word is somewhat different.

Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source
- Quasi-periodic.
- Pitch is not important for categorizing a sound in English. In tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.

System
- Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram
- The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform
- Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences among speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump and the cross-section and length of the vocal tract differ, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source
- Quasi-periodic airflow puffs from the vibrating vocal folds.

System
- The velum is lowered and the air flows mainly through the nasal cavity.
- Because the oral tract is constricted, the sound is radiated at the nostrils.
- Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram
- Dominated by the low resonance of the large volume of the nasal cavity.
- The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
- The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source
- For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
- For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
- Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System
- The location of the constriction made by the tongue or lips determines which sound is produced: at the back, center, or front of the oral tract, as well as at the teeth or lips.

Spectrogram
- Noise-like.
- Energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as

  $u[n] = g[n] * p[n]$

where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

  $x_g[n] = h[n] * (g[n] * p[n])$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise component comes from the turbulent airflow velocity source at the constriction, denoted by $q[n]$ and assumed to be white noise. The glottal flow $u[n]$ modulates (multiplies) the noise $q[n]$, and the product excites the front oral cavity, which has impulse response $h_f[n]$:

  $x_q[n] = h_f[n] * (q[n]\,u[n])$

In the simplified model we assume that the results of the two airflow sources add:

  $x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
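The two-source model of Example 3.4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the textbook's implementation. The pulse shape `g`, the toy impulse responses `h` and `h_f`, the pitch period, and the signal length are all hypothetical choices made only to have something to run.

```python
import math
import random


def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def decaying_resonance(length, freq, radius):
    """Toy impulse response r^n cos(2 pi f n); freq is in cycles per sample."""
    return [radius ** n * math.cos(2.0 * math.pi * freq * n) for n in range(length)]


N = 400  # number of output samples (assumed)
P = 80   # pitch period in samples (assumed)

# g[n]: toy glottal pulse over one cycle (raised-cosine bump, assumed shape)
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 40)) for n in range(40)]
# p[n]: impulse train with period P
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]
# u[n] = g[n] * p[n]: periodic glottal flow
u = convolve(g, p)[:N]

h = decaying_resonance(60, 0.05, 0.95)    # whole vocal tract (toy)
h_f = decaying_resonance(30, 0.30, 0.80)  # front oral cavity (toy)

rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(N)]  # white noise source q[n]

# The glottal flow multiplies (modulates) the noise before it excites h_f
modulated = [qn * un for qn, un in zip(q, u)]

x_g = convolve(h, u)[:N]            # periodic component
x_q = convolve(h_f, modulated)[:N]  # noise component
x = [a + b for a, b in zip(x_g, x_q)]  # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by `u`, it is switched off whenever the toy glottal flow is zero, so the noise component arrives in bursts at the pitch rate.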

Elements of a Language: Fricatives

Spectrogram
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  Whisper forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.

System
  The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.

Voiced plosives
  The vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration caused by propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  The delay between the burst and the onset of voicing in the vowel is much shorter.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

  $u[n] = g[n] * p[n]$

where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, because its shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator. It must be computed using the time-varying impulse response introduced in Chapter 2.

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n,m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response $h_f[n,m]$. Both $h[n,m]$ and $h_f[n,m]$ give the response at time $n$ to a unit sample applied $m$ samples earlier, at time $n-m$.

Assuming that the two outputs can be linearly combined, the output can be written with a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
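The generalized convolution of Example 3.5 can be written directly as a double loop. The Python sketch below is an illustration under assumptions, not the textbook's code. The helper `tv_filter`, the toy responses `h` and `h_f` (single resonances whose frequencies glide linearly over the signal), the pulse shape, and the truncation of each response to `MAX_LAG` samples are all hypothetical choices.

```python
import math


def tv_filter(h, x, max_lag):
    """y[n] = sum over m of h(n, m) * x[n - m], for a causal response
    truncated to max_lag samples (the truncation is an assumption)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for m in range(min(max_lag, n + 1)):
            acc += h(n, m) * x[n - m]
        y.append(acc)
    return y


N, P, MAX_LAG = 240, 60, 50  # signal length, pitch period, response length (assumed)


def glide(n, start, end, length=N):
    """Linear move from start to end over the signal (assumed trajectory)."""
    return start + (end - start) * n / (length - 1)


def h(n, m):
    # Toy vocal tract: one resonance moving from 0.04 to 0.08 cycles/sample
    return 0.95 ** m * math.cos(2.0 * math.pi * glide(n, 0.04, 0.08) * m)


def h_f(n, m):
    # Toy front cavity: one resonance moving from 0.30 to 0.25 cycles/sample
    return 0.80 ** m * math.cos(2.0 * math.pi * glide(n, 0.30, 0.25) * m)


# g[n]: toy glottal pulse; u[n] = g[n] * p[n] written out for len(g) < P
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * k / 30)) for k in range(30)]
u = [g[n % P] if n % P < len(g) else 0.0 for n in range(N)]

# Burst idealized as a unit impulse at n = 0
delta = [1.0] + [0.0] * (N - 1)

x_voiced = tv_filter(h, u, MAX_LAG)       # first sum in the output equation
x_burst = tv_filter(h_f, delta, MAX_LAG)  # second sum in the output equation
x = [a + b for a, b in zip(x_voiced, x_burst)]

# Sanity check: a response that ignores n reduces to ordinary convolution
fixed = tv_filter(lambda n, m: 0.5 ** m, [1.0, 2.0, 3.0], 3)
```

With the impulse as input, the second sum collapses to the single term $h_f[n,n]$, which is what `x_burst` holds for the first `MAX_LAG` samples.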

Elements of a Language: Transitional Speech Sounds

Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  They do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples (word, phoneme symbol): "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds

Semi-Vowels
  Two categories of vowel-like sounds:
    Glides (w as in "we" and y as in "you")
    Liquids (r as in "read" and l as in "let")

Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates
  Affricates are the consonant counterpart of diphthongs: they consist of plosive-fricative combinations.
  Unlike fricatives, affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  Examples:
    tS as in "chew" - unvoiced
    J as in "just" - voiced

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape, but often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future. The articulators at any time instant are therefore influenced both by where they have been and by where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local - the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  Global - the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 2: Speech Processing

April 22 2023 Veton Keumlpuska 2

Introduction Simplified view of Speech Production (see

Figure 31 in the next slide) Lungs ndash act as a power supply and provide

airflow to the larynx stage Larynx ndash modulates airflow and provides either

Periodic puff-like airflow or Noisy airflow to vocal tract

Vocal-tract ndash gives the modulated airflow its ldquocolorrdquo (spectrally shaping the source) with Oral Nasal and Pharynx cavities

April 22 2023 Veton Keumlpuska 3

Figure 31

April 22 2023 Veton Keumlpuska 4

Introduction Sound sources can also be generated by constrictions and

boundaries that are made within the vocal tract itself Periodic source Noisy source or Impulsive airflow source

Note that speech production mechanism does not generate a perfect periodic impulsive or noisy source

Three general categories of the source for speech sounds1 Periodic2 Noisy3 Impulsive

Illustration of each in the word ldquoshoprdquo ldquoshrdquo ndash noisy ldquoordquo ndash periodic ldquoprdquo - impulse

April 22 2023 Veton Keumlpuska 5

Example of ldquoShoprdquo

Noise like signal Periodic Source Impulse Source

>

April 22 2023 Veton Keumlpuska 6

Introduction Distinguishable speech sounds are determined

not only by source but also by different vocal tract configurations and combination of both

Speech sound classes are referred to as phonemes Phonemics is the discipline that studies phoneme realizations

(eg in a language) Each phoneme class provides a certain meaning in a word Within a phoneme class there exist many sound variations that

provide the same meaning (eg homonyms) The study of these sound variations is called phonetics

Phonemes are the basic building blocks of a language They are concatenated (more or less) as discrete elements into words According to a certain phonemic and grammatical rules

Introduction This chapter will cover

Description of speech production mechanism

Resulting variety of phonetic sound patterns

How these sounds differ among different speakers

April 22 2023 Veton Keumlpuska 7

Anatomy and Physiology of Speech Production

Introduction

April 22 2023 Veton Keumlpuska 8

April 22 2023 Veton Keumlpuska 9

Anatomy and Physiology of Speech Production Anatomy of speech production is shown in

Figure 32

Lungs Lungs

Inhalation and exhalation of air

Connected through trachea (ldquowindpiperdquo) and epiglottis to Vocal Tract ~12-cm-long and ~15-2-cm-diameter pipe

During the speaking rhythmical amp synchronized cycle of inhalation and exhalation changes to accommodate speech production Duration of exhalation becomes roughly equal to the

length of sentencephrase Lung air pressure during this time is maintained at a

constant level slightly above the atmospheric pressure

April 22 2023 Veton Keumlpuska 10

April 22 2023 Veton Keumlpuska 11

Anatomy and Physiology of Speech Production Larynx

Complicated system of cartilages flesh muscles and ligaments

Primary function (in context of speech production) is to control the vocal cords (vocal folds) as illustrated in Figure 33 Vocal folds are

~15 mm in men ~13 mm in women

Larynx

April 22 2023 Veton Keumlpuska 12

April 22 2023 Veton Keumlpuska 13

Anatomy and Physiology of Speech Production Three primary states of the vocal folds

Breathing ndash Arytenoid Cartilages are held outward

Voiced - Arytenoid Cartilages are held close together

Unvoiced ndash Arytenoid Cartilages are held outward or partially closed

Complex motion of the vocal folds illustrated in Figure 34

Nonlinear two-mass model of Flanagan et al (Figure 35)

Arytenoid armiddotymiddottemiddotnoid Pronunciation ˌa-rə-ˈtē-ˌnoid ə-ˈri-tən-ˌoid Function adjective Etymology New Latin arytaenoides from Greek arytainoeidēs literally ladle-shaped from arytaina ladle Date circa 1751 1 relating to or being either of two small laryngeal cartilages to which the vocal cords are attached 2 relating to or being either of a pair of small muscles or an unpaired muscle of the larynx mdash arytenoid noun

Dictionary

Anatomy and Physiology of Speech Production Flanagan et al

model

April 22 2023 Veton Keumlpuska 14

April 22 2023 Veton Keumlpuska 15

Anatomy and Physiology of Speech Production If one were to measure the airflow velocity at the glottis as a

function of time obtained waveform will be approximately similar to that of Figure 36 Closed phase folds are closed and no flow occurs Open phase folds are open and the flow increases up to a

maximum Return phase Time interval from the maximum air flow until the

glottal closure Specific flow shape can change with

Speaker Speaking style And specific speech sound

Glottal air-flow is referred to glottal flow

Time duration of one glottal cycle is referred to as the pitch period

Reciprocal of pitch period is referred to as pitch also as fundamental frequency

Anatomy and Physiology of Speech Production

April 22 2023 Veton Keumlpuska 16

April 22 2023 Veton Keumlpuska 17

Example 31 Consider a glottal flow waveform model of the form

u[n] = g[n]p[n]Where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P

Because the waveform is infinitely long a segment is extracted by multiplying u[n] by a short sequence called an analysis window or simply a window The window denoted by w[n] is centered at time as illustrated in Figure 37 ndash next slide and the resulting waveform segment is written as

u[n ] = w[n](g[n]p[n])Using Multiplication and Convolution Theorem of Chapter 2 the following expression in frequency domain is obtained

k

kPnnp ][][

kkGW

PU ][)()(1][

April 22 2023 Veton Keumlpuska 18

Example 31

kkk

kk

WGP

U

GWP

U

)()(1][

)()( )(1][

where W() is the Fourier transform of w[n] G() is the Fourier transform of g[n] k=(2P)k where 2P is the fundamental frequency or pitch

As illustrated in Figure 37 the Fourier transform of the window sequence is characterized by a narrow main lobe centered at =0 with lower surrounding side lobes

Effect of the harmonics of the glottal waveform on the spectrum

April 22 2023 Veton Keumlpuska 19

Figure 37

April 22 2023 Veton Keumlpuska 20

Example 31 Degrease in pitch period () causes increase () in the

spacing of harmonics of glottal waveform k=(2P)k First harmonic is also the fundamental frequency At each harmonic frequency there is a translated

window Fourier transform W(-k) weighted by G(k)

Magnitude of the spectral shaping function ie glottal flow |G(k)| is referred to as spectral envelope of the harmonics

April 22 2023 Veton Keumlpuska 21

Anatomy and Physiology of Speech Production Fourier transform of periodic glottal waveform is characterized by

harmonics Typically the spectral envelope of the harmonics (governed by the glottal

flow over tone cycle has on average a -12 dBoctave rolloff Rolloff is dependent on the nature of airflow and speaker characteristics See Exercise 318 for further details

The model in Example 31 is ideal in the sense that even for sustained voicing ndash a fixed pitch period is almost never maintained in time It can ldquorandomlyrdquo vary over successive periods ndash pitch ldquojitterrdquo Amplitude of the airflow velocity within a glottal cycle may differ across

consecutive pitch periods ndash amplitude ldquoshimmerrdquo

Those variations are due to (perhaps) Time-varying characteristics of the vocal tract and vocal folds Nonlinear behavior in the speech anatomy or Appear random while being the result of an underlying deterministic (chaotic)

system

Jitter and shimmer are one component that give the vowels its naturalness In contrast a monotone pitch and fixed amplitude results in a machine-like sound Voice character is determined by the extend of jitter and shimmer in voice (eg

hoarse voice)

April 22 2023 Veton Keumlpuska 22

Anatomy and Physiology of Speech Production States of Vocal Folds

Breathing Voicing Unvoicing ndash

Turbulence at the vocal folds ndash aspiration Example ldquoherdquo ndash whispered sounds

Aspiration occurs also with voiced sounds (breathy voice) Part of the vocal folds vibrate and part of it are nearly fixed

April 22 2023 Veton Keumlpuska 23

Anatomy and Physiology of Speech Production Other forms of atypical Vocal Fold movement

Creaky voice ndash very tense vocal folds with only a short portion of the folds oscillating Resulting in a voice that has High pitch and Irregular pitch

Vocal fry ndash focal folds are massy and relaxed resulting in a voice with an abnormally Low pitch Irregular pitch Characterized by secondary glottal pulses close to and

overlapping the primary glottal pulse Result of coupling of false vocal folds with true vocal folds

Diplophonic voice ndash secondary glottal pulses occur between the primary pulses within the closed phase (see Figure 39b and Figure 316)

April 22 2023 Veton Keumlpuska 24

Anatomy and Physiology of Speech Production

April 22 2023 Veton Keumlpuska 25

Examples of atypical voice types

April 22 2023 Veton Keumlpuska 26

Vocal Tract Comprised of the oral cavity

From larynx To the lips including the nasal passage ndash coupled to the oral tract by way of the

velum Oral tract takes on many different lengths and cross-

sections This is accomplished by moving the articulators Tongue Teeth Lips Jaw

Average length for a adult male is 17 cm and cross sectional area of up to 20 cm2

Purpose of vocal tract is to Spectrally ldquocolorrdquo the source and Generate new sources for sound production

April 22 2023 Veton Keumlpuska 27

Spectral Shaping Under a certain conditions the relation

between a glottal airflow velocity input and vocal tract airflow velocity output can be approximated by a linear filter with resonances

Resonance frequencies of the vocal tract are called formant frequencies or simply formants

Formants (resonance frequencies) change with different vocal tract configurations as depicted in Figure 310

April 22 2023 Veton Keumlpuska 28

Figure 310

April 22 2023 Veton Keumlpuska 29

Spectral Shaping The peaks of the spectrum of the vocal tract response

correspond approximately to its formants For a time-invariant all-pole linear system model of vocal tract

with a pole at z0=r0ej0 that corresponds approximately to a vocal tract formant Frequency of the formant is 0 Bandwidth is dependent on the distance from the unit circle (r0) Because the vocal tract is assumed stable (with poles inside the

unit circle) its transfer function can be expressed either in product or partial fraction expansion form

i

i

N

k kk

k

N

kkk

zczcAzH

zczc

AzH

111

1

11

)1)(1()(

)1)(1()(

April 22 2023 Veton Keumlpuska 30

Spectral Shaping Formants of the vocal tract are numbered from the

low to high formants according to their location F1 F2 etc

In general the formant frequencies degrease as the vocal tract length increases Male speakers tend to have lower formants than a

female Female speakers have lower formants than children

Under a vocal-tractrsquos Linearity and time-invariance assumption and When the sound source occurs at the glottis Then

The speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and vocal tract impulse response

Vowels

April 22 2023 Veton Keumlpuska 31

April 22 2023 Veton Keumlpuska 32

Example 32 Consider a periodic glottal flow source of the form

u[n]=g[n]p[n]

Where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n] the vocal tract output is given by

x[n]=h[n](g[n]p[n])

A window center at time w[n] is applied to the vocal tract output to obtain the speech segment

x[n]=w[n]h[n](g[n]p[n])

Using Multiplication and Convolution Theorems Fourier transform of the speech segment representing frequency domain representation is obtained

April 22 2023 Veton Keumlpuska 33

Example 32

Where W() is the Fourier transform of w[n] and k=(2P)k and (2P) is fundamental frequency or pitch Figure 311 (next slide) illustrates that the spectral shaping of the

windowed transform at the harmonics 1 2 hellip N is determined by the spectral envelope |H()G()| - consisting of Glottal and Vocal tract contributions

(unlike example 31 consisting only of glottal contribution)

kkkk

kk

WGHP

X

GHWP

X

)()()(1)(

)()()()(1)(

April 22 2023 Veton Keumlpuska 34

Example 32

April 22 2023 Veton Keumlpuska 35

Example 32 The general upward or downward slope of the spectral

envelope also called spectral tilt is influenced by The nature of the glottal flow waveform over a cycle

eg a gradual or abrupt closing and by The manner in which formant tails add

Note also from the figure 311 that the formant locations are not always clear from the short-time Fourier transform magnitude |X()| because of sparse sampling of the spectral envelope |H()G()| by the source harmonics This is especially the case for high pitched speech

April 22 2023 Veton Keumlpuska 36

Spectral Shaping Previous example is important because

It illustrates the difference between Formant (resonance frequency of vocal tract) and Harmonic frequency

A formant corresponds to the vocal tract pole (resonant frequency)

Harmonics arise due to the periodicity of glottal source (pitch)

In developing signal processing algorithms that require formants the scarcity of spectral information can perhaps be detriment to formant estimation

On the other hand the spectral sampling harmonics can be exploited to enhance perception of sound (as in singing voice)

April 22 2023 Veton Keumlpuska 37

Example 33 A soprano singer often signs a tone whose first

harmonic (fundamental frequency) (1) much higher than the first formant frequency (F1) of the vowel being sung As shown in the next figure (Figure 312) when the nulls of the vocal tract spectrum are sampled at the harmonics the resulting sound is weak especially in the face of competing instruments

To enhance the sound the singer creates a vocal tract configuration with a widened jaw which increases the first formant frequency (Exercise 34) and can match the frequency of the first harmonic thus generating a louder sound

April 22 2023 Veton Keumlpuska 38

Figure 312

Nasal Sounds

April 22 2023 Veton Keumlpuska 40

Spectral Shaping Nasal and oral components of the vocal tract are coupled

by the velum When the vocal tract velum is lowered ndash introducing

an opening into the nasal passage and Oral tract is shut off by the tongue or lipsSound propagates through the nasal passage and out

through the nose

The resulting sounds have a spectrum that is dominated by low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds Examples ldquonoserdquo and ldquomouserdquo

April 22 2023 Veton Keumlpuska 41

Spectral Shaping Nose

April 22 2023 Veton Keumlpuska 42

Spectral Shaping Mouse

April 22 2023 Veton Keumlpuska 43

Spectral Shaping Because the nasal cavity (unlike the oral tract) is

essentially constant characteristics of nasal sounds may be particularly useful in speaker identification

Velum can be lowered even when the vocal tract is open When this coupling occurs the resulting sound is said to be

nasalized (eg nasalized vowel) There are two dominant effects that characterize

nasalization Broadening of the formant bandwidth of oral tract because

of loss of energy through nasal passage Introduction of anti-resonances (ie zeros in the vocal tract

transfer function) due to the absorption of energy at the resonances of the nasal passage

Plosives

April 22 2023 Veton Keumlpuska 45

Source Generation In previous section the effect of vocal tract

shape in the sound production was discussed

In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure

April 22 2023 Veton Keumlpuska 46

Source Generation Plosives ldquoDroprdquo

VOT

Aspiration

Fricatives

April 22 2023 Veton Keumlpuska 48

Source Generation Another sound source is created when the tongue is

very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)

As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of

inputs The source spectrum is shaped at all frequencies by |H()|

Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape

April 22 2023 Veton Keumlpuska 49

Source Generation Fricatives ldquoNASArdquo

April 22 2023 Veton Keumlpuska 50

Source Generation There is another class of the source type that is

generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices

with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract

Vortex can be thought off as a tiny rotational airflow in the oral tract

There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal

source Unvoiced Speech sounds not generated with periodic

glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the

moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral

tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing

the vocal folds but without oscillations Example ldquoherdquo

However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example

ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech Speech waveform consists of a sequence of

different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic

signal of the word ldquotordquo cannot capture this time-varying frequency content

In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding

(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to

avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum

Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1

Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal

wherex[n]= w[n]x[n]

represents the windowed speech segments as function of the window center at time

n

njenxX ][)(

April 22 2023 Veton Keumlpuska 56

Spectrographic Analysis of Speech The spectrogram is graphically displayed as

S() = |X()|2

S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal

For each window position one could plot S() A better and more compact representation of time-frequency

display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page

This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms

Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies

Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the

output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]

x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]

Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as

frequency fundametal theis2 and 2 whereand

)()()(~where

)()(~1)(2

2

Pk

P

GHH

WHP

S

k

kk

k

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the analysis window $w[n]$.

Narrowband Spectrogram

Uses a "long" window with a duration of typically at least two pitch periods. Under the conditions that

• the main lobes of the shifted window Fourier transforms are non-overlapping, and

• the corresponding transform side lobes are negligible,

the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)

• Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

• The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech
Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).

• Shortening the window widens its Fourier transform (recall the uncertainty principle).

• The widened window transforms centered at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

• From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)

For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
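The role of the short-time energy can be illustrated numerically. The following is a minimal sketch, not taken from the slides: it assumes a rectangular sliding window and a toy "voiced" signal made of a decaying pulse repeated every `P` samples (both hypothetical choices), and compares the energy under a window shorter than one pitch period with the energy under a window spanning two pitch periods.

```python
def short_time_energy(x, win_len):
    """Energy of x under a sliding rectangular window of win_len samples."""
    return [sum(v * v for v in x[start:start + win_len])
            for start in range(len(x) - win_len + 1)]


# Hypothetical voiced-like signal: a decaying pulse repeated every P samples.
P = 80                                  # pitch period in samples (assumed)
pulse = [0.9 ** n for n in range(P)]    # toy response within one pitch period
x = pulse * 5                           # five pitch periods

short = short_time_energy(x, P // 4)    # window shorter than one pitch period
long_ = short_time_energy(x, 2 * P)     # window spanning two pitch periods

short_ratio = max(short) / min(short)   # large: energy rises and falls every period
long_ratio = max(long_) / min(long_)    # close to 1: fluctuation is averaged out
print(short_ratio, long_ratio)
```

With the short window the energy swings widely within each pitch period, which is the kind of fluctuation a wideband display shows as vertical striations; with the long window it stays essentially constant.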

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)

• Shows the formants of the vocal tract in frequency.

• Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
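The two window lengths quoted above can be compared with a back-of-the-envelope calculation. The sketch below rests on two assumptions that are not in the slides: the common rule of thumb that a Hamming window of duration T has a main lobe about 4/T Hz wide (null to null), and a hypothetical speaker with a 125-Hz pitch. Harmonics are counted as "resolved" when the first null of each main lobe falls at or before the neighboring harmonic, which is a looser test than strictly non-overlapping main lobes.

```python
def mainlobe_width_hz(window_ms):
    """Approximate null-to-null main-lobe width (Hz) of a Hamming window.

    Uses the rule of thumb width ~ 4 / duration (an assumption of this sketch).
    """
    return 4000.0 / window_ms


def harmonics_resolved(window_ms, pitch_hz):
    """True when half the main-lobe width does not exceed the harmonic spacing."""
    return mainlobe_width_hz(window_ms) / 2.0 <= pitch_hz


pitch_hz = 125.0  # hypothetical pitch (8-ms pitch period)
for name, window_ms in (("narrowband", 20), ("wideband", 4)):
    print(name, mainlobe_width_hz(window_ms), "Hz main lobe; resolved:",
          harmonics_resolved(window_ms, pitch_hz))
```

Under these assumptions the 20-ms window separates the harmonics while the 4-ms window merges them, consistent with the narrowband and wideband behavior described above.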

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) for the utterance "She had your dark suit in greasy wash water all year", labeled with words and phones: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr), with an "h" label at the start and end.]

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) for the utterance "She had your dark suit in greasy wash water all year", labeled with words and phones: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr), with an "h" label at the start and end.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.

2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.

3. The time-domain waveform, which gives the pressure change with time at the lips output.

4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

• Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

• Different languages contain different phoneme sets.

• Syllables contain one or more phonemes.

• Words are formed from one or more syllables.

• Phrases are concatenations of words.

If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:

• Vocal fold state: vibrating or open.

• Tongue position and height: front, central, or back along the palate.

• Constriction: partial or complete.

• Velum state: nasal sound or not a nasal sound.

Elements of a Language

• In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

• The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

• A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).

• The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

• The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

• Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

Vowels

• Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.

• System: each vowel phoneme corresponds to a different vocal tract configuration.

• Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

• Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variation:

• the place and degree of constriction of the tongue hump, and

• the cross-section and length of the vocal tract.

The vocal tract formants therefore vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Nasals

• Source: quasi-periodic airflow puffs from the vibrating vocal folds.

• System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

• Spectrogram: dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

• Source: for unvoiced fricatives the vocal folds are relaxed and not vibrating; for voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

• System: the location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

• Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

Example 3.4 (cont.)

In the simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:

• $u[n]$ is modified by the oral cavity.

• $x_q[n]$ can be influenced by the back cavity.

• Sources of non-linear effects (distributed sources due to traveling vortices).
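The voiced fricative model of Example 3.4 can be sketched in a few lines of code. Everything numeric below is a hypothetical choice made for illustration (the toy glottal pulse `g`, the decaying responses `h` and `hf`, the pitch period `P`, and Gaussian white noise for `q`); only the structure, a periodic branch plus a glottal-flow-modulated noise branch, follows the example.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


random.seed(0)
P, n_periods = 40, 4                  # assumed pitch period and number of cycles
N = P * n_periods

g = [1.0, 0.8, 0.5, 0.2]                              # toy glottal pulse g[n]
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]    # impulse train p[n]
u = convolve(g, p)[:N]                                # u[n] = g[n] * p[n]

h = [0.7 ** n for n in range(20)]                     # toy vocal tract response h[n]
hf = [(-0.5) ** n for n in range(10)]                 # toy front-cavity response hf[n]
q = [random.gauss(0.0, 1.0) for _ in range(N)]        # white noise q[n]

x_g = convolve(h, u)[:N]                              # periodic component xg[n]
modulated = [qn * un for qn, un in zip(q, u)]         # noise modulated by glottal flow
x_q = convolve(hf, modulated)[:N]                     # noise component xq[n]
x = [a + b for a, b in zip(x_g, x_q)]                 # x[n] = xg[n] + xq[n]
```

In this sketch the noise component is nonzero only while the glottal flow is nonzero, so the noise appears in bursts at the pitch rate, while the periodic component repeats exactly every `P` samples.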

Elements of a Language: Fricatives

• Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

• Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper

• Forms a class of its own under the general category of consonants.

• Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:

1. Complete closure of the oral tract and buildup of air pressure.

2. Release of air pressure and generation of turbulence over a very short time duration.

3. Generation of aspiration due to turbulence at the open vocal folds.

4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.

• During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".

• After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.

• There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives
Waveform

Elements of a Language: Plosives
Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$. Here $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
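The generalized convolution of Example 3.5 can be written out directly. The sketch below is illustrative only: the helper name `time_varying_output`, the causal finite-length responses, the drifting decay rate in `h`, and the fixed response in `hf` are all assumptions made for the example, and the glottal pulses are reduced to unit impulses for brevity.

```python
def time_varying_output(u, h_of):
    """y[n] = sum_m h[n, m] u[n - m], where h_of(n) returns the list h[n, 0..M-1]."""
    y = []
    for n in range(len(u)):
        h_n = h_of(n)
        y.append(sum(h_n[m] * u[n - m] for m in range(len(h_n)) if n - m >= 0))
    return y


N, P = 120, 40                                        # assumed length and pitch period
u = [1.0 if n % P == 0 else 0.0 for n in range(N)]    # toy periodic source u[n]
burst = [1.0] + [0.0] * (N - 1)                       # burst idealized as delta[n]


def h(n):
    """Toy vocal tract: one decaying mode whose decay rate drifts with time n."""
    a = 0.5 + 0.3 * n / N
    return [a ** m for m in range(30)]


def hf(n):
    """Toy front-cavity response (kept fixed over time for simplicity)."""
    return [(-0.6) ** m for m in range(10)]


x_periodic = time_varying_output(u, h)
x_burst = time_varying_output(burst, hf)
x = [a + b for a, b in zip(x_periodic, x_burst)]      # the two outputs add linearly
```

When `h_of` returns the same list for every `n`, the helper reduces to ordinary convolution, which is a quick sanity check on the generalization.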

Elements of a Language: Transitional Speech Sounds

Diphthongs

• Vowel-like in nature, with vibrating vocal folds.

• Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations.

• Characterized by movement from one vowel target to another.

• Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds:

• Glides (w as in "we" and y as in "you"), and

• Liquids (r as in "read" and l as in "let").

Compared to diphthongs, glides have:

• greater constriction of the oral tract during the transition, and

• greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:

• tS as in "chew" (unvoiced)

• J as in "just" (voiced)

Coarticulation

• Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.

• Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.

• Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:

• Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").

• Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:

• intonation (change in pitch),

• amplitude/energy (loudness), and

• timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Introduction

Distinguishable speech sounds are determined not only by the source but also by different vocal tract configurations, and by combinations of both.

• Speech sound classes are referred to as phonemes. Phonemics is the discipline that studies phoneme realizations (e.g., in a language).

• Each phoneme class provides a certain meaning in a word. Within a phoneme class there exist many sound variations that provide the same meaning (allophones). The study of these sound variations is called phonetics.

• Phonemes are the basic building blocks of a language. They are concatenated (more or less) as discrete elements into words, according to certain phonemic and grammatical rules.

Introduction

This chapter will cover:

• a description of the speech production mechanism,

• the resulting variety of phonetic sound patterns, and

• how these sounds differ among different speakers.

Anatomy and Physiology of Speech Production

Introduction


Anatomy and Physiology of Speech Production

The anatomy of speech production is shown in Figure 3.2.

Lungs

• Inhalation and exhalation of air.

• Connected through the trachea ("windpipe") and epiglottis to the vocal tract. The trachea is a ~12-cm-long, ~1.5-2-cm-diameter pipe.

• During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production: the duration of exhalation becomes roughly equal to the length of the sentence or phrase, and lung air pressure during this time is maintained at a constant level slightly above atmospheric pressure.

Anatomy and Physiology of Speech Production

Larynx

• A complicated system of cartilages, flesh, muscles, and ligaments.

• Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3. The vocal folds are ~15 mm long in men and ~13 mm long in women.

Larynx

Anatomy and Physiology of Speech Production

Three primary states of the vocal folds:

• Breathing: the arytenoid cartilages are held outward.

• Voiced: the arytenoid cartilages are held close together.

• Unvoiced: the arytenoid cartilages are held outward or partially closed.

The complex motion of the vocal folds is illustrated in Figure 3.4.

Nonlinear two-mass model of Flanagan et al. (Figure 3.5).

Dictionary

Arytenoid (ar·y·te·noid). Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina "ladle". Date: circa 1751.

1. Relating to or being either of two small laryngeal cartilages to which the vocal cords are attached.

2. Relating to or being either of a pair of small muscles or an unpaired muscle of the larynx.

Also: arytenoid, noun.

Anatomy and Physiology of Speech Production

Flanagan et al. model

Anatomy and Physiology of Speech Production

If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6:

• Closed phase: the folds are closed and no flow occurs.

• Open phase: the folds are open and the flow increases up to a maximum.

• Return phase: the time interval from the maximum airflow until the glottal closure.

The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

The glottal airflow is referred to as the glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
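The reciprocal relation between pitch period and pitch is easy to check numerically. In the sketch below the function names and the 8-kHz sampling rate are hypothetical; the second helper gives the same quantity as a discrete-time radian frequency, 2π/P, for a period of P samples.

```python
import math


def pitch_hz(period_samples, sample_rate_hz):
    """Pitch (fundamental frequency) in Hz as the reciprocal of the pitch period."""
    return sample_rate_hz / period_samples


def fundamental_rad(period_samples):
    """Fundamental frequency in radians per sample for a period of P samples."""
    return 2.0 * math.pi / period_samples


# An 80-sample pitch period at an assumed 8-kHz sampling rate lasts 10 ms.
print(pitch_hz(80, 8000))      # 100.0
print(fundamental_rad(80))
```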

Anatomy and Physiology of Speech Production

Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$ as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$

Example 3.1 (cont.)

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$

$$U(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega)$ is the Fourier transform of $w[n]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = \frac{2\pi}{P} k$, where $2\pi/P$ is the fundamental frequency or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum:

Figure 3.7

Example 3.1 (cont.)

• A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = \frac{2\pi}{P} k$.

• The first harmonic is also the fundamental frequency.

• At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.

• The magnitude of the spectral shaping function, i.e. of the glottal flow transform $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
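The harmonic structure described in Example 3.1 can be reproduced with a small discrete Fourier transform. The sketch below makes several simplifying assumptions that are not in the slides: a toy four-sample glottal pulse, a rectangular window, and a segment that spans exactly eight pitch periods, so that each translated window transform is zero at every DFT bin except its own harmonic.

```python
import cmath
import math


def dft_magnitude(x):
    """Magnitude of the N-point DFT of x, computed directly from the definition."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N)]


P, N = 16, 128                          # assumed pitch period and segment length
g = [1.0, 0.6, 0.3, 0.1]                # toy glottal pulse g[n]
# u[n] = g[n] * p[n]: the pulse repeated every P samples.
u = [g[n % P] if n % P < len(g) else 0.0 for n in range(N)]

mag = dft_magnitude(u)
harmonic_bins = [k for k in range(N // 2 + 1) if mag[k] > 1e-6]
print(harmonic_bins)                    # multiples of N / P
```

The nonzero bins are spaced N/P apart, and their heights follow the magnitude of the pulse transform, which plays the role of the spectral envelope. Halving `P` would double the spacing between the nonzero bins.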

Anatomy and Physiology of Speech Production

• The Fourier transform of the periodic glottal waveform is characterized by harmonics.

• Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has on average a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:

• The pitch period can vary "randomly" over successive periods (pitch "jitter").

• The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods (amplitude "shimmer").

These variations are (perhaps) due to:

• time-varying characteristics of the vocal tract and vocal folds,

• nonlinear behavior in the speech anatomy, or

• an underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
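Jitter and shimmer can be mimicked by perturbing an otherwise ideal impulse train. The sketch below is a toy model under stated assumptions: the function name `pulse_train`, the uniform random perturbations, and the 2% and 5% deviations are all hypothetical choices, not a description of how real voices vary.

```python
import random


def pulse_train(n_periods, period, jitter=0.0, shimmer=0.0, seed=0):
    """Impulse train whose period and amplitude are randomly perturbed.

    jitter  -- maximum relative deviation of each pitch period
    shimmer -- maximum relative deviation of each pulse amplitude
    """
    rng = random.Random(seed)
    positions, amplitudes, t = [], [], 0
    for _ in range(n_periods):
        positions.append(t)
        amplitudes.append(1.0 + shimmer * rng.uniform(-1.0, 1.0))
        t += max(1, round(period * (1.0 + jitter * rng.uniform(-1.0, 1.0))))
    signal = [0.0] * (positions[-1] + 1)
    for pos, amp in zip(positions, amplitudes):
        signal[pos] = amp
    return signal, positions


ideal, pos_ideal = pulse_train(5, 80)                          # machine-like source
natural, pos_nat = pulse_train(5, 80, jitter=0.02, shimmer=0.05)
print(pos_ideal)
print(pos_nat)
```

With both parameters at zero the train is perfectly periodic; small nonzero values move each pulse slightly in time and scale its height.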

Anatomy and Physiology of Speech Production

States of the vocal folds: breathing, voicing, and unvoicing.

• Unvoicing: turbulence at the vocal folds is called aspiration. Example: "he"; whispered sounds.

• Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:

• Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.

• Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, the result of coupling of the false vocal folds with the true vocal folds.

• Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract

• Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.

• The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

• The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

• The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

Spectral Shaping

• Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

• The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

• Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

• The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

• Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant. The frequency of the formant is $\omega_0$, and its bandwidth depends on the distance of the pole from the unit circle (set by $r_0$).

• Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
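The link between a pole location and a formant can be made concrete. The sketch below maps a pole to a formant frequency from its angle and to a bandwidth from its radius; the bandwidth uses the common approximation B ≈ -ln(r0)·fs/π, which is an assumption of this sketch rather than a formula from the slides, and the sampling rate and pole values are hypothetical.

```python
import cmath
import math


def formant_from_pole(pole, sample_rate_hz):
    """Map a z-plane pole r0*exp(j*w0) to (formant frequency, bandwidth) in Hz.

    The bandwidth is the common approximation -ln(r0) * fs / pi (assumed here).
    """
    r0, w0 = abs(pole), cmath.phase(pole)
    freq_hz = w0 * sample_rate_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * sample_rate_hz / math.pi
    return freq_hz, bandwidth_hz


fs = 8000.0                                                     # assumed sampling rate
wide = formant_from_pole(0.97 * cmath.exp(1j * 2.0 * math.pi * 500.0 / fs), fs)
narrow = formant_from_pole(0.99 * cmath.exp(1j * 2.0 * math.pi * 500.0 / fs), fs)
print(wide)
print(narrow)
```

Both poles give the same formant frequency, but the pole closer to the unit circle gives the narrower bandwidth, in line with the dependence on distance from the unit circle noted above.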

Spectral Shaping

• The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

• In general, the formant frequencies decrease as the vocal tract length increases: male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.

• Under the vocal tract's linearity and time-invariance assumption, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
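The trend of formants falling as the vocal tract lengthens can be illustrated with a textbook idealization that does not appear in the slides: a uniform lossless tube, closed at the glottis and open at the lips, whose resonances are the odd quarter-wavelength frequencies (2k - 1)·c/(4L). The speed of sound of 340 m/s and the two tract lengths below are assumed values.

```python
def uniform_tube_formants(length_m, n_formants=3, sound_speed=340.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end.

    F_k = (2k - 1) * c / (4 * L); an idealization, not a model of a real vocal tract.
    """
    return [(2 * k - 1) * sound_speed / (4.0 * length_m)
            for k in range(1, n_formants + 1)]


longer = uniform_tube_formants(0.17)    # about the adult male length quoted earlier
shorter = uniform_tube_formants(0.14)   # hypothetical shorter vocal tract
print(longer)
print(shorter)
```

For the 17-cm tube this gives resonances near 500, 1500, and 2500 Hz, and every resonance of the shorter tube is higher.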

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e. its frequency-domain representation, is obtained.

Example 3.2 (cont.)

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} H(\omega)\, G(\omega)\, \delta(\omega - \omega_k) \right]$$

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega)$ is the Fourier transform of $w[n]$, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency or pitch.

Figure 3.11 (next slide) illustrates that the spectral shaping of the window transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega) G(\omega)|$, consisting of glottal and vocal tract contributions (unlike Example 3.1, which has only a glottal contribution).

Example 3.2

Example 32 The general upward or downward slope of the spectral

envelope also called spectral tilt is influenced by The nature of the glottal flow waveform over a cycle

eg a gradual or abrupt closing and by The manner in which formant tails add

Note also from the figure 311 that the formant locations are not always clear from the short-time Fourier transform magnitude |X()| because of sparse sampling of the spectral envelope |H()G()| by the source harmonics This is especially the case for high pitched speech

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the sparsity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency ($F_1$) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
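The effect of moving the first formant onto the sung fundamental can be illustrated with a toy calculation. This is a sketch under stated assumptions, not a model of a real singer: the vocal tract is reduced to a single two-pole resonator, and the sampling rate, bandwidth, and frequencies are hypothetical values chosen for illustration.

```python
import numpy as np

def resonator_gain(f_hz, formant_hz, bandwidth_hz, fs):
    """Magnitude response at f_hz of a two-pole resonator tuned to formant_hz."""
    r = np.exp(-np.pi * bandwidth_hz / fs)      # pole radius
    theta = 2.0 * np.pi * formant_hz / fs       # pole angle
    z = np.exp(1j * 2.0 * np.pi * f_hz / fs)    # point on the unit circle
    return abs(1.0 / (1.0 - 2.0 * r * np.cos(theta) / z + r ** 2 / z ** 2))

fs = 16000.0   # sampling rate in Hz (assumed)
f0 = 700.0     # sung fundamental in Hz (hypothetical)

gain_low_f1 = resonator_gain(f0, 350.0, 80.0, fs)    # F1 well below the fundamental
gain_tuned_f1 = resonator_gain(f0, 700.0, 80.0, fs)  # F1 raised to match the fundamental
boost_db = 20.0 * np.log10(gain_tuned_f1 / gain_low_f1)
```

`boost_db` is the level gained at the fundamental when the resonance is moved onto it; the exact figure depends entirely on the assumed bandwidth and frequencies.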

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping Nose

Spectral Shaping Mouse

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:
  • Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
  • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of the vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive), which involves:
  • a build-up of pressure behind the palate, and
  • an abrupt release of that pressure.

Source Generation Plosives "Drop"

[Figure: the word "drop", annotated with the voice onset time (VOT) and the aspiration]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either of these input types (i.e., an impulse or a noise source):
  • There is no harmonic structure with these types of input.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise source was idealized as having a flat spectrum. In reality, these sources themselves have a non-flat spectral shape.

Source Generation Fricatives "NASA"

Source Generation

There is another class of source generated within the vocal tract; however, it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source
  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
      • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
      • Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
      • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these source types are not exclusive to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples:
  • Fricatives: "zebra" vs. "sheba"
  • Plosives: "bin" vs. "pin"


Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
  • This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example, the Hamming window: $w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$.
  • The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window position $\tau$.
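The windowing and transform steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation; the function names, the test tone, and the 25 ms window with a 10 ms frame interval are assumptions.

```python
import numpy as np

def hamming_window(Nw):
    """w[n] = 0.54 - 0.46 cos(2 pi n / (Nw - 1)) for 0 <= n <= Nw - 1."""
    n = np.arange(Nw)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (Nw - 1))

def stft(x, Nw, hop, nfft):
    """Return one row of spectrum per window start tau = 0, hop, 2*hop, ..."""
    w = hamming_window(Nw)
    starts = range(0, len(x) - Nw + 1, hop)
    frames = np.array([w * x[tau:tau + Nw] for tau in starts])
    # Each FFT is taken relative to the frame start, so it differs from
    # X(omega, tau) only by the linear phase exp(-j*omega*tau).
    return np.fft.rfft(frames, n=nfft, axis=1)

fs = 8000
t = np.arange(800) / fs
x = np.sin(2.0 * np.pi * 1000.0 * t)       # 1 kHz test tone
X = stft(x, Nw=200, hop=80, nfft=256)      # 25 ms window, 10 ms frame interval
```

Because only the magnitude is used for a spectrogram, the linear-phase difference noted in the comment does not affect $|X(\omega, \tau)|$.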

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$S(\omega, \tau) = |X(\omega, \tau)|^2$

  • $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position $\tau$, one could plot $S(\omega, \tau)$.
  • A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrogram:
  • Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$

$x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$

where the glottal waveform over a cycle and the vocal tract impulse response have been combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
  • Uses a "long" window, with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation in the previous slide yields the following approximation (Exercise 3.8):

$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$
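One way to see where the narrowband approximation comes from is to expand the squared magnitude of the harmonic sum. This is only a sketch of the reasoning, not the worked solution to Exercise 3.8; the shorthand $a_k$ is introduced here purely for illustration.

```latex
\begin{align*}
a_k &= \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \\
\Bigl|\sum_k a_k\Bigr|^2 &= \sum_k |a_k|^2 + \sum_{k \neq l} a_k\, a_l^{*} \\
a_k\, a_l^{*} &= \hat{H}(\omega_k)\, \hat{H}^{*}(\omega_l)\,
                 W(\omega - \omega_k, \tau)\, W^{*}(\omega - \omega_l, \tau)
                 \approx 0 \quad (k \neq l)
\end{align*}
```

When the main lobes do not overlap and the side lobes are negligible, at most one shifted window transform is appreciable at any given frequency, so every cross term is close to zero and only the sum of the squared magnitudes remains.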

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  • Harmonic lines are "resolved" and appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
  • The wideband spectrogram uses a short window, with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • This widening causes neighboring shifted window transforms to overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
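The contrast between the two window lengths can be reproduced on a synthetic voiced sound. The sketch below is illustrative only: the signal (an impulse train exciting one decaying 500 Hz resonance), the 8 kHz sampling rate, the 1 ms frame interval, and the helper names are assumptions, while the 20 ms and 4 ms Hamming windows follow the values quoted above.

```python
import numpy as np

def spectrogram(x, win_len, hop, nfft=512):
    """S(omega, tau) = |X(omega, tau)|^2 using a Hamming window of win_len samples."""
    w = np.hamming(win_len)
    starts = np.arange(0, len(x) - win_len + 1, hop)
    frames = np.stack([w * x[s:s + win_len] for s in starts])
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2

def energy_fluctuation(S):
    """Relative variation over time of the per-frame sum of S over frequency."""
    e = S.sum(axis=1)
    return e.std() / e.mean()

fs = 8000
P = 80                                        # 10 ms pitch period (100 Hz pitch)
n = np.arange(400)
h = np.exp(-np.pi * 80.0 * n / fs) * np.cos(2.0 * np.pi * 500.0 * n / fs)  # toy formant
p = np.zeros(4000)
p[::P] = 1.0
x = np.convolve(p, h)[400:3600]               # keep the steady-state part

S_narrow = spectrogram(x, win_len=160, hop=8)  # 20 ms window
S_wide = spectrogram(x, win_len=32, hop=8)     # 4 ms window

fluct_narrow = energy_fluctuation(S_narrow)
fluct_wide = energy_fluctuation(S_wide)
```

With the short window the frame energy rises and falls within every pitch period, which is what produces the vertical striations; with the long window it stays comparatively steady, and the structure appears along the frequency axis instead.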

Figure 3.15

Figure 3.16

MATLAB

[Figure: plot titled "SPECTROGRAM"; horizontal axis Time [s] from 0 to 3, vertical axis Frequency [Hz] from 0 to 8000]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a plot titled "SPECTROGRAM" (Frequency [Hz] from 0 to 8000 vs. Time [s]) of the sentence "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

MATLAB

[Figure: plot titled "SPECTROGRAM"; horizontal axis Time [s] from 0 to 3, vertical axis Frequency [Hz] from 0 to 8000]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a plot titled "SPECTROGRAM" (Frequency [Hz] from 0 to 8000 vs. Time [s]) of the sentence "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
  • the place of the tongue hump along the oral tract,
  • the degree of constriction of the hump, and
  • the possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

A phoneme is a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two descriptors are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language

One broad classification of the sounds of English is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language

In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the "click" language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by:
  • adjacent phonemes,
  • speaking rate,
  • emphasis in speaking, and
  • the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
  • Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences between speakers are one cause of allophonic variation:
  • the place and degree of constriction of the tongue hump, and
  • the cross-section and length of the vocal tract.

Consequently, the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      • It is dominated by the low resonance of the large volume of the nasal cavity.
      • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.
  • Source:
      • The vocal folds are relaxed and not vibrating for unvoiced fricatives.
      • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
  • System: the location of the constriction made by the tongue or lips determines which sound is produced: at the back, center, or front of the oral tract, or at the teeth or lips.
  • Spectrogram: noise-like, with energy concentrated at higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$x_g[n] = h[n] * (g[n] * p[n])$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\, u[n])$

Example 3.4 (cont.)

In this simplified model, we assume that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
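The simplified voiced-fricative model of Example 3.4 can be simulated directly. The sketch below is a toy instance under stated assumptions: the glottal pulse shape, the two single-resonance impulse responses standing in for $h[n]$ and $h_f[n]$, the sampling rate, and the helper name `decaying_cosine` are all hypothetical.

```python
import numpy as np

def decaying_cosine(freq_hz, bandwidth_hz, fs, length):
    """Toy single-resonance impulse response (a stand-in for h[n] or hf[n])."""
    n = np.arange(length)
    return np.exp(-np.pi * bandwidth_hz * n / fs) * np.cos(2.0 * np.pi * freq_hz * n / fs)

fs, P, N = 8000, 80, 1600
rng = np.random.default_rng(0)

g = np.hanning(17)                    # hypothetical glottal pulse g[n]
p = np.zeros(N)
p[::P] = 1.0                          # impulse train p[n] with pitch period P
u = np.convolve(g, p)[:N]             # u[n] = g[n] * p[n]
q = rng.standard_normal(N)            # white noise q[n]

h = decaying_cosine(500.0, 80.0, fs, 400)      # whole vocal tract (assumed)
hf = decaying_cosine(3000.0, 300.0, fs, 200)   # front oral cavity (assumed)

x_g = np.convolve(h, u)[:N]           # periodic component h[n] * u[n]
x_q = np.convolve(hf, q * u)[:N]      # noise modulated by the glottal flow
x = x_g + x_q                         # the two contributions are assumed to add
```

Because the noise is multiplied by $u[n]$ before filtering, the noise excitation is present only while the glottal flow is non-zero, so the noise component pulses at the pitch rate.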

Elements of a Language: Fricatives
  • Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  • Waveform: an unvoiced fricative contains only noise, whereas a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, though brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  • Sequence of events:
      1. Complete closure of the oral tract and buildup of air pressure.
      2. Release of air pressure and generation of turbulence over a very short time duration.
      3. Generation of aspiration due to turbulence at the open vocal folds.
      4. Onset of the following vowel about 40-50 ms after the burst.
  • With voiced plosives, the vocal folds vibrate for the duration of all four steps:
      • While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
      • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
      • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language Plosives Waveform

Elements of a Language Plosives Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:

$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$

We have assumed that the two outputs can be linearly combined.
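The generalized convolution of Example 3.5 can be evaluated with a direct double loop over $n$ and $m$. The following is a minimal sketch: the gliding formant used for $h[n, m]$, the fixed front-cavity response used for $h_f[n, m]$, the truncation to `M` taps, and the function name are assumptions made for illustration.

```python
import numpy as np

def time_varying_output(h, u):
    """y[n] = sum_m h[n, m] * u[n - m], with h stored as an (N, M) array."""
    N, M = h.shape
    y = np.zeros(N)
    for n in range(N):
        m = np.arange(min(M, n + 1))     # only terms with n - m >= 0 contribute
        y[n] = np.dot(h[n, m], u[n - m])
    return y

fs, P, N, M = 8000, 80, 800, 200
g = np.hanning(17)                       # hypothetical glottal pulse g[n]
p = np.zeros(N)
p[::P] = 1.0
u = np.convolve(g, p)[:N]                # periodic source u[n] = g[n] * p[n]
delta = np.zeros(N)
delta[0] = 1.0                           # burst idealized as an impulse at n = 0

m_idx = np.arange(M)[None, :]
formant_hz = np.linspace(300.0, 700.0, N)[:, None]   # assumed formant glide over time n
h = np.exp(-np.pi * 80.0 * m_idx / fs) * np.cos(2.0 * np.pi * formant_hz * m_idx / fs)
hf = (np.exp(-np.pi * 400.0 * m_idx / fs)
      * np.cos(2.0 * np.pi * 2500.0 * m_idx / fs)
      * np.ones((N, 1)))                 # front cavity, held fixed here for simplicity

x = time_varying_output(h, u) + time_varying_output(hf, delta)
```

When every row of `h` is identical, the function reduces to an ordinary convolution, which gives a quick sanity check on the indexing.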

fm

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Have a vowel-like nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Are characterized by movement from one vowel target to another.
  • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds.
  • Glides (/w/ as in "we" and /y/ as in "you")
  • Liquids (/r/ as in "read" and /l/ as in "let")

Glides, compared with diphthongs, show:
  • greater constriction of the oral tract during the transition, and
  • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • Compared with fricatives, affricates have a fricative portion that is preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).

Coarticulation
  • The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation

Introduction

Distinguishable speech sounds are determined not only by the source but also by different vocal tract configurations, and by the combination of both.

Speech sound classes are referred to as phonemes.
  • Phonemics is the discipline that studies phoneme realizations (e.g., in a language).
  • Each phoneme class provides a certain meaning in a word.
  • Within a phoneme class there exist many sound variations that provide the same meaning (e.g., homonyms). The study of these sound variations is called phonetics.

Phonemes are the basic building blocks of a language. They are concatenated (more or less) as discrete elements into words, according to certain phonemic and grammatical rules.

Introduction

This chapter will cover:
  • a description of the speech production mechanism,
  • the resulting variety of phonetic sound patterns, and
  • how these sounds differ among different speakers.

Anatomy and Physiology of Speech Production

Introduction

Anatomy and Physiology of Speech Production

The anatomy of speech production is shown in Figure 3.2.

Lungs
  • Inhalation and exhalation of air.
  • Connected to the vocal tract through the trachea ("windpipe"), a pipe ~12 cm long and ~1.5-2 cm in diameter, and the epiglottis.
  • During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production:
      • The duration of exhalation becomes roughly equal to the length of the sentence or phrase.
      • Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.

Anatomy and Physiology of Speech Production: Larynx
  • A complicated system of cartilages, flesh, muscles, and ligaments.
  • Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
  • The vocal folds are ~15 mm long in men and ~13 mm long in women.

Larynx

Anatomy and Physiology of Speech Production

The vocal folds have three primary states:
  • Breathing: the arytenoid cartilages are held outward.
  • Voiced: the arytenoid cartilages are held close together.
  • Unvoiced: the arytenoid cartilages are held outward or partially closed.

The complex motion of the vocal folds is illustrated in Figure 3.4, and the nonlinear two-mass model of Flanagan et al. in Figure 3.5.

Dictionary: arytenoid (ar·y·te·noid). Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, "ladle". Date: circa 1751. (1) Relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. (2) Relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Also: arytenoid, noun.

Anatomy and Physiology of Speech Production: Flanagan et al. model

Anatomy and Physiology of Speech Production

If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6.

Closed phase: the folds are closed and no flow occurs.

Open phase: the folds are open and the flow increases up to a maximum.

Return phase: the time interval from the maximum airflow until the glottal closure.

The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

Glottal airflow is referred to as glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
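The relation between pitch period and fundamental frequency can be illustrated with a small sketch. Everything numeric here is an assumption made for illustration: the 8 kHz sampling rate, the 80-sample pitch period, the raised-cosine pulse standing in for one glottal cycle, and the autocorrelation-based estimator (which is not taken from the slides).

```python
import math


def estimate_pitch_period(signal, min_lag, max_lag):
    """Return the lag (in samples) at which the autocorrelation is largest."""
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


SAMPLE_RATE_HZ = 8000   # assumed sampling rate
TRUE_PERIOD = 80        # assumed pitch period in samples (10 ms)
OPEN_SAMPLES = 40       # assumed duration of the open and return phases

# One idealized glottal cycle: a smooth pulse followed by a closed phase of zeros.
one_cycle = [
    0.5 * (1.0 - math.cos(2.0 * math.pi * n / OPEN_SAMPLES)) if n < OPEN_SAMPLES else 0.0
    for n in range(TRUE_PERIOD)
]
glottal_flow = one_cycle * 6  # six identical cycles, i.e. no jitter or shimmer

period_samples = estimate_pitch_period(glottal_flow, min_lag=20, max_lag=200)
pitch_period_s = period_samples / SAMPLE_RATE_HZ
fundamental_hz = 1.0 / pitch_period_s  # pitch is the reciprocal of the pitch period
print(period_samples, pitch_period_s, fundamental_hz)
```

Real glottal flow is not perfectly periodic, so an estimator like this would normally be applied to short segments rather than to a whole utterance.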

Anatomy and Physiology of Speech Production

Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression in the frequency domain is obtained:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum

Figure 3.7

Example 3.1

A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.

The first harmonic is also the fundamental frequency.

At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.

The magnitude of the spectral shaping function, i.e., of the glottal flow, $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
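The harmonic structure described in Example 3.1 can be checked numerically. The following is a minimal sketch under stated assumptions: the pitch period of 50 samples, the 20-sample raised-cosine pulse used for $g[n]$, and the 300-sample Hamming window are all illustrative choices, and the transform is evaluated directly from its definition rather than with an FFT library.

```python
import cmath
import math


def dtft(x, omega):
    """Discrete-time Fourier transform of a finite sequence at frequency omega (rad/sample)."""
    return sum(value * cmath.exp(-1j * omega * n) for n, value in enumerate(x))


P = 50          # assumed pitch period in samples
PULSE_LEN = 20  # assumed duration of the glottal pulse g[n]
N_W = 300       # assumed window length (six pitch periods)

# g[n]: one idealized glottal pulse; u[n] = g[n] * p[n] repeats it every P samples.
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / PULSE_LEN)) for n in range(PULSE_LEN)]
u = [g[n % P] if (n % P) < PULSE_LEN else 0.0 for n in range(N_W)]

# Hamming analysis window and the windowed segment.
w = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N_W - 1)) for n in range(N_W)]
segment = [wn * un for wn, un in zip(w, u)]

omega_1 = 2.0 * math.pi / P  # fundamental frequency in rad/sample
harmonic_mags = [abs(dtft(segment, k * omega_1)) for k in (1, 2, 3)]
between_mags = [abs(dtft(segment, (k + 0.5) * omega_1)) for k in (1, 2, 3)]

for k, (at_harmonic, between) in enumerate(zip(harmonic_mags, between_mags), start=1):
    print(f"harmonic {k}: {at_harmonic:.2f}   halfway to harmonic {k + 1}: {between:.2f}")
```

The magnitudes at the harmonics follow the envelope $|G(\omega_k)|$, while the values halfway between harmonics stay small because only window side lobes contribute there.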

Anatomy and Physiology of Speech Production

The Fourier transform of a periodic glottal waveform is characterized by harmonics.

Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:

The pitch period can vary "randomly" over successive periods (pitch "jitter").

The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods (amplitude "shimmer").

These variations are perhaps due to:

Time-varying characteristics of the vocal tract and vocal folds.

Nonlinear behavior in the speech anatomy.

An underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
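Jitter and shimmer can be quantified in several ways. The sketch below uses one common convention, the mean absolute difference between consecutive cycles divided by the mean value; this definition and the measurement values are assumptions for illustration and are not taken from the slides.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical per-cycle measurements from a sustained vowel.
pitch_periods_ms = [10.0, 10.2, 9.8, 10.1, 9.9]
peak_amplitudes = [1.00, 0.95, 1.05, 0.98, 1.02]

jitter = relative_perturbation(pitch_periods_ms)   # cycle-to-cycle period variation
shimmer = relative_perturbation(peak_amplitudes)   # cycle-to-cycle amplitude variation
print(f"jitter = {100 * jitter:.2f} %, shimmer = {100 * shimmer:.2f} %")
```

A perfectly periodic source, as in the idealized model of Example 3.1, gives zero for both measures.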

Anatomy and Physiology of Speech Production

States of the vocal folds: breathing, voicing, and unvoicing.

Unvoicing: turbulence at the vocal folds is called aspiration. Example: "he"; whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part remains nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:

Creaky voice: very tense vocal folds, with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.

Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.

Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract

Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to:

Spectrally "color" the source.

Generate new sources for sound production.

Spectral Shaping

Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:

The frequency of the formant is $\omega_0$.

The bandwidth depends on the distance of the pole from the unit circle (determined by $r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

where $c_k$ and $c_k^{*}$ are complex-conjugate pole pairs.
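The link between a pole and its formant can be made concrete with a short sketch. The bandwidth formula $B \approx -\ln(r_0)\,F_s/\pi$ is a standard approximation for poles close to the unit circle and is not stated on the slides; the 8 kHz sampling rate and the pole placement are likewise assumptions chosen for illustration.

```python
import cmath
import math


def formant_from_pole(r0, omega0, sample_rate_hz):
    """Formant frequency and approximate 3-dB bandwidth (both in Hz) of a pole r0 * exp(j * omega0)."""
    frequency_hz = omega0 * sample_rate_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * sample_rate_hz / math.pi
    return frequency_hz, bandwidth_hz


def all_pole_magnitude(poles, omega, gain=1.0):
    """|H(e^{j omega})| for H(z) = gain / prod_k (1 - c_k z^-1)(1 - c_k* z^-1)."""
    z_inv = cmath.exp(-1j * omega)
    denominator = 1.0 + 0.0j
    for c in poles:
        denominator *= (1.0 - c * z_inv) * (1.0 - c.conjugate() * z_inv)
    return abs(gain / denominator)


SAMPLE_RATE_HZ = 8000  # assumed sampling rate
# Assumed pole: radius 0.97 at an angle corresponding to 500 Hz.
pole = cmath.rect(0.97, 2.0 * math.pi * 500.0 / SAMPLE_RATE_HZ)

frequency_hz, bandwidth_hz = formant_from_pole(abs(pole), cmath.phase(pole), SAMPLE_RATE_HZ)

# Locate the spectral peak on a 1-Hz grid to confirm it lies near the pole angle.
peak_hz = max(
    range(1, SAMPLE_RATE_HZ // 2),
    key=lambda f: all_pole_magnitude([pole], 2.0 * math.pi * f / SAMPLE_RATE_HZ),
)
print(frequency_hz, bandwidth_hz, peak_hz)
```

The spectral peak falls close to, but not exactly at, the pole angle, because the conjugate pole also contributes to the response; this is why the slides say the peaks correspond only approximately to the formants.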

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:

Male speakers tend to have lower formants than female speakers.

Female speakers tend to have lower formants than children.

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
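The dependence of formant frequencies on vocal tract length can be illustrated with the classical uniform-tube idealization, which is an assumption introduced here and not part of the slides: a lossless tube of constant cross-section, closed at the glottis and open at the lips, resonates at odd multiples of $c/(4L)$. The speed of sound (340 m/s) and the shorter tract length are also illustrative values.

```python
def uniform_tube_formants_hz(length_m, count=3, speed_of_sound_m_s=340.0):
    """Resonances of an idealized uniform tube: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * speed_of_sound_m_s / (4.0 * length_m) for k in range(1, count + 1)]


adult_male = uniform_tube_formants_hz(0.17)     # 17 cm, the average length quoted earlier
shorter_tract = uniform_tube_formants_hz(0.14)  # hypothetical shorter vocal tract

print([round(f) for f in adult_male])
print([round(f) for f in shorter_tract])
```

Real vocal tracts are not uniform tubes, so these values only indicate the trend: every formant of the shorter tract lies above the corresponding formant of the longer one.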

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained:

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.

Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, where it consists only of the glottal contribution).

Example 3.2 (Figure 3.11)

Example 3.2

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:

The nature of the glottal flow waveform over a cycle, e.g., a gradual or abrupt closing.

The manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:

A formant corresponds to a vocal tract pole (resonant frequency).

Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be detrimental to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
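The effect of moving the first formant onto the first harmonic can be sketched with a single two-pole resonator standing in for the first formant. All numbers are assumptions for illustration: an 800 Hz sung pitch, a first formant at 500 Hz before tuning, an 80 Hz bandwidth, and a 16 kHz sampling rate.

```python
import cmath
import math


def two_pole_gain(formant_hz, bandwidth_hz, freq_hz, sample_rate_hz):
    """Magnitude at freq_hz of 1 / ((1 - c z^-1)(1 - c* z^-1)) with the pole pair set by the formant."""
    r0 = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)  # radius from the bandwidth approximation
    pole = cmath.rect(r0, 2.0 * math.pi * formant_hz / sample_rate_hz)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    return 1.0 / abs((1.0 - pole * z_inv) * (1.0 - pole.conjugate() * z_inv))


SAMPLE_RATE_HZ = 16000   # assumed sampling rate
FIRST_HARMONIC_HZ = 800  # assumed sung pitch
BANDWIDTH_HZ = 80        # assumed formant bandwidth

untuned = two_pole_gain(500.0, BANDWIDTH_HZ, FIRST_HARMONIC_HZ, SAMPLE_RATE_HZ)
tuned = two_pole_gain(800.0, BANDWIDTH_HZ, FIRST_HARMONIC_HZ, SAMPLE_RATE_HZ)
gain_db = 20.0 * math.log10(tuned / untuned)
print(f"level change at the first harmonic: {gain_db:.1f} dB")
```

In this simplified model, the first harmonic is substantially stronger once the formant coincides with it, which matches the qualitative argument of Example 3.3.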

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum.

When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "nose"

Spectral Shaping: "mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

Two dominant effects characterize nasalization:

Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.

Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):

Build-up of pressure behind the palate.

Abrupt release of pressure.

Source Generation: plosives, "Drop" (figure annotated with VOT and aspiration)

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (so that the airflow is constricted but not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):

There is no harmonic structure with these types of inputs.

The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise source was idealized by assuming a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: fricatives, "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.

This source arises from the interaction of vortices with vocal tract boundaries, such as the false vocal folds, teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:

Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".

Plosives: sounds created with an impulsive source within the oral tract. Example: "top".

Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these subcategories do not relate exclusively to unvoiced sources: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives).

Categorization of Sound by Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".

A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.

This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example, the Hamming window:

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n - \tau \le N_w - 1$$

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window center at time $\tau$.

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.

For each window position $\tau$, one could plot $S(\omega, \tau)$.

A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:

Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.

Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
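The short-time Fourier transform and spectrogram defined above can be sketched directly from their definitions. This is a minimal illustration, not an efficient implementation: the transform is evaluated with a plain sum rather than an FFT, the window position is taken as the start of each frame, and the test signal, window length, and hop size are assumptions. Measuring time from the start of each frame changes only the phase of $X(\omega, \tau)$, not the spectrogram $|X(\omega, \tau)|^2$.

```python
import cmath
import math


def hamming(length):
    """Hamming window: 0.54 - 0.46 cos(2 pi n / (N_w - 1)) for 0 <= n <= N_w - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]


def spectrogram(x, window_length, hop):
    """Return frames of |X(omega_k, tau)|^2 for omega_k = 2 pi k / window_length, k = 0..N_w/2."""
    window = hamming(window_length)
    frames = []
    for tau in range(0, len(x) - window_length + 1, hop):  # the window advances by one frame interval
        segment = [window[n] * x[tau + n] for n in range(window_length)]
        frame = []
        for k in range(window_length // 2 + 1):
            omega = 2.0 * math.pi * k / window_length
            transform = sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(segment))
            frame.append(abs(transform) ** 2)
        frames.append(frame)
    return frames


# Assumed test signal: a sinusoid with a period of exactly 8 samples.
signal = [math.cos(2.0 * math.pi * n / 8.0) for n in range(256)]
frames = spectrogram(signal, window_length=64, hop=32)
peak_bins = [frame.index(max(frame)) for frame in frames]
print(len(frames), peak_bins)
```

With a 64-sample window, a sinusoid with an 8-sample period falls on frequency sample k = 8, so every frame should peak there; shortening the window would widen that peak, which is the narrowband versus wideband trade-off described above.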

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram

Uses a "long" window with a duration of typically at least two pitch periods.

Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech

Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive that is closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech

Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).

Shortening the window widens its Fourier transform (recall the uncertainty principle).

Widening the window Fourier transform causes the window transforms at neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency.

Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
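The window-length criteria above can be turned into a small calculation. This sketch classifies a window by the number of pitch periods it spans, using the thresholds quoted on the slides (at least two periods for narrowband, less than one for wideband). The 10 kHz sampling rate, the 100 Hz pitch, the "intermediate" label, and the approximate Hamming main-lobe width of $4F_s/N_w$ Hz are assumptions added for illustration.

```python
def describe_window(window_ms, pitch_hz, sample_rate_hz):
    """Summarize how an analysis window relates to the pitch period of the analyzed voice."""
    length_samples = round(window_ms * sample_rate_hz / 1000)
    pitch_periods = window_ms * pitch_hz / 1000           # window duration in pitch periods
    main_lobe_hz = 4.0 * sample_rate_hz / length_samples  # approximate null-to-null width (Hamming)
    if pitch_periods >= 2:
        kind = "narrowband"
    elif pitch_periods < 1:
        kind = "wideband"
    else:
        kind = "intermediate"
    return {
        "samples": length_samples,
        "pitch_periods": pitch_periods,
        "main_lobe_hz": main_lobe_hz,
        "kind": kind,
    }


SAMPLE_RATE_HZ = 10000  # assumed sampling rate
PITCH_HZ = 100          # assumed pitch of the analyzed voice

long_window = describe_window(20, PITCH_HZ, SAMPLE_RATE_HZ)  # 20-ms Hamming window
short_window = describe_window(4, PITCH_HZ, SAMPLE_RATE_HZ)  # 4-ms Hamming window
print(long_window)
print(short_window)
```

The same 20-ms window would no longer span two pitch periods for a voice with a pitch below 100 Hz, so the narrowband or wideband character of a spectrogram depends on the speaker as well as on the window.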

Figure 3.15

Figure 3.16

MATLAB: spectrogram (frequency 0-8000 Hz versus time 0-3 s)

MATLAB: normalized magnitude waveform and spectrogram (frequency 0-8000 Hz versus time 0-3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.

2. The shape of the vocal tract, described by the place and manner of articulation: the place of the tongue hump along the oral tract and the degree of the constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.

3. The time-domain waveform, which gives the pressure change with time at the lips output.

4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.

To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets.

Syllables contain one or more phonemes.

Words are formed from one or more syllables.

Phrases are concatenations of words.

If the first two descriptors (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors (the waveform and the spectrogram) are used, the study is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features.

Articulatory features (corresponding to the first two descriptors) include:

Vocal fold state: vibrating or open.

Tongue position and height: front, central, or back along the palate.

Constriction: partial or complete.

Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System:

The velum is lowered and the air flows mainly through the nasal cavity.

Because the oral tract is constricted, the sound is radiated at the nostrils.

Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:

Dominated by the low resonance of the large volume of the nasal cavity.

The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.

The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:

For unvoiced fricatives, the vocal folds are relaxed and not vibrating.

For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.

Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction, made by the tongue at the back, center, or front of the oral tract, or at the teeth or lips, determines which sound is produced.

Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component is modeled by the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

Example 3.4

In the simplified model, we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:

$u[n]$ is modified by the oral cavity.

$x_q[n]$ can be influenced by the back cavity.

Sources of nonlinear effects (distributed sources due to traveling vortices).
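The simplified voiced-fricative model of Example 3.4 can be sketched as a short synthesis routine. The signal model follows the equations of the example; everything else is an assumption for illustration: the sampling rate, the pitch period, the raised-cosine glottal pulse, the two decaying-cosine impulse responses standing in for $h[n]$ and $h_f[n]$, and the seeded Gaussian noise used for $q[n]$.

```python
import math
import random


def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_value in enumerate(a):
        if a_value == 0.0:
            continue  # skip zero samples, which contribute nothing
        for j, b_value in enumerate(b):
            out[i + j] += a_value * b_value
    return out


def resonance(freq_hz, decay, length, sample_rate_hz):
    """Hypothetical impulse response: an exponentially decaying cosine."""
    return [decay ** n * math.cos(2.0 * math.pi * freq_hz * n / sample_rate_hz) for n in range(length)]


SAMPLE_RATE_HZ = 8000  # assumed sampling rate
P = 80                 # assumed pitch period in samples
NUM = 400              # number of output samples
rng = random.Random(0)

g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 40.0)) for n in range(40)]  # one glottal pulse
p = [1.0 if n % P == 0 else 0.0 for n in range(NUM)]                       # impulse train
u = convolve(g, p)[:NUM]                                                   # u[n] = g[n] * p[n]
q = [rng.gauss(0.0, 1.0) for _ in range(NUM)]                              # white noise source

h = resonance(500.0, 0.98, 200, SAMPLE_RATE_HZ)    # stands in for the whole vocal tract
h_f = resonance(3000.0, 0.90, 50, SAMPLE_RATE_HZ)  # stands in for the front cavity

x_g = convolve(h, u)[:NUM]                          # periodic component
modulated_noise = [qn * un for qn, un in zip(q, u)]  # glottal flow modulates the noise
x_q = convolve(h_f, modulated_noise)[:NUM]          # noise component
x = [a + b for a, b in zip(x_g, x_q)]               # the two components are assumed to add
```

Because the noise is multiplied by $u[n]$, it is switched off during the closed phase of each glottal cycle, which is the pitch-synchronous noise behavior the model is meant to capture.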

Elements of a Language: Fricatives

Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise, while a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Forms a class of its own under the general category of consonants.

Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24: Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:

1. Complete closure of the oral tract and buildup of air pressure.

2. Release of air pressure and generation of turbulence over a very short time duration.

3. Generation of aspiration due to turbulence at the open vocal folds.

4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps:

During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".

After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.

There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (waveform)

Elements of a Language: Plosives (spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
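The generalized convolution of Example 3.5 can be sketched as follows. The summation implements the output equation for causal responses with finite memory; the two response functions, the gliding resonance frequency, and all numeric values are hypothetical choices made only to produce a runnable illustration.

```python
import math


def time_varying_output(h, h_f, u, burst, memory):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) burst[n - m], for 0 <= m < memory."""
    output = []
    for n in range(len(u)):
        total = 0.0
        for m in range(min(n, memory - 1) + 1):  # causal responses, inputs start at n = 0
            total += h(n, m) * u[n - m] + h_f(n, m) * burst[n - m]
        output.append(total)
    return output


def moving_resonance(n, m):
    """Hypothetical vocal tract response whose resonance glides during the first 200 samples."""
    omega = math.pi * (0.10 + 0.10 * min(n, 200) / 200.0)
    return (0.95 ** m) * math.cos(omega * m)


def front_cavity(n, m):
    """Hypothetical front-cavity response (time-invariant here for simplicity)."""
    return (0.80 ** m) * math.cos(0.6 * math.pi * m)


NUM, P, PULSE_LEN = 300, 60, 20  # assumed lengths in samples
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / PULSE_LEN)) for n in range(PULSE_LEN)]
u = [g[n % P] if (n % P) < PULSE_LEN else 0.0 for n in range(NUM)]  # u[n] = g[n] * p[n]
burst = [1.0] + [0.0] * (NUM - 1)                                   # idealized burst delta[n]

x = time_varying_output(moving_resonance, front_cavity, u, burst, memory=100)
```

When `h(n, m)` does not depend on `n`, the inner sum reduces to ordinary convolution, which gives a simple way to sanity-check the routine.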

Elements of a Language: Transitional Speech Sounds

Diphthongs

Vowel-like in nature, with vibrating vocal folds.

Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.

Characterized by movement from one vowel target to another.

Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

April 22 2023 Veton Keumlpuska 96

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
Glides (/w/ as in "we" and /y/ as in "you") and
Liquids (/r/ as in "read" and /l/ as in "let")

Glides are characterized by
Greater constriction of the oral tract during the transition, and
Greater speed of the oral tract movement, compared to diphthongs.

April 22 2023 Veton Keumlpuska 97

Figure 3.28 - Liquids & Glides

April 22 2023 Veton Keumlpuska 98

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: /tS/ as in "chew" (unvoiced); /J/ as in "just" (voiced).

April 22 2023 Veton Keumlpuska 99

Coarticulation
Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

Local - articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: "horse" vs. "horseshoe", "sweep" vs. "seep".

Global - articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

April 22 2023 Veton Keumlpuska 100

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/Energy (loudness)
Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

April 22 2023 Veton Keumlpuska 101

Figure 3.29 - Prosody

April 22 2023 Veton Keumlpuska 102

Figure 3.30 - Global Coarticulation


April 22 2023 Veton Keumlpuska 5

Example of "Shop"

Noise-like signal, periodic source, impulse source

April 22 2023 Veton Keumlpuska 6

Introduction
Distinguishable speech sounds are determined not only by the source but also by different vocal tract configurations, and by combinations of both.

Speech sound classes are referred to as phonemes.
Phonemics is the discipline that studies phoneme realizations (e.g., in a language).
Each phoneme class provides a certain meaning in a word.
Within a phoneme class there exist many sound variations that provide the same meaning (e.g., allophones). The study of these sound variations is called phonetics.

Phonemes are the basic building blocks of a language. They are concatenated (more or less) as discrete elements into words, according to certain phonemic and grammatical rules.

Introduction
This chapter will cover:
A description of the speech production mechanism
The resulting variety of phonetic sound patterns
How these sounds differ among different speakers

April 22 2023 Veton Keumlpuska 7

Anatomy and Physiology of Speech Production

Introduction

April 22 2023 Veton Keumlpuska 8

April 22 2023 Veton Keumlpuska 9

Anatomy and Physiology of Speech Production
The anatomy of speech production is shown in Figure 3.2.

Lungs
Inhalation and exhalation of air.
Connected to the vocal tract through the trachea ("windpipe") and the epiglottis. The trachea is a ~12-cm-long, ~1.5-2-cm-diameter pipe.

During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production:
The duration of exhalation becomes roughly equal to the length of the sentence/phrase.
Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.

April 22 2023 Veton Keumlpuska 10

April 22 2023 Veton Keumlpuska 11

Anatomy and Physiology of Speech Production
Larynx
A complicated system of cartilages, flesh, muscles, and ligaments.
Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
The vocal folds are ~15 mm long in men and ~13 mm long in women.

Larynx

April 22 2023 Veton Keumlpuska 12

April 22 2023 Veton Keumlpuska 13

Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
Breathing - arytenoid cartilages are held outward.
Voiced - arytenoid cartilages are held close together.
Unvoiced - arytenoid cartilages are held outward or partially closed.

The complex motion of the vocal folds is illustrated in Figure 3.4.

Nonlinear two-mass model of Flanagan et al. (Figure 3.5).

Dictionary: arytenoid (ar·y·te·noid), adjective. Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, "ladle". Date: circa 1751.
1. Relating to or being either of two small laryngeal cartilages to which the vocal cords are attached.
2. Relating to or being either of a pair of small muscles or an unpaired muscle of the larynx.
Also used as a noun.

Anatomy and Physiology of Speech Production
Flanagan et al. model

April 22 2023 Veton Keumlpuska 14

April 22 2023 Veton Keumlpuska 15

Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:
Closed phase: the folds are closed and no flow occurs.
Open phase: the folds are open and the flow increases up to a maximum.
Return phase: the time interval from the maximum airflow until glottal closure.

The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

The glottal airflow is referred to as the glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, or the fundamental frequency.

Anatomy and Physiology of Speech Production

April 22 2023 Veton Keumlpuska 16

April 22 2023 Veton Keumlpuska 17

Example 3.1
Consider a glottal flow waveform model of the form

u[n] = g[n] * p[n]

where g[n] is the glottal flow waveform over a single cycle, * denotes convolution, and p[n] is an impulse train with spacing P:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by w[n, τ], is centered at time τ, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as

u[n, τ] = w[n, τ](g[n] * p[n])

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$

April 22 2023 Veton Keumlpuska 18

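Example 3.1

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k)\, \delta(\omega - \omega_k) \right]$$

$$U(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], G(ω) is the Fourier transform of g[n], and ω_k = (2π/P)k, where 2π/P is the fundamental frequency, or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0, with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum.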

April 22 2023 Veton Keumlpuska 19

Figure 37

April 22 2023 Veton Keumlpuska 20

Example 3.1
A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform W(ω - ω_k), weighted by G(ω_k).

The magnitude of the spectral shaping function, i.e., of the glottal flow, |G(ω)|, is referred to as the spectral envelope of the harmonics.
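The harmonic structure of Example 3.1 can be checked numerically with a direct DFT. The sketch below is illustrative only: the pitch period `P = 32`, the window length `N = 256`, and the decaying toy pulse used for g[n] are assumptions chosen so that the harmonics fall exactly on DFT bins.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def dft_magnitude(x):
    """Magnitude of the DFT of x for bins 0 .. len(x) // 2."""
    n_pts = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts)))
        for k in range(n_pts // 2 + 1)
    ]


P = 32   # assumed pitch period in samples
N = 256  # window length: 8 pitch periods
g = [0.8 ** n for n in range(P)]   # toy one-cycle glottal pulse (assumption)
u = [g[n % P] for n in range(N)]   # g[n] * p[n]: the pulse repeated every P samples
w = hamming(N)
mag = dft_magnitude([wn * un for wn, un in zip(w, u)])

spacing = N // P  # harmonics are expected every N / P DFT bins
```

With these settings the magnitude at multiples of `spacing` (the harmonics) is far larger than midway between them, and the heights of the harmonic peaks trace the envelope |G(ω_k)|. Halving `P` doubles `spacing`, matching the statement that a shorter pitch period spreads the harmonics apart.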

April 22 2023 Veton Keumlpuska 21

Anatomy and Physiology of Speech Production
The Fourier transform of the periodic glottal waveform is characterized by harmonics.
Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
The pitch period can vary "randomly" over successive periods - pitch "jitter".
The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods - amplitude "shimmer".

These variations are (perhaps) due to:
Time-varying characteristics of the vocal tract and vocal folds,
Nonlinear behavior in the speech anatomy, or
Behavior that appears random while being the result of an underlying deterministic (chaotic) system.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
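Jitter and shimmer can be illustrated by perturbing an otherwise ideal impulse train. The sketch below is a modeling assumption, not the method of the slides: each pitch period and each pulse amplitude is perturbed by an independent uniform random fraction, and the function name and parameters are hypothetical.

```python
import random


def jittered_pulse_train(n_pulses, period, jitter=0.0, shimmer=0.0, seed=0):
    """Return (pulse_times, pulse_amplitudes) for a glottal impulse train.

    jitter  : maximum fractional deviation of each pitch period
    shimmer : maximum fractional deviation of each pulse amplitude
    """
    rng = random.Random(seed)
    times, amps = [], []
    t = 0.0
    for _ in range(n_pulses):
        times.append(t)
        amps.append(1.0 + shimmer * rng.uniform(-1.0, 1.0))
        t += period * (1.0 + jitter * rng.uniform(-1.0, 1.0))
    return times, amps


# Ideal (machine-like) source versus a slightly irregular one.
ideal_times, ideal_amps = jittered_pulse_train(5, 80.0)
nat_times, nat_amps = jittered_pulse_train(50, 80.0, jitter=0.02, shimmer=0.05)
```

Setting both `jitter` and `shimmer` to zero recovers the strictly periodic model of Example 3.1.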

April 22 2023 Veton Keumlpuska 22

Anatomy and Physiology of Speech Production
States of the vocal folds:
Breathing
Voicing
Unvoicing - turbulence at the vocal folds (aspiration). Example: "he"; whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.

April 22 2023 Veton Keumlpuska 23

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:

Creaky voice - very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.

Vocal fry - the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, the result of coupling of the false vocal folds with the true vocal folds.

Diplophonic voice - secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

April 22 2023 Veton Keumlpuska 24

Anatomy and Physiology of Speech Production

April 22 2023 Veton Keumlpuska 25

Examples of atypical voice types

April 22 2023 Veton Keumlpuska 26

Vocal Tract
Comprised of the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to:
Spectrally "color" the source, and
Generate new sources for sound production.

April 22 2023 Veton Keumlpuska 27

Spectral Shaping
Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants (resonance frequencies) change with different vocal tract configurations, as depicted in Figure 3.10.

April 22 2023 Veton Keumlpuska 28

Figure 3.10

April 22 2023 Veton Keumlpuska 29

Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant, all-pole linear system model of the vocal tract with a pole at z_0 = r_0 e^{jω_0} that corresponds approximately to a vocal tract formant:
The frequency of the formant is ω_0.
The bandwidth depends on the distance of the pole from the unit circle (r_0).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
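The link between a pole and its formant can be made concrete with a short sketch. It evaluates the product form of H(z) on the unit circle and converts between a pole and a (frequency, bandwidth) pair. The bandwidth relation B ≈ -ln(r_0)·fs/π is a common approximation for poles near the unit circle rather than something stated on the slides, and the sampling rate and formant values are assumptions.

```python
import cmath
import math


def pole_to_formant(pole, fs):
    """Map a z-plane pole r0 * exp(j * w0) to (frequency_hz, bandwidth_hz)."""
    r0, w0 = abs(pole), cmath.phase(pole)
    freq_hz = w0 * fs / (2 * math.pi)
    bw_hz = -math.log(r0) * fs / math.pi  # approximation for r0 close to 1
    return freq_hz, bw_hz


def formant_to_pole(freq_hz, bw_hz, fs):
    """Inverse mapping: pole radius from bandwidth, pole angle from frequency."""
    r0 = math.exp(-math.pi * bw_hz / fs)
    w0 = 2 * math.pi * freq_hz / fs
    return cmath.rect(r0, w0)


def all_pole_magnitude(poles, w, gain=1.0):
    """|H(e^{jw})| for H(z) = A / prod_k (1 - c_k z^-1)(1 - c_k* z^-1)."""
    z_inv = cmath.exp(-1j * w)
    denom = 1.0
    for c in poles:
        denom *= (1 - c * z_inv) * (1 - c.conjugate() * z_inv)
    return gain / abs(denom)


fs = 8000.0                                # assumed sampling rate in Hz
pole = formant_to_pole(500.0, 80.0, fs)    # hypothetical formant: 500 Hz, 80 Hz wide
w0 = 2 * math.pi * 500.0 / fs
peak = all_pole_magnitude([pole], w0)
off_peak = all_pole_magnitude([pole], 2 * w0)
```

Moving the pole closer to the unit circle (larger r_0) narrows the bandwidth and sharpens the spectral peak at ω_0.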

April 22 2023 Veton Keumlpuska 30

Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
Male speakers tend to have lower formants than female speakers.
Female speakers have lower formants than children.

Under the assumption of vocal tract linearity and time-invariance, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
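A minimal source-filter sketch of this convolution model follows. All numbers are assumptions for illustration: a single two-pole resonator stands in for the vocal tract (one formant at 500 Hz), a raised-sine pulse stands in for the glottal flow over one cycle, and the pitch period is 80 samples at an assumed 8 kHz sampling rate.

```python
import math


def resonator_impulse_response(freq_hz, bw_hz, fs, length):
    """Impulse response h[n] of a two-pole resonator (a single formant)."""
    r = math.exp(-math.pi * bw_hz / fs)
    w0 = 2 * math.pi * freq_hz / fs
    a1, a2 = 2 * r * math.cos(w0), -r * r
    h = []
    for n in range(length):
        x_n = 1.0 if n == 0 else 0.0
        y1 = h[n - 1] if n >= 1 else 0.0
        y2 = h[n - 2] if n >= 2 else 0.0
        h.append(x_n + a1 * y1 + a2 * y2)  # y[n] = x[n] + a1 y[n-1] + a2 y[n-2]
    return h


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


fs = 8000
P = 80                                                     # assumed pitch period (100 Hz)
g = [math.sin(math.pi * n / 40) ** 2 for n in range(40)]   # toy glottal pulse
p = [1.0 if n % P == 0 else 0.0 for n in range(4 * P)]     # impulse train, 4 periods
u = convolve(g, p)                                         # u[n] = g[n] * p[n]
h = resonator_impulse_response(500.0, 80.0, fs, 200)
x = convolve(h, u)                                         # x[n] = h[n] * (g[n] * p[n])
```

Because convolution is associative, the same output is obtained by first combining the glottal pulse and the vocal tract into a single response, `convolve(h, g)`, and then convolving with the impulse train.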

Vowels

April 22 2023 Veton Keumlpuska 31

April 22 2023 Veton Keumlpuska 32

Example 3.2
Consider a periodic glottal flow source of the form

u[n] = g[n] * p[n]

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

x[n] = h[n] * (g[n] * p[n])

A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment

x[n, τ] = w[n, τ]{h[n] * (g[n] * p[n])}

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained.

April 22 2023 Veton Keumlpuska 33

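Example 3.2

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency, or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).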

April 22 2023 Veton Keumlpuska 34

Example 3.2

April 22 2023 Veton Keumlpuska 35

Example 3.2
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
The manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because of the sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.

April 22 2023 Veton Keumlpuska 36

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

April 22 2023 Veton Keumlpuska 37

Example 3.3
A soprano singer often sings a tone whose first harmonic (the fundamental frequency, ω_1) is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

April 22 2023 Veton Keumlpuska 38

Figure 3.12

Nasal Sounds

April 22 2023 Veton Keumlpuska 40

Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

April 22 2023 Veton Keumlpuska 41

Spectral Shaping Nose

April 22 2023 Veton Keumlpuska 42

Spectral Shaping Mouse

April 22 2023 Veton Keumlpuska 43

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

April 22 2023 Veton Keumlpuska 45

Source Generation
The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
Build-up of pressure behind the palate, and
Abrupt release of pressure.

April 22 2023 Veton Keumlpuska 46

Source Generation: Plosives, "Drop"

[Figure labels: VOT (voice onset time), aspiration]

Fricatives

April 22 2023 Veton Keumlpuska 48

Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
There is no harmonic structure with these types of inputs.
The source spectrum is shaped at all frequencies by |H(ω)|.

Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

April 22 2023 Veton Keumlpuska 49

Source Generation: Fricatives, "NASA"

April 22 2023 Veton Keumlpuska 50

Source Generation
There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives - sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
Plosives - created with an impulsive source within the oral tract. Example: "top".
Whispers - a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, fricatives and plosives are not exclusively unvoiced: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
"zebra" vs. "sheba" - fricatives
"bin" vs. "pin" - plosives

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example - Hamming window:
w[n, τ] = 0.54 - 0.46 cos[2π(n - τ)/(N_w - 1)] for 0 ≤ n ≤ N_w - 1

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where x[n, τ] = w[n, τ] x[n] represents the windowed speech segments as a function of the window center at time τ.
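The Hamming window and the short-time Fourier transform can be sketched directly from their definitions. This is a slow, illustrative implementation rather than a practical one (a real system would use an FFT). The function names, the hop size, and the convention that τ marks the start of each window are assumptions of the sketch.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w, starting at n = 0."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft(x, n_w, hop, n_freq):
    """Return frames[t][k] = X(w_k, tau_t), w_k = 2*pi*k/n_freq, k = 0 .. n_freq // 2.

    tau_t = t * hop marks the start of the window (a convention of this sketch).
    """
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        row = []
        for k in range(n_freq // 2 + 1):
            wk = 2 * math.pi * k / n_freq
            row.append(sum(w[i] * x[tau + i] * cmath.exp(-1j * wk * (tau + i))
                           for i in range(n_w)))
        frames.append(row)
    return frames


def spectrogram(frames):
    """S(w, tau) = |X(w, tau)|^2 for every frame and frequency bin."""
    return [[abs(v) ** 2 for v in row] for row in frames]


# Test signal: a sinusoid that falls exactly on frequency bin 8 of a 64-point grid.
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
S = spectrogram(stft(x, n_w=64, hop=32, n_freq=64))
```

For this stationary test tone every frame of `S` peaks at the same bin; for speech, the location and shape of the peaks change from frame to frame, which is what the spectrogram displays.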

April 22 2023 Veton Keumlpuska 56

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as

S(ω, τ) = |X(ω, τ)|²

S(ω, τ) is a 2-D (two-dimensional) representation of the "energy density" of the signal.

For each window position τ, one could plot S(ω, τ).
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:

Narrowband - gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.

Wideband - gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n - kP]:

x[n, τ] = w[n, τ]{(p[n] * g[n]) * h[n]}
x[n, τ] = w[n, τ](p[n] * ĥ[n])

where the glottal waveform over a cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

April 22 2023 Veton Keumlpuska 61

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n, τ].

Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
the main lobes of the shifted window Fourier transforms are non-overlapping, and
the corresponding transform side lobes are negligible,
the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \left| \hat{H}(\omega_k) \right|^2 \left| W(\omega - \omega_k, \tau) \right|^2$$

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
Harmonic lines are "resolved" as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of Speech
Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
This widening causes the window transforms at neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta \left| \hat{H}(\omega) \right|^2 E[\tau]$$

where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} \left| x[n, \tau] \right|^2$$

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
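The window-length rule of thumb can be expressed as a small helper. The thresholds follow the slides (at least two pitch periods for narrowband, less than one pitch period for wideband). The sampling rate of 16 kHz and the 100 Hz pitch used in the example are assumptions, as are the function names and the "intermediate" label for window lengths in between.

```python
def window_samples(duration_ms, fs):
    """Number of samples in a window of the given duration."""
    return int(round(duration_ms * 1e-3 * fs))


def spectrogram_kind(window_ms, f0_hz):
    """Classify a window by its length relative to the pitch period 1 / f0."""
    pitch_period_ms = 1000.0 / f0_hz
    if window_ms >= 2 * pitch_period_ms:
        return "narrowband"    # resolves individual harmonics
    if window_ms < pitch_period_ms:
        return "wideband"      # resolves individual pitch pulses
    return "intermediate"


fs = 16000    # assumed sampling rate in Hz
f0 = 100.0    # assumed pitch in Hz (10-ms pitch period)
narrow = (window_samples(20, fs), spectrogram_kind(20, f0))
wide = (window_samples(4, fs), spectrogram_kind(4, f0))
```

For a higher-pitched voice the pitch period is shorter, so the same 4-ms window covers a larger fraction of a period and the wideband display begins to show harmonic structure.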

April 22 2023 Veton Keumlpuska 66

Figure 3.15

April 22 2023 Veton Keumlpuska 67

Figure 3.16

MATLAB

April 22 2023 Veton Keumlpuska 68

[Figure: MATLAB spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB

April 22 2023 Veton Keumlpuska 69

[Figure: MATLAB plot of the utterance "She had your dark suit in greasy wash water all year", annotated with word-level and phone-level labels. Top panel: waveform, Normalized Magnitude versus Time [s], 0 to 3. Bottom panel: spectrogram, Frequency [Hz] (0 to 8000) versus Time [s].]

MATLAB

April 22 2023 Veton Keumlpuska 70

[Figure: MATLAB spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB

April 22 2023 Veton Keumlpuska 71

[Figure: MATLAB plot of the utterance "She had your dark suit in greasy wash water all year", annotated with word-level and phone-level labels. Top panel: waveform, Normalized Magnitude versus Time [s], 0 to 3. Bottom panel: spectrogram, Frequency [Hz] (0 to 8000) versus Time [s].]

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract - place and manner of articulation: the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

April 22 2023 Veton Keumlpuska 73

Elements of a Language
Phoneme - a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.

If the first two descriptors (the nature of the source and the shape of the vocal tract) are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two descriptors (the time-domain waveform and the spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

April 22 2023 Veton Keumlpuska 74

Elements of a Language
One broad classification for the English language is in terms of:
Vowels
Consonants
Diphthongs
Affricates, and
Semi-vowels

This classification is illustrated in Figure 3.17 on the next slide.

April 22 2023 Veton Keumlpuska 75

Figure 3.17

April 22 2023 Veton Keumlpuska 76

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open
Tongue position and height: front, central, or back along the palate
Constriction: partial or complete
Velum state: nasal sound or not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
11 in Polynesian
141 in the "click" language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).

The articulatory properties are influenced by:
Adjacent phonemes
Speaking rate
Emphasis in speaking, and
The time-varying nature of the articulators

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception - uses articulatory features derived from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

April 22 2023 Veton Keumlpuska 78

Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  Articulatory differences between speakers are one cause of allophonic variation.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ between speakers; therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System:
  The location of the constriction made by the tongue determines which sound is produced: at the back, center, or front of the oral tract, as well as at the teeth or lips.
Spectrogram:
  Noise-like; energy is concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.

A simplified model of the periodic component of the voiced fricative at the lips output is

$x_g[n] = h[n] * (g[n] * p[n])$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$.

The noise source component is the turbulent airflow velocity at the constriction, denoted by $q[n]$ and assumed to be white noise. The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\,u[n])$

Example 3.4
In this simplified model we assume that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
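The simplified voiced-fricative model of Example 3.4, $x[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$ with $u[n] = g[n] * p[n]$, can be sketched numerically. This is only an illustration under stated assumptions: the pulse shape `g`, the two-tap responses `h` and `hf`, and the helper names `convolve` and `voiced_fricative` are all hypothetical toy choices, not values from the slides.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences (lists of floats)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def voiced_fricative(g, P, n_periods, h, hf, seed=0):
    """Return (u, xg, xq, x) for the simplified voiced-fricative model."""
    rng = random.Random(seed)
    N = P * n_periods
    p = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # impulse train, period P
    u = convolve(g, p)[:N]                                # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(N)]           # white-noise source q[n]
    xg = convolve(h, u)[:N]                               # periodic component
    modulated = [qn * un for qn, un in zip(q, u)]         # q[n] u[n]
    xq = convolve(hf, modulated)[:N]                      # noise component
    x = [a + b for a, b in zip(xg, xq)]                   # x[n] = xg[n] + xq[n]
    return u, xg, xq, x

# Toy values (assumed): a short triangular glottal pulse and two-tap responses.
g = [0.0, 0.5, 1.0, 0.5]
u, xg, xq, x = voiced_fricative(g, P=8, n_periods=3, h=[1.0, 0.5], hf=[1.0, -0.5])
```

Because the noise is multiplied by $u[n]$ before filtering, the noise component `xq` vanishes wherever the glottal flow has been zero for longer than the length of `hf`, so in this sketch the noise arrives in bursts at the pitch rate.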

Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a “noisy” spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the onset of voicing of the vowel.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (waveform)

Elements of a Language: Plosives (spectrogram)

Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
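The generalized convolution of Example 3.5, in which the output at time $n$ sums $h[n, m]\,u[n - m]$ and $h_f[n, m]\,\delta[n - m]$ over $m$, can be sketched directly. The code below is a minimal illustration, assuming causal responses and sequences that start at $n = 0$; the particular responses `h` and `hf` (a one-pole decay that drifts with time, and a direct path with one echo) are hypothetical examples, not models from the slides.

```python
def time_varying_output(u, burst, h, hf):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) burst[n-m].

    h(n, m) and hf(n, m) give the response at time n to a unit sample
    applied m samples earlier; only m = 0..n is used (causal, zero history).
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):
            acc += h(n, m) * u[n - m] + hf(n, m) * burst[n - m]
        x.append(acc)
    return x

# Hypothetical responses: a one-pole decay whose rate drifts with time n,
# and a front-cavity response made of a direct path plus one echo.
def h(n, m):
    a = 0.5 + 0.01 * n
    return a ** m

def hf(n, m):
    return {0: 1.0, 1: 0.3}.get(m, 0.0)

P = 4
u = [1.0 if n % P == 0 else 0.0 for n in range(12)]   # periodic source (impulses only)
burst = [1.0] + [0.0] * 11                            # idealized burst at n = 0
x = time_varying_output(u, burst, h, hf)
```

When `h(n, m)` does not depend on `n`, the first sum reduces to ordinary convolution, which is a convenient sanity check on the implementation.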

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: “hide” (Y), “out” (W), “boy” (O), “new” (JU).

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds.
  Glides (w as in “we” and y as in “you”)
  Liquids (r as in “read” and l as in “let”)
Compared to diphthongs, glides have:
  Greater constriction of the oral tract during the transition
  Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
  tS as in “chew” (unvoiced)
  J as in “just” (voiced)

Coarticulation
The vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur at different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time (“horse” vs. “horseshoe”, “sweep” vs. “seep”).
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 6: Speech Processing

Introduction
Distinguishable speech sounds are determined not only by the source but also by different vocal tract configurations, and by combinations of both.

Speech sound classes are referred to as phonemes.
  Phonemics is the discipline that studies phoneme realizations (e.g., in a language).
  Each phoneme class provides a certain meaning in a word.
  Within a phoneme class there exist many sound variations that provide the same meaning (e.g., homonyms). The study of these sound variations is called phonetics.

Phonemes are the basic building blocks of a language. They are concatenated (more or less) as discrete elements into words, according to certain phonemic and grammatical rules.

Introduction
This chapter will cover:
  A description of the speech production mechanism
  The resulting variety of phonetic sound patterns
  How these sounds differ among different speakers

Anatomy and Physiology of Speech Production

Introduction

April 22 2023 Veton Keumlpuska 8

Anatomy and Physiology of Speech Production
The anatomy of speech production is shown in Figure 3.2.

Lungs
  Inhalation and exhalation of air.
  Connected to the vocal tract through the trachea (“windpipe”), a pipe ~12 cm long and ~1.5-2 cm in diameter, and the epiglottis.
  During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production:
    The duration of exhalation becomes roughly equal to the length of the sentence or phrase.
    Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.

Anatomy and Physiology of Speech Production
Larynx
  A complicated system of cartilages, flesh, muscles, and ligaments.
  Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
  The vocal folds are ~15 mm long in men and ~13 mm long in women.

Larynx

Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
  Breathing: arytenoid cartilages are held outward
  Voiced: arytenoid cartilages are held close together
  Unvoiced: arytenoid cartilages are held outward or partially closed

The complex motion of the vocal folds is illustrated in Figure 3.4.

Nonlinear two-mass model of Flanagan et al. (Figure 3.5)

Dictionary: arytenoid (ar·y·te·noid)
  Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid
  Function: adjective
  Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally “ladle-shaped”, from arytaina, ladle
  Date: circa 1751
  1. relating to or being either of two small laryngeal cartilages to which the vocal cords are attached
  2. relating to or being either of a pair of small muscles or an unpaired muscle of the larynx
  (arytenoid, noun)

Anatomy and Physiology of Speech Production
Flanagan et al. model

Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6:
  Closed phase: the folds are closed and no flow occurs.
  Open phase: the folds are open and the flow increases up to a maximum.
  Return phase: the time interval from the maximum airflow until glottal closure.
The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

The glottal airflow is referred to as the glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.

Anatomy and Physiology of Speech Production

Example 3.1
Consider a glottal flow waveform model of the form

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as

$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, $\circledast$ denotes convolution in frequency, and $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency, or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum:

Figure 3.7

Example 3.1
A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.

The magnitude of the spectral shaping function, i.e., of the glottal flow, $|G(\omega_k)|$, is referred to as the spectral envelope of the harmonics.
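The relation between pitch period and harmonic spacing, $\omega_k = (2\pi/P)k$, can be checked with a few lines of arithmetic. The sketch below assumes a sampling rate `fs` (not given in the slides) so that the harmonics can also be expressed in Hz as $f_k = k f_s / P$; the function name `harmonics` is hypothetical.

```python
import math

def harmonics(P, fs, count):
    """Return (omega_k in rad/sample, f_k in Hz) for k = 1..count."""
    omegas = [2.0 * math.pi * k / P for k in range(1, count + 1)]
    hertz = [k * fs / P for k in range(1, count + 1)]
    return omegas, hertz

# Assumed sampling rate of 8000 Hz; P is the pitch period in samples.
_, long_period = harmonics(P=80, fs=8000, count=3)    # 100 Hz pitch
_, short_period = harmonics(P=40, fs=8000, count=3)   # 200 Hz pitch
```

Halving the pitch period from 80 to 40 samples doubles both the fundamental frequency and the spacing between harmonics.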

Anatomy and Physiology of Speech Production
The Fourier transform of a periodic glottal waveform is characterized by harmonics.
  Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  The pitch period can vary “randomly” over successive periods: pitch “jitter”.
  The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude “shimmer”.

These variations are (perhaps) due to:
  Time-varying characteristics of the vocal tract and vocal folds
  Nonlinear behavior in the speech anatomy
  An underlying deterministic (chaotic) system whose output appears random

Jitter and shimmer are one component that gives vowels their naturalness.
  In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
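Jitter and shimmer can be illustrated by perturbing the ideal impulse train $p[n]$ of Example 3.1. The sketch below is one simple way to do this, assuming independent uniform perturbations of each pitch period and pulse amplitude; real jitter and shimmer need not follow this distribution, and the function name and parameters are hypothetical.

```python
import random

def jittered_impulse_train(P, n_pulses, jitter=0.0, shimmer=0.0, seed=0):
    """Impulse train with nominal period P samples.

    jitter  : maximum relative deviation of each pitch period (0.05 = 5%)
    shimmer : maximum relative deviation of each pulse amplitude
    Returns (p, positions), where p is the sample sequence.
    """
    rng = random.Random(seed)
    positions, amplitudes = [], []
    t = 0
    for _ in range(n_pulses):
        positions.append(t)
        amplitudes.append(1.0 + shimmer * rng.uniform(-1.0, 1.0))
        period = P * (1.0 + jitter * rng.uniform(-1.0, 1.0))
        t += max(1, round(period))                 # next pulse location
    p = [0.0] * (positions[-1] + 1)
    for pos, amp in zip(positions, amplitudes):
        p[pos] = amp
    return p, positions

ideal, ideal_pos = jittered_impulse_train(P=100, n_pulses=5)
natural, natural_pos = jittered_impulse_train(P=100, n_pulses=50,
                                              jitter=0.05, shimmer=0.1, seed=1)
```

With `jitter=0` and `shimmer=0` the function reproduces the ideal train; convolving either train with a glottal pulse $g[n]$ gives the corresponding glottal flow.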

Anatomy and Physiology of Speech Production
States of the vocal folds: breathing, voicing, and unvoicing.
  Unvoicing: turbulence at the vocal folds (aspiration). Example: “he”; whispered sounds.
  Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part remains nearly fixed.

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
  Vocal fry: the vocal folds are massive and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
  Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract
Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
  The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
The purpose of the vocal tract is to:
  Spectrally “color” the source
  Generate new sources for sound production

Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  The frequency of the formant is $\omega_0$.
  The bandwidth depends on the distance of the pole from the unit circle (set by $r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
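The link between a pole $z_0 = r_0 e^{j\omega_0}$ and a formant can be made concrete. The sketch below converts between a pole and a formant frequency and bandwidth in Hz, and evaluates the magnitude of an all-pole product-form response. The sampling rate `fs` and the bandwidth relation $B \approx -(f_s/\pi)\ln r_0$ are assumptions brought in for illustration (the latter is a commonly used approximation for poles near the unit circle, not a formula from the slides), and all function names are hypothetical.

```python
import cmath
import math

def pole_to_formant(z0, fs):
    """Formant frequency and approximate bandwidth (Hz) of a pole z0 = r0 e^{j w0}."""
    r0 = abs(z0)
    omega0 = cmath.phase(z0)
    freq_hz = omega0 * fs / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * fs / math.pi    # assumed approximation
    return freq_hz, bandwidth_hz

def formant_to_pole(freq_hz, bandwidth_hz, fs):
    """Inverse mapping: place a pole for a given formant frequency and bandwidth."""
    r0 = math.exp(-math.pi * bandwidth_hz / fs)
    return cmath.rect(r0, 2.0 * math.pi * freq_hz / fs)

def magnitude_response(poles, omega, gain=1.0):
    """|H(e^{jw})| for H(z) = gain / prod_k (1 - c_k z^-1)(1 - c_k* z^-1)."""
    z_inv = cmath.exp(-1j * omega)
    den = 1.0
    for c in poles:
        den *= (1 - c * z_inv) * (1 - c.conjugate() * z_inv)
    return abs(gain / den)

# Assumed example: a 500 Hz formant with 80 Hz bandwidth at fs = 8000 Hz.
pole = formant_to_pole(500.0, 80.0, 8000.0)
f, b = pole_to_formant(pole, 8000.0)
```

Moving the pole closer to the unit circle (larger $r_0$) narrows the bandwidth and sharpens the spectral peak near $\omega_0$.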

Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.

Under the assumptions that the vocal tract is linear and time-invariant and that the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
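The dependence of formant frequencies on vocal tract length can be illustrated with the classical uniform-tube idealization, which is an assumption added here and not a model taken from these slides: a lossless tube of length $L$, closed at the glottis and open at the lips, resonates at $F_k = (2k - 1)c / (4L)$. The speed of sound `c_cm_per_s` is likewise an assumed round value.

```python
def uniform_tube_formants(length_cm, count=3, c_cm_per_s=34000.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end and open at the other."""
    return [(2 * k - 1) * c_cm_per_s / (4.0 * length_cm)
            for k in range(1, count + 1)]

adult_male = uniform_tube_formants(17.0)     # 17 cm tract length
shorter_tract = uniform_tube_formants(14.0)  # hypothetical shorter tract
```

In this idealization every formant scales inversely with length, so the shorter tract has uniformly higher formants, in line with the trend across male, female, and child speakers described above.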

Vowels

Example 3.2
Consider a periodic glottal flow source of the form

$u[n] = g[n] * p[n]$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$x[n] = h[n] * (g[n] * p[n])$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained:

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency, or pitch.

Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, where it consists only of the glottal contribution).

Example 32

April 22 2023 Veton Keumlpuska 35

Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
  The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing)
  The manner in which formant tails add

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ only sparsely. This is especially the case for high-pitched speech.
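The sparse sampling of the spectral envelope by the harmonics can be illustrated with simple arithmetic: the envelope is only observed at multiples of the fundamental, so the harmonic closest to a formant may lie far from it when the pitch is high. The numbers below (a 500 Hz formant, pitches of 100 Hz and 300 Hz) and the function name are hypothetical.

```python
def nearest_harmonic(formant_hz, f0_hz):
    """Return the harmonic of f0_hz closest to formant_hz (at least the first harmonic)."""
    k = max(1, round(formant_hz / f0_hz))
    return k * f0_hz

low_pitch = nearest_harmonic(500.0, 100.0)    # harmonics every 100 Hz
high_pitch = nearest_harmonic(500.0, 300.0)   # harmonics every 300 Hz
miss_low = abs(low_pitch - 500.0)
miss_high = abs(high_pitch - 500.0)
```

With the low pitch a harmonic falls directly on the formant, whereas with the high pitch the closest harmonic misses it by a third of the harmonic spacing, which is why peak-picking on $|X(\omega, \tau)|$ becomes unreliable for high-pitched speech.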

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

April 22 2023 Veton Keumlpuska 40

Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
  When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: “nose” and “mouse”.

Spectral Shaping: “Nose”

Spectral Shaping: “Mouse”

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

April 22 2023 Veton Keumlpuska 45

Source Generation
The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  Build-up of pressure behind the palate
  Abrupt release of pressure

Source Generation: Plosives, “Drop” (figure labels: VOT, Aspiration)

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); this is used to generate turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, “NASA”

Source Generation
There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”.
  Plosives: created with an impulsive source within the oral tract. Example: “top”.
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”.

However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
  Fricatives: “zebra” vs. “sheba”
  Plosives: “bin” vs. “pin”

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word “to”. A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding

(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to

avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum

Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1

Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal

wherex[n]= w[n]x[n]

represents the windowed speech segments as function of the window center at time

n

njenxX ][)(
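As a rough illustration of the sliding-window analysis described on this slide, the following Python sketch computes a Hamming-windowed short-time Fourier transform using only the standard library. The function names (`hamming`, `stft_frame`, `spectrogram`), the two-tone test signal, and the window and hop sizes are assumptions made for this example and do not come from the slides; τ is taken here as the first sample of the window.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46*cos(2*pi*n/(n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft_frame(x, tau, n_w, n_fft):
    """DFT of the windowed segment that starts at sample tau.

    Samples past the end of x count as zero. The phase reference is the
    window start rather than n = 0, which does not change the magnitude.
    """
    w = hamming(n_w)
    seg = [w[n] * (x[tau + n] if tau + n < len(x) else 0.0) for n in range(n_w)]
    return [
        sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_w))
        for k in range(n_fft)
    ]


def spectrogram(x, n_w, hop, n_fft):
    """Squared STFT magnitude, one row per frame, bins 0 .. n_fft/2."""
    rows = []
    for tau in range(0, len(x) - n_w + 1, hop):
        frame = stft_frame(x, tau, n_w, n_fft)
        rows.append([abs(v) ** 2 for v in frame[: n_fft // 2 + 1]])
    return rows


# Toy signal (assumed): a 1 kHz tone followed by a 3 kHz tone, sampled at 8 kHz.
fs = 8000
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)]
x += [math.sin(2 * math.pi * 3000 * n / fs) for n in range(400)]

S = spectrogram(x, n_w=64, hop=32, n_fft=64)
# Index of the strongest frequency bin in each frame (bin spacing is fs/n_fft = 125 Hz).
peak_bins = [row.index(max(row)) for row in S]
```

A single Fourier transform of the whole signal would show both tones at once; the per-frame peak bins make the change from one tone to the other visible along the time axis.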

Spectrographic Analysis of Speech

  • The spectrogram is the graphical display of

    $$ S(\omega, \tau) = |X(\omega, \tau)|^2 $$

  • S(ω, τ) is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position τ one could plot S(ω, τ). A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
      • Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
      • Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.


Wide-band Spectrogram


Narrow-band Spectrogram

Spectrographic Analysis of Speech

  • For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n - kP]:

    $$ x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n]) $$

    where the glottal waveform over one cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n].
  • From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

    $$ S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2 $$

    where Ĥ(ω) = H(ω)G(ω), ω_k = (2π/P)k, and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech

  • The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n, τ].
  • Narrowband spectrogram
      • Uses a "long" window with a duration of typically at least two pitch periods.
      • Under the conditions that
          • the main lobes of the shifted window Fourier transforms are non-overlapping, and
          • the corresponding transform side lobes are negligible,
        the equation on the previous slide yields the following approximation (Exercise 3.8):

    $$ S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2 $$
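To see why the narrowband approximation follows from the two stated conditions, a short derivation is sketched here. The shorthand a_k and W_k(ω) is introduced only for this note and is not the slides' notation; Ĥ(ω) = H(ω)G(ω) denotes the combined glottal and vocal tract response, and ω_k = (2π/P)k.

$$
\begin{aligned}
S(\omega, \tau) &= \frac{1}{P^2} \Big| \sum_k a_k W_k(\omega) \Big|^2,
\qquad a_k = \hat{H}(\omega_k), \quad W_k(\omega) = W(\omega - \omega_k, \tau) \\
&= \frac{1}{P^2} \Big[ \sum_k |a_k|^2 |W_k(\omega)|^2
 + \sum_k \sum_{l \neq k} a_k a_l^{*}\, W_k(\omega) W_l^{*}(\omega) \Big]
\end{aligned}
$$

If the main lobes do not overlap and the side lobes are negligible, then at any frequency ω at most one W_k(ω) is non-negligible, so each cross term W_k(ω) W_l*(ω) with l ≠ k is approximately zero. Only the first sum survives, which is the narrowband approximation.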

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive that is closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

  • The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

    $$ S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau] $$

    where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

    $$ E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2 $$
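The energy term E[τ] can be explored numerically. The sketch below is an illustration under stated assumptions rather than the slides' own method: it uses a rectangular window instead of a tapered one, a toy excitation made of decaying pulses, and an assumed pitch period of 80 samples. The names `short_time_energy`, `e_short`, and `e_long` are hypothetical.

```python
def short_time_energy(x, n_w, hop=1):
    """E[tau] = sum of x[n]**2 over a rectangular window of n_w samples starting at tau."""
    return [
        sum(v * v for v in x[tau : tau + n_w])
        for tau in range(0, len(x) - n_w + 1, hop)
    ]


# Idealized voiced waveform (assumed): one decaying pulse every P samples.
P = 80                                   # pitch period in samples, e.g. 10 ms at 8 kHz
pulse = [0.9 ** n for n in range(40)]    # pulse is shorter than one pitch period
x = [0.0] * (P * 6)
for start in range(0, len(x), P):
    for n, v in enumerate(pulse):
        if start + n < len(x):
            x[start + n] += v

e_short = short_time_energy(x, n_w=32)   # window shorter than one pitch period
e_long = short_time_energy(x, n_w=160)   # window spanning exactly two pitch periods
```

With the short window the energy rises and falls once per pitch period, which is the mechanism behind the vertical striations of a wideband spectrogram. With the window spanning whole pitch periods the energy stays essentially constant as the window slides.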

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
      • The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
  • Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
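The two window lengths quoted above can be compared with a back-of-the-envelope calculation. The sketch below uses the common textbook approximation that a Hamming window of duration T seconds has a null-to-null main-lobe width of about 4/T Hz; that approximation, the assumed 200 Hz pitch, and the function names are not taken from the slides.

```python
def hamming_mainlobe_width_hz(duration_s):
    """Approximate null-to-null main-lobe width of a Hamming window, in Hz.

    Textbook approximation: 8*pi/N_w rad/sample, i.e. about 4/T Hz for a
    window of duration T seconds.
    """
    return 4.0 / duration_s


def lobe_to_spacing_ratio(duration_s, pitch_hz):
    """Main-lobe width divided by the harmonic spacing (the pitch)."""
    return hamming_mainlobe_width_hz(duration_s) / pitch_hz


pitch_hz = 200.0                                    # assumed pitch: 5-ms pitch period
narrow = hamming_mainlobe_width_hz(0.020)           # 20-ms window
wide = hamming_mainlobe_width_hz(0.004)             # 4-ms window
narrow_ratio = lobe_to_spacing_ratio(0.020, pitch_hz)
wide_ratio = lobe_to_spacing_ratio(0.004, pitch_hz)
```

A ratio near or below one suggests that neighboring harmonic lobes barely overlap, while a ratio well above one suggests that several harmonics merge under a single lobe. The exact values depend on the pitch that is assumed.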

Figure 3.15

Figure 3.16

MATLAB

[Figure: MATLAB spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

[Figure: MATLAB waveform (normalized magnitude) and spectrogram (Time [s] vs. Frequency [Hz]) of the utterance "She had your dark suit in greasy wash water all year", with word and phone labels.]

Categorization of Speech Sounds

  • A sound source can be created with either the vocal folds or a constriction in the vocal tract.
  • Speech sounds can also be classified from the following perspectives:
      1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
      2. The shape of the vocal tract (place and manner of articulation):
          • the place of the tongue hump along the oral tract, and
          • the degree of constriction of the hump.
          • The shape is also determined by a possible connection to the nasal passage by way of the velum.
      3. The time-domain waveform, which gives the pressure change with time at the lips output.
      4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

  • Phoneme: a fundamental distinctive unit of a language.
      • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
      • Syllables contain one or more phonemes.
      • Words are formed from one or more syllables.
      • Phrases are concatenations of words.
  • If the first two descriptors (the nature of the source and the shape of the vocal tract) are used to study speech sounds, the study is referred to as articulatory phonetics.
  • If the last two descriptors (the time-domain waveform and the time-varying spectral characteristics) are used, it is referred to as acoustic phonetics.

Elements of a Language

  • One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
  • This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

  • Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
  • Articulatory features (corresponding to the first two category descriptors) include:
      • Vocal fold state: vibrating or open.
      • Tongue position and height: front, central, or back along the palate.
      • Constriction: partial or complete.
      • Velum state: nasal sound or not a nasal sound.

Elements of a Language

  • In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
      • 11 in Polynesian
      • 141 in the "click" language of Khoisan
  • The rules of a language define which phones can be strung together, and how, to form words.
      • In Italian, consonants are not normally allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
  • The articulatory properties are influenced by:
      • adjacent phonemes,
      • speaking rate,
      • emphasis in speaking, and
      • the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
      • Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
  • Motor theory of perception: uses articulatory features derived from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

  • Source: quasi-periodic.
      • Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
  • In spite of the specific properties of different vowels, there is much variability in vowel characteristics among speakers.
      • Articulatory differences among speakers are one cause of allophonic variations.
      • The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker; therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System:
      • The velum is lowered and the air flows mainly through the nasal cavity.
      • Because the oral tract is constricted, the sound is radiated at the nostrils.
      • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      • Dominated by the low resonance of the large volume of the nasal cavity.
      • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
      • These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
      • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

  • There are two broad classes of fricatives: voiced and unvoiced.
  • Source:
      • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
  • System: the location of the constriction formed by the tongue, teeth, or lips determines which sound is produced:
      • back, center, or front of the oral tract, as well as
      • at the teeth or lips.
  • Spectrogram: noise-like; energy is concentrated at higher frequencies.

Example 3.4

  • A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

    $$ u[n] = g[n] * p[n] $$

      • g[n] is the glottal flow over one cycle.
      • p[n] is an impulse train with pitch period P.
  • In a simplified model of the voiced fricative, the periodic component of the output at the lips is

    $$ x_g[n] = h[n] * (g[n] * p[n]) $$

      • h[n] is the impulse response of a linear time-invariant vocal tract, excited by the periodic signal u[n].
  • The noise source component is modeled by a turbulent airflow velocity source at the constriction, denoted q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:

    $$ x_q[n] = h_f[n] * (q[n]\, u[n]) $$

Example 3.4

  • In the simplified model, the results of the two airflow sources are assumed to add:

    $$ x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n]) $$

  • See Exercise 3.10 for special characteristics of x[n].
  • Issues that have been ignored:
      • u[n] is modified by the oral cavity.
      • x_q[n] can be influenced by the back cavity.
      • Sources of nonlinear effects (distributed sources due to traveling vortices).
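The simplified voiced-fricative model of Example 3.4 can be sketched in a few lines of Python. Everything numeric here is a placeholder chosen for illustration (the pulse shape `g`, the impulse responses `h` and `h_f`, the pitch period, and the Gaussian noise source); the function names are hypothetical, and the ignored effects listed above are not modeled.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def voiced_fricative(g, P, n_periods, h, h_f, seed=0):
    """Return (u, x_g, x_q, x) with x = h * u + h_f * (q . u) and u = g * p."""
    p = [1.0 if n % P == 0 else 0.0 for n in range(P * n_periods)]
    u = convolve(g, p)[: len(p)]              # periodic glottal flow
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in u]      # white-noise source at the constriction
    x_g = convolve(h, u)[: len(u)]            # periodic component at the lips
    modulated = [qv * uv for qv, uv in zip(q, u)]
    x_q = convolve(h_f, modulated)[: len(u)]  # noise modulated by the glottal flow
    return u, x_g, x_q, [a + b for a, b in zip(x_g, x_q)]


# Placeholder shapes, chosen only for illustration.
g = [0.0, 0.5, 1.0, 0.5, 0.0]     # glottal flow over one cycle
h = [1.0, 0.6, 0.2]               # vocal tract impulse response
h_f = [1.0, -0.8]                 # front-cavity impulse response
u, x_g, x_q, x = voiced_fricative(g, P=20, n_periods=4, h=h, h_f=h_f)
```

Because the noise is multiplied by u[n] before filtering, the noise component vanishes wherever the glottal flow is zero. In this toy setup that means the noise appears in bursts locked to the pitch period.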

Elements of a Language: Fricatives

  • Spectrogram:
      • Unvoiced fricatives are characterized by a "noisy" spectrum, while
      • voiced fricatives often show both noise and harmonics.
  • Waveform:
      • An unvoiced fricative contains only noise.
      • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

  • Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
  • As with fricatives, plosives can be voiced or unvoiced.
  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  • Sequence of events:
      1. Complete closure of the oral tract and buildup of air pressure.
      2. Release of air pressure and generation of turbulence over a very short time duration.
      3. Generation of aspiration due to turbulence at the open vocal folds.
      4. Onset of the following vowel about 40-50 ms after the burst.
  • With voiced plosives, the vocal folds vibrate for the duration of all four steps.
      • While the oral tract is closed, we hear a low-frequency vibration due to the propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
      • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
      • There is a much shorter delay between the burst and the voicing of the vowel onset.
  • Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

  • A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is

    $$ u[n] = g[n] * p[n] $$

      • g[n] is the glottal flow over one cycle.
      • p[n] is an impulse train with pitch period P.
  • Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel.
      • This implies that the vocal tract output cannot be obtained with the convolution operator.
      • The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
  • In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m].
      • h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
  • The output can then be written using a generalization of the convolution operator:

    $$ x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m] $$

  • We have assumed that the two outputs can be linearly combined.
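The generalized convolution of Example 3.5 can be written directly as a double loop. The sketch below is only an illustration: the responses `h_fixed`, `h_varying`, and `h_front` are hypothetical one-pole decays invented for the example, the responses are assumed causal (zero for m < 0), and the input is a bare impulse train rather than a realistic glottal flow.

```python
def time_varying_output(u, h):
    """x[n] = sum_m h(n, m) * u[n - m], where h(n, m) is the response at
    time n to a unit sample applied m samples earlier (causal: m >= 0)."""
    x = []
    for n in range(len(u)):
        x.append(sum(h(n, m) * u[n - m] for m in range(n + 1)))
    return x


def h_fixed(n, m):
    """Time-invariant special case: one-pole decay 0.5**m, independent of n."""
    return 0.5 ** m


def h_varying(n, m):
    """Hypothetical time-varying tract: the decay factor drifts from 0.5 up to 0.9."""
    a = min(0.5 + 0.004 * n, 0.9)
    return a ** m


def h_front(n, m):
    """Hypothetical front-cavity response excited by the burst."""
    return 0.3 ** m


P = 10
u = [1.0 if n % P == 0 else 0.0 for n in range(50)]   # periodic source (impulse train)
burst = [1.0] + [0.0] * 49                            # burst idealized as delta[n]

x_fixed = time_varying_output(u, h_fixed)             # reduces to ordinary convolution
x = [a + b for a, b in zip(time_varying_output(u, h_varying),
                           time_varying_output(burst, h_front))]
```

When h(n, m) does not depend on n, the sum collapses to the familiar convolution, which gives a convenient sanity check for the time-varying form.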

Elements of a Language: Transitional Speech Sounds

  • Diphthongs
      • Vowel-like in nature, with vibrating vocal folds.
      • Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations.
      • Characterized by movement from one vowel target to another.
      • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

  • Semi-vowels: two categories of vowel-like sounds:
      • Glides (/w/ as in "we" and /y/ as in "you"), and
      • Liquids (/r/ as in "read" and /l/ as in "let").
  • Compared to diphthongs, glides have
      • greater constriction of the oral tract during the transition, and
      • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

  • Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples:
      • /tS/ as in "chew" (unvoiced)
      • /J/ as in "just" (voiced)

Coarticulation

  • The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  • Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
  • Coarticulation can occur on different temporal levels:
      • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe" and "sweep" vs. "seep".
      • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

  • The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
      • intonation (change in pitch),
      • amplitude/energy (loudness), and
      • timing (articulation rate or rhythm).
  • These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 7: Speech Processing

Introduction

  • This chapter will cover:
      • a description of the speech production mechanism,
      • the resulting variety of phonetic sound patterns, and
      • how these sounds differ among different speakers.

Anatomy and Physiology of Speech Production

Introduction


  • The anatomy of speech production is shown in Figure 3.2.

Lungs

  • Inhalation and exhalation of air.
  • Connected to the vocal tract through the trachea ("windpipe"), a ~12-cm-long and ~1.5-2-cm-diameter pipe, and the epiglottis.
  • During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production:
      • The duration of exhalation becomes roughly equal to the length of the sentence or phrase.
      • Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.

Anatomy and Physiology of Speech Production

Larynx

  • A complicated system of cartilages, flesh, muscles, and ligaments.
  • Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
  • The vocal folds are ~15 mm long in men and ~13 mm long in women.

Anatomy and Physiology of Speech Production

  • Three primary states of the vocal folds:
      • Breathing: the arytenoid cartilages are held outward.
      • Voiced: the arytenoid cartilages are held close together.
      • Unvoiced: the arytenoid cartilages are held outward or partially closed.
  • The complex motion of the vocal folds is illustrated in Figure 3.4.
  • Nonlinear two-mass model of Flanagan et al. (Figure 3.5).

Dictionary: arytenoid (ar·y·te·noid)

  • Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid
  • Function: adjective
  • Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, ladle
  • Date: circa 1751
  • 1: relating to or being either of two small laryngeal cartilages to which the vocal cords are attached
  • 2: relating to or being either of a pair of small muscles or an unpaired muscle of the larynx
  • arytenoid, noun

Anatomy and Physiology of Speech Production: Flanagan et al. model

Anatomy and Physiology of Speech Production

  • If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6:
      • Closed phase: the folds are closed and no flow occurs.
      • Open phase: the folds are open and the flow increases up to a maximum.
      • Return phase: the time interval from the maximum airflow until the glottal closure.
  • The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.
  • The glottal airflow is referred to as the glottal flow.
  • The time duration of one glottal cycle is referred to as the pitch period.
  • The reciprocal of the pitch period is referred to as the pitch, or the fundamental frequency.

Example 3.1

  • Consider a glottal flow waveform model of the form

    $$ u[n] = g[n] * p[n] $$

    where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P:

    $$ p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP] $$

  • Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by w[n, τ], is centered at time τ, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as

    $$ u[n, \tau] = w[n, \tau]\,(g[n] * p[n]) $$

  • Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

    $$ U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] $$

Example 3.1

    $$ U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau) $$

  • Here W(ω, τ) is the Fourier transform of w[n, τ], G(ω) is the Fourier transform of g[n], and ω_k = (2π/P)k, where 2π/P is the fundamental frequency or pitch.
  • As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0 with lower surrounding side lobes.
  • Effect of the harmonics of the glottal waveform on the spectrum: see Figure 3.7.

Figure 3.7

Example 3.1

  • A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k.
  • The first harmonic is also the fundamental frequency.
  • At each harmonic frequency there is a translated window Fourier transform W(ω - ω_k) weighted by G(ω_k).
  • The magnitude of the spectral shaping function, i.e. the glottal flow spectrum |G(ω_k)|, is referred to as the spectral envelope of the harmonics.
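The harmonic structure described in Example 3.1 can be checked numerically. The sketch below builds a windowed periodic pulse train and evaluates its DFT magnitude with the standard library only; the pulse shape, the 16-sample pitch period, the window length, and the names `dft_mag`, `harmonic_bins`, and `midpoints` are assumptions made for this illustration.

```python
import cmath
import math


def dft_mag(x, n_fft):
    """Magnitude of the n_fft-point DFT of x (zero-padded), bins 0 .. n_fft/2."""
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * n / n_fft) for n, v in enumerate(x)))
        for k in range(n_fft // 2 + 1)
    ]


P = 16                                   # pitch period in samples (assumed)
g = [0.2, 0.6, 1.0, 0.6, 0.2]            # toy glottal pulse over one cycle
n_w = 128                                # analysis window spans 8 pitch periods
n_fft = 256
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

# u[n] = g[n] * p[n]: the pulse repeats every P samples.
u = [g[n % P] if n % P < len(g) else 0.0 for n in range(n_w)]
U = dft_mag([wv * uv for wv, uv in zip(w, u)], n_fft)

# Harmonics are expected at omega_k = (2*pi/P)*k, i.e. at DFT bin k * n_fft / P.
harmonic_bins = [k * n_fft // P for k in range(1, 4)]     # bins 16, 32, 48
midpoints = [b + n_fft // (2 * P) for b in harmonic_bins] # halfway to the next harmonic
```

With a window this long, the spectrum shows a copy of the window transform at each harmonic, scaled by the pulse spectrum, and very little energy halfway between harmonics. Halving `P` doubles the spacing of `harmonic_bins`.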

Anatomy and Physiology of Speech Production

  • The Fourier transform of the periodic glottal waveform is characterized by harmonics.
  • Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
      • The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
  • The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
      • The pitch period can vary "randomly" over successive periods: pitch "jitter".
      • The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
  • These variations are (perhaps) due to:
      • time-varying characteristics of the vocal tract and vocal folds,
      • nonlinear behavior in the speech anatomy, or
      • an underlying deterministic (chaotic) system whose output merely appears random.
  • Jitter and shimmer are one component that gives vowels their naturalness.
      • In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
      • Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
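Jitter and shimmer can be mimicked with a simple random perturbation of an otherwise ideal pulse train. The sketch below is an assumption-laden illustration: the uniform perturbation model, the perturbation measure (mean absolute cycle-to-cycle difference divided by the mean, one of several measures in use), and all names and parameter values are chosen for this example and are not taken from the slides.

```python
import random


def jittered_pulse_train(n_pulses, mean_period, jitter, shimmer, seed=0):
    """Return (pulse_times, amplitudes) for a pulse train whose period and
    amplitude vary randomly from cycle to cycle.

    jitter and shimmer are fractional deviations, e.g. 0.02 for +/-2 %.
    """
    rng = random.Random(seed)
    t, times, amps = 0.0, [], []
    for _ in range(n_pulses):
        times.append(t)
        amps.append(1.0 + rng.uniform(-shimmer, shimmer))
        t += mean_period * (1.0 + rng.uniform(-jitter, jitter))
    return times, amps


def relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Perturbed voice: 10-ms mean pitch period with 2 % jitter and 5 % shimmer (assumed).
times, amps = jittered_pulse_train(200, mean_period=0.01, jitter=0.02, shimmer=0.05)
periods = [b - a for a, b in zip(times, times[1:])]
jitter_measured = relative_perturbation(periods)
shimmer_measured = relative_perturbation(amps)

# "Machine-like" reference: fixed period and fixed amplitude.
steady_times, steady_amps = jittered_pulse_train(200, 0.01, jitter=0.0, shimmer=0.0)
steady_periods = [b - a for a, b in zip(steady_times, steady_times[1:])]
```

Setting both parameters to zero gives the monotone, fixed-amplitude case, for which both perturbation measures are zero up to rounding.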

Anatomy and Physiology of Speech Production

  • States of the vocal folds: breathing, voicing, and unvoicing.
  • Unvoicing:
      • Turbulence at the vocal folds produces aspiration. Example: "he" (whispered sounds).
  • Aspiration also occurs with voiced sounds (breathy voice):
      • part of the vocal folds vibrates while another part is nearly fixed.

Anatomy and Physiology of Speech Production

  • Other forms of atypical vocal fold movement:
      • Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
      • Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch.
          • Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse.
          • Results from coupling of the false vocal folds with the true vocal folds.
      • Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Examples of atypical voice types

Vocal Tract

  • Comprises the oral cavity from the larynx to the lips, together with the nasal passage, which is coupled to the oral tract by way of the velum.
  • The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
  • The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
  • The purpose of the vocal tract is to:
      • spectrally "color" the source, and
      • generate new sources for sound production.

Spectral Shaping

  • Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
  • The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
  • The formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

  • The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
  • For a time-invariant all-pole linear system model of the vocal tract, with a pole at z_0 = r_0 e^{jω_0} that corresponds approximately to a vocal tract formant:
      • The frequency of the formant is ω_0.
      • The bandwidth depends on the distance of the pole from the unit circle (r_0).
  • Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

    $$ H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} $$
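The link between pole position and formant can be made concrete with a small calculation. The sketch below converts a pole at r_0 e^{jω_0} into a formant frequency in Hz and uses the widely quoted approximation B ≈ -(f_s/π) ln r_0 for the 3-dB bandwidth; that approximation, the 8 kHz sampling rate, the 500 Hz resonance, and the function names are assumptions for this example and are not stated on the slide.

```python
import cmath
import math


def pole_to_formant(r0, omega0, fs):
    """Formant frequency and approximate 3-dB bandwidth (both in Hz) of a
    pole pair at r0 * exp(+/- j*omega0), for sampling rate fs."""
    freq = omega0 * fs / (2 * math.pi)
    bandwidth = -math.log(r0) * fs / math.pi
    return freq, bandwidth


def all_pole_magnitude(poles, omega, gain=1.0):
    """|H(e^{j omega})| for H(z) = gain / prod_k (1 - c_k z^-1)(1 - c_k* z^-1)."""
    z_inv = cmath.exp(-1j * omega)
    denom = 1.0
    for c in poles:
        denom *= (1 - c * z_inv) * (1 - c.conjugate() * z_inv)
    return gain / abs(denom)


fs = 8000
omega0 = 2 * math.pi * 500 / fs             # resonance placed at 500 Hz (assumed)
narrow = 0.98 * cmath.exp(1j * omega0)      # pole close to the unit circle
broad = 0.90 * cmath.exp(1j * omega0)       # pole further inside the unit circle

f_n, bw_n = pole_to_formant(0.98, omega0, fs)
f_b, bw_b = pole_to_formant(0.90, omega0, fs)
peak_narrow = all_pole_magnitude([narrow], omega0)
peak_broad = all_pole_magnitude([broad], omega0)
```

Moving the pole toward the unit circle leaves the formant frequency unchanged but narrows the bandwidth and raises the spectral peak.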


Spectral Shaping

  • The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
  • In general, the formant frequencies decrease as the vocal tract length increases:
      • male speakers tend to have lower formants than female speakers, and
      • female speakers have lower formants than children.
  • Under the assumptions that the vocal tract is linear and time-invariant, and that the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
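The dependence of formants on vocal tract length can be illustrated with the classic uniform lossless tube model, closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / (4L). This model is not part of the slide and is brought in here only as a simplifying assumption; the speed of sound (340 m/s), the 14-cm comparison length, and the function name are likewise illustrative. The 17-cm length is the adult male average quoted earlier in the deck.

```python
def uniform_tube_formants(length_m, n_formants=3, c=340.0):
    """Resonances (Hz) of a uniform lossless tube closed at one end and
    open at the other: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]


adult_male = uniform_tube_formants(0.17)   # 17-cm vocal tract
shorter = uniform_tube_formants(0.14)      # shorter tract (illustrative value)
```

For the 17-cm tube the model gives resonances near 500, 1500, and 2500 Hz, and every resonance moves up when the tube is shortened, in line with the trend stated on the slide.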

Vowels


Example 3.2

  • Consider a periodic glottal flow source of the form

    $$ u[n] = g[n] * p[n] $$

    where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P.
  • When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is

    $$ x[n] = h[n] * (g[n] * p[n]) $$

  • A window w[n, τ] centered at time τ is applied to the vocal tract output to obtain the speech segment

    $$ x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\} $$

  • Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.

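Example 3.2

$$X(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, where the envelope consisted only of the glottal contribution).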

Example 3.2: Figure 3.11

Example 3.2

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by
  the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  the manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
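The sparse sampling of the spectral envelope by the source harmonics can be illustrated with a few lines of arithmetic. This is a minimal Python sketch under stated assumptions: the pitch values (100 Hz and 400 Hz), the 4 kHz analysis band, and the 550 Hz formant are hypothetical numbers chosen for illustration, and the function names are not from the slides.

```python
def harmonic_frequencies(f0_hz, max_hz):
    """Harmonics k * f0 (k = 1, 2, ...) that fall at or below max_hz."""
    count = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


def nearest_harmonic_gap(f0_hz, formant_hz, max_hz=4000.0):
    """Distance in Hz from a formant to the closest source harmonic."""
    return min(abs(h - formant_hz) for h in harmonic_frequencies(f0_hz, max_hz))


for f0 in (100.0, 400.0):  # hypothetical low- and high-pitched voices
    n = len(harmonic_frequencies(f0, 4000.0))
    gap = nearest_harmonic_gap(f0, 550.0)  # hypothetical formant at 550 Hz
    print(f"F0 = {f0:.0f} Hz: {n} envelope samples below 4 kHz, "
          f"nearest harmonic is {gap:.0f} Hz from the formant")
```

The higher the pitch, the fewer points at which the envelope is observed, so a formant peak can fall between harmonics and be missed.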

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency.
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the sparse sampling of the spectral envelope by the harmonics can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung.
As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  build-up of pressure behind the palate, and
  abrupt release of the pressure.

Source Generation: Plosives, "Drop"

Figure labels: VOT (voice onset time), aspiration.

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise source was idealized as having a flat spectrum. In reality, these sources themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source

Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  Fricatives: "zebra" vs. "sheba".
  Plosives: "bin" vs. "pin".

Categorization of Sound by Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example, the Hamming window: $w[n, \tau] = 0.54 - 0.46\cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$, and zero otherwise.
The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window center at time $\tau$.

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
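The windowing, short-time Fourier transform, and squared-magnitude steps described in these slides can be sketched as follows. This is a minimal, deliberately slow Python sketch using a direct sum instead of an FFT; the function names, the hop size, the frequency grid of `n_fft` points, and the test sinusoid are assumptions made for illustration, and the window is positioned by its first sample `tau`.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window 0.54 - 0.46 * cos(2 * pi * m / (n_w - 1)), m = 0..n_w-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (n_w - 1))
            for m in range(n_w)]


def stft(x, n_w, hop, n_fft):
    """Short-time Fourier transform at omega_i = 2 * pi * i / n_fft, i = 0..n_fft/2.

    Each frame starts at sample tau and the window advances by `hop` samples.
    """
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        frame = []
        for i in range(n_fft // 2 + 1):
            omega = 2.0 * math.pi * i / n_fft
            frame.append(sum(w[n - tau] * x[n] * cmath.exp(-1j * omega * n)
                             for n in range(tau, tau + n_w)))
        frames.append(frame)
    return frames


def spectrogram(x, n_w, hop, n_fft):
    """Squared magnitude of the STFT, one row per window position."""
    return [[abs(v) ** 2 for v in frame] for frame in stft(x, n_w, hop, n_fft)]


# Hypothetical test signal: a sinusoid centered on frequency bin 8 of 64.
signal = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
S = spectrogram(signal, n_w=64, hop=32, n_fft=64)
print(len(S), "frames of", len(S[0]), "frequency bins")
```

A practical implementation would use an FFT; the direct sum is kept here because it follows the definition of $X(\omega, \tau)$ term by term.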

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that
  the main lobes of the shifted window Fourier transforms are non-overlapping, and
  the corresponding transform side lobes are negligible,
the equation on the previous slide yields the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 (next slide) compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
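The 20-ms and 4-ms window choices can be related to the pitch period with a short calculation. This is a minimal Python sketch under stated assumptions: the 16 kHz sampling rate and the 125 Hz pitch are hypothetical values, the narrowband and wideband thresholds simply restate the rules of thumb from the slides (at least two pitch periods, less than one pitch period), and the main-lobe width $4 f_s / N_w$ is a standard approximation for the Hamming window rather than something the slides state.

```python
def window_samples(duration_ms, fs_hz):
    """Window length in samples for a given duration and sampling rate."""
    return round(duration_ms * fs_hz / 1000.0)


def window_in_pitch_periods(duration_ms, f0_hz):
    """Number of pitch periods covered by the window."""
    return duration_ms * f0_hz / 1000.0


def spectrogram_type(duration_ms, f0_hz):
    """Classify a window using the rules of thumb from the slides."""
    periods = window_in_pitch_periods(duration_ms, f0_hz)
    if periods >= 2.0:
        return "narrowband"
    if periods < 1.0:
        return "wideband"
    return "intermediate"


def hamming_main_lobe_hz(n_w, fs_hz):
    """Approximate null-to-null main-lobe width of an n_w-point Hamming window."""
    return 4.0 * fs_hz / n_w


fs, f0 = 16000.0, 125.0  # hypothetical sampling rate and pitch
for ms in (20.0, 4.0):
    n_w = window_samples(ms, fs)
    print(ms, "ms:", n_w, "samples,", spectrogram_type(ms, f0),
          "main lobe about", hamming_main_lobe_hz(n_w, fs), "Hz")
```

With these assumed values the long window has a main lobe comparable to the harmonic spacing, while the short window has a main lobe several harmonics wide, which is why it smears the harmonic lines.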

Figure 3.15

Figure 3.16

MATLAB

Spectrogram (time [s] from 0 to 3 s, frequency [Hz] from 0 to 8000 Hz) of the utterance "She had your dark suit in greasy wash water all year".

MATLAB

Waveform (normalized magnitude vs. time [s]) and spectrogram (frequency [Hz] vs. time [s]) of the same utterance, annotated with its phone and word labels.

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
  the place of the tongue hump along the oral tract, and
  the degree of the constriction of the hump.
  The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of
  vowels,
  consonants,
  diphthongs,
  affricates, and
  semi-vowels.
This classification is illustrated in Figure 3.17 (next slide).

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open.
  Tongue position and height: front, central, or back along the palate.
  Constriction: partial or complete.
  Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian.
  141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  Articulatory differences among speakers are one cause of allophonic variations.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System: the location of the constriction formed by the tongue or lips determines which sound is produced:
  at the back, center, or front of the oral tract, or
  at the teeth or lips.
Spectrogram: noise-like; energy is concentrated at higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of the linear time-invariant vocal tract excited by the periodic signal $u[n]$.
The noise source component is the turbulent airflow velocity at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity that has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

Example 3.4

The simplified model assumes that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
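The simplified voiced-fricative model can be sketched directly from its equations. This is a minimal Python sketch under stated assumptions: the glottal pulse `g`, the impulse responses `h` and `h_f`, the pitch period, and the Gaussian noise generator are hypothetical choices for illustration, and none of the function names come from the slides.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def impulse_train(length, period):
    """p[n]: unit samples every `period` samples, starting at n = 0."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, h_f, period, length, seed=0):
    """Return (u, x_g, x_q, x) for the additive two-source model."""
    rng = random.Random(seed)
    u = convolve(g, impulse_train(length, period))[:length]  # u = g * p
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]         # white noise q[n]
    x_g = convolve(h, u)[:length]                            # periodic branch
    modulated = [qn * un for qn, un in zip(q, u)]            # q[n] u[n]
    x_q = convolve(h_f, modulated)[:length]                  # noise branch
    x = [a + b for a, b in zip(x_g, x_q)]
    return u, x_g, x_q, x


# Hypothetical pulse shape and filters.
u, x_g, x_q, x = voiced_fricative(
    g=[0.0, 0.5, 1.0, 0.5], h=[1.0, 0.6, 0.2], h_f=[1.0, -0.5],
    period=8, length=32)
print(len(x), "output samples")
```

Because the noise is multiplied by $u[n]$, the noise branch is silent wherever the glottal flow is zero, so the noise appears in bursts synchronized with the pitch period.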

Elements of a Language: Fricatives

Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum.
  Voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
  This implies that the vocal tract output cannot be obtained by the convolution operator.
  The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.
  $h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

where we have assumed that the two outputs can be linearly combined.
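The generalized convolution with a time-varying impulse response can be sketched as a double loop. This is a minimal Python sketch under stated assumptions: the responses are passed as hypothetical callables `h(n, m)`, the sums are restricted to causal lags `m = 0..max_lag`, and the example response (a decay whose gain grows with `n`) is invented purely for illustration.

```python
def time_varying_response(h, u, max_lag):
    """x[n] = sum over m of h(n, m) * u[n - m], for causal lags m = 0..max_lag."""
    x = []
    for n in range(len(u)):
        x.append(sum(h(n, m) * u[n - m] for m in range(min(n, max_lag) + 1)))
    return x


def voiced_plosive(h, h_f, u, max_lag):
    """Add the periodic branch and the burst branch of the simple model."""
    burst = [1.0] + [0.0] * (len(u) - 1)  # idealized burst: impulse at n = 0
    periodic = time_varying_response(h, u, max_lag)
    transient = time_varying_response(h_f, burst, max_lag)
    return [a + b for a, b in zip(periodic, transient)]


def example_h(n, m):
    """Hypothetical response: exponential decay in m with a gain that grows with n."""
    return (n + 1) * 0.5 ** m


print(time_varying_response(example_h, [1.0, 0.0, 0.0, 0.0], max_lag=3))
```

When `h(n, m)` does not depend on `n`, the double loop reduces to ordinary convolution, which is the time-invariant case used in the earlier examples.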

Elements of a Language: Transitional Speech Sounds

Diphthongs
  Have a vowel-like nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Are characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds.
  Glides (/w/ as in "we" and /y/ as in "you"), and
  Liquids (/r/ as in "read" and /l/ as in "let").
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs: they consist of consonant plosive-fricative combinations.
Compared to fricatives, the affricates have
  a fricative portion preceded by a complete constriction of the oral cavity,
  formed at the same place as for the plosive.
Examples:
  /tS/ as in "chew" (unvoiced).
  /J/ as in "just" (voiced).

Coarticulation

The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  intonation (change in pitch),
  amplitude/energy (loudness), and
  timing (articulation rate or rhythm).
These rules are followed to convey different
  meaning,
  stress, and
  emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Anatomy and Physiology of Speech Production

Introduction


Anatomy and Physiology of Speech Production

The anatomy of speech production is shown in Figure 3.2.

Lungs

Inhalation and exhalation of air.
Connected to the vocal tract through the trachea ("windpipe"), a ~12-cm-long and ~1.5-2-cm-diameter pipe, and the epiglottis.
During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production.
  The duration of exhalation becomes roughly equal to the length of the sentence or phrase.
  Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.

Anatomy and Physiology of Speech Production

Larynx

A complicated system of cartilages, flesh, muscles, and ligaments.
Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
The vocal folds are
  ~15 mm long in men, and
  ~13 mm long in women.

Larynx

Anatomy and Physiology of Speech Production

Three primary states of the vocal folds:
  Breathing: the arytenoid cartilages are held outward.
  Voiced: the arytenoid cartilages are held close together.
  Unvoiced: the arytenoid cartilages are held outward or partially closed.
The complex motion of the vocal folds is illustrated in Figure 3.4.
The nonlinear two-mass model of Flanagan et al. is shown in Figure 3.5.

Dictionary: arytenoid (ar·y·te·noid), adjective. Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, "ladle". Date: circa 1751. 1. Relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. 2. Relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Noun: arytenoid.

Anatomy and Physiology of Speech Production: Flanagan et al. model

Anatomy and Physiology of Speech Production If one were to measure the airflow velocity at the glottis as a

function of time obtained waveform will be approximately similar to that of Figure 36 Closed phase folds are closed and no flow occurs Open phase folds are open and the flow increases up to a

maximum Return phase Time interval from the maximum air flow until the

glottal closure Specific flow shape can change with

Speaker Speaking style And specific speech sound

Glottal air-flow is referred to glottal flow

Time duration of one glottal cycle is referred to as the pitch period

Reciprocal of pitch period is referred to as pitch also as fundamental frequency
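The reciprocal relation between pitch period and pitch can be sketched in a few lines. This is only an illustration: the function name, the 8 kHz sampling rate, and the 80-sample period are assumptions chosen for the example, not values from the slides.

```python
def pitch_hz(pitch_period_samples: int, sample_rate_hz: float) -> float:
    """Pitch (fundamental frequency) in Hz as the reciprocal of the pitch period."""
    if pitch_period_samples <= 0:
        raise ValueError("pitch period must be positive")
    period_s = pitch_period_samples / sample_rate_hz  # pitch period in seconds
    return 1.0 / period_s


# Hypothetical values: an 80-sample glottal cycle at 8 kHz lasts 10 ms.
f0 = pitch_hz(80, 8000.0)
print(f0)
```

A shorter glottal cycle gives a higher pitch, which is the relation used in the examples that follow.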


Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, $\circledast$ denotes convolution in frequency, and $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum

Figure 3.7

Example 3.1 (cont.)

• A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
• The first harmonic is also the fundamental frequency.
• At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
• The magnitude of the spectral shaping function (i.e., of the glottal flow), $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
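The inverse relation between pitch period and harmonic spacing can be checked numerically. The sketch below is a simplification of Example 3.1: it takes the DFT of the impulse train $p[n]$ alone (no glottal pulse shape and no tapered window), and the helper names and the 64-point length are assumptions made for illustration.

```python
import cmath


def dft_magnitudes(x):
    """Magnitudes of the N-point DFT of a real sequence (direct O(N^2) sum)."""
    n_pts = len(x)
    mags = []
    for k in range(n_pts):
        acc = sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts)
                  for n in range(n_pts))
        mags.append(abs(acc))
    return mags


def harmonic_bins(period, n_pts, tol=1e-6):
    """DFT bins that carry energy for an impulse train with the given period."""
    p = [1.0 if n % period == 0 else 0.0 for n in range(n_pts)]
    return [k for k, m in enumerate(dft_magnitudes(p)) if m > tol]


N = 64
print(harmonic_bins(16, N))  # longer pitch period: harmonics every N/16 bins
print(harmonic_bins(8, N))   # shorter pitch period: harmonics every N/8 bins
```

Halving the pitch period doubles the spacing between the non-zero bins, which mirrors $\omega_k = (2\pi/P)k$.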

Anatomy and Physiology of Speech Production

The Fourier transform of a periodic glottal waveform is characterized by harmonics.
• Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
• The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
• The pitch period can vary "randomly" over successive periods: pitch "jitter".
• The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".

These variations are perhaps due to:
• Time-varying characteristics of the vocal tract and vocal folds,
• Nonlinear behavior in the speech anatomy, or
• An underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are among the components that give vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
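One simple way to put numbers on jitter and shimmer is the mean absolute cycle-to-cycle difference divided by the mean value. This is a common convention and an assumption here, not a definition given in the slides; the per-cycle measurements below are made-up illustrative values.

```python
def relative_variation(values):
    """Mean absolute difference between consecutive cycles, divided by the mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical per-cycle measurements for a sustained vowel.
periods_ms = [10.0, 10.2, 9.9, 10.1, 9.8]      # pitch period of each cycle
peak_amps = [1.00, 0.95, 1.05, 0.98, 1.02]     # peak flow amplitude of each cycle

jitter = relative_variation(periods_ms)    # period-to-period variation
shimmer = relative_variation(peak_amps)    # amplitude-to-amplitude variation
print(jitter, shimmer)
```

A perfectly periodic, fixed-amplitude source gives zero for both measures, which corresponds to the machine-like sound mentioned above.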

Anatomy and Physiology of Speech Production: States of the Vocal Folds

• Breathing
• Voicing
• Unvoicing
  • Turbulence at the vocal folds: aspiration. Example: "he". Also whispered sounds.
  • Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates and part is nearly fixed.

Anatomy and Physiology of Speech Production: Other forms of atypical vocal fold movement

• Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
• Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch.
  • Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse.
  • Results from coupling of the false vocal folds with the true vocal folds.
• Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Examples of atypical voice types

Vocal Tract

• Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
• The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
• The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
• The purpose of the vocal tract is to:
  • Spectrally "color" the source, and
  • Generate new sources for sound production.

Spectral Shaping

• Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
• The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
• Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

• The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
• Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  • The frequency of the formant is $\omega_0$.
  • The bandwidth depends on the distance of the pole from the unit circle (determined by $r_0$).
• Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
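The link between a pole location and the corresponding formant can be sketched as follows. The bandwidth expression $-\ln(r_0)\,f_s/\pi$ is a standard approximation for a pole close to the unit circle and is an assumption here, as are the function names, the 8 kHz sampling rate, and the example pole.

```python
import cmath
import math


def pole_to_formant(pole: complex, sample_rate_hz: float):
    """Map a pole z0 = r0 * exp(j*w0) to (formant frequency, approximate bandwidth) in Hz."""
    r0 = abs(pole)
    w0 = abs(cmath.phase(pole))                           # pole angle in rad/sample
    freq_hz = w0 * sample_rate_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * sample_rate_hz / math.pi  # valid for r0 near 1
    return freq_hz, bandwidth_hz


def resonator_gain(pole: complex, w: float) -> float:
    """|H| on the unit circle for one conjugate pole pair with A = 1."""
    z_inv = cmath.exp(-1j * w)
    return 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))


fs = 8000.0                                   # hypothetical sampling rate
w0 = 2.0 * math.pi * 500.0 / fs               # pole angle for a 500 Hz formant
pole = 0.95 * cmath.exp(1j * w0)
print(pole_to_formant(pole, fs))
print(resonator_gain(pole, w0), resonator_gain(pole, 2.0 * w0))
```

Moving $r_0$ toward 1 narrows the bandwidth and sharpens the spectral peak near $\omega_0$.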

Spectral Shaping

• Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
• In general, the formant frequencies decrease as the vocal tract length increases:
  • Male speakers tend to have lower formants than female speakers.
  • Female speakers have lower formants than children.
• Under the assumption of vocal tract linearity and time invariance, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
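The convolution view of the source-filter relation can be illustrated with toy sequences. The glottal pulse shape, the impulse response, and the helper names below are hypothetical; only the structure (output = vocal tract impulse response convolved with the periodic glottal flow) follows the text.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def periodic_source(g, period, n_cycles):
    """u[n] = g[n] * p[n]: one glottal pulse repeated every `period` samples."""
    p = [1.0 if n % period == 0 else 0.0 for n in range(period * n_cycles)]
    return convolve(g, p)


g = [0.0, 0.5, 1.0, 0.5]   # toy glottal flow over one cycle (hypothetical shape)
h = [1.0, 0.6, 0.36]       # toy vocal tract impulse response (hypothetical)

u = periodic_source(g, period=8, n_cycles=3)   # glottal flow input
x = convolve(h, u)                             # vocal tract output
print(x[:12])
```

Because convolution is associative, filtering $u[n]$ with $h[n]$ gives the same result as convolving the impulse train with the combined response $g[n] * h[n]$.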

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained:

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.

Figure 3.11 illustrates that the spectral shaping of the window transforms at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, where it consisted only of the glottal contribution).

Example 3.2 (Figure 3.11)

Example 3.2 (cont.)

• The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
  • The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  • The manner in which formant tails add.
• Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
• A formant corresponds to a vocal tract pole (resonant frequency).
• Harmonics arise due to the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can perhaps be detrimental to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

• The nasal and oral components of the vocal tract are coupled by the velum.
• When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
• The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "nose"

Spectral Shaping: "mouse"

Spectral Shaping

• Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
• The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
• Two dominant effects characterize nasalization:
  • Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
  • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

• The previous section discussed the effect of the vocal tract shape on sound production.
• Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  • Build-up of pressure behind the palate, and
  • Abrupt release of pressure.

Source Generation: Plosives, "drop" (figure labels: VOT, aspiration)

Fricatives

Source Generation

• Another sound source is created when the tongue is very close to the palate (so that the airflow is not completely impeded). The constriction generates turbulence and thus a noise source (e.g., fricatives).
• As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source):
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.
• Note that the noise was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

• There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
• There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source

• Voiced: speech sounds generated with a periodic glottal source.
• Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  • Plosives: created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these subcategories do not relate exclusively to unvoiced sounds. The vocal folds can be vibrating simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
• Fricatives: "zebra" vs. "sheba".
• Plosives: "bin" vs. "pin".

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

• The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
• In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

• Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
• This window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
• Example (Hamming window):

$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n \le N_w - 1$$

• The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
• The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
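A single frame of the short-time Fourier transform can be sketched with the standard library alone. The sketch makes two simplifying assumptions that differ from the slide's notation: the window starts at sample `tau` rather than being centered there, and the phase is referenced to the start of the frame, so only the magnitude matches $|X(\omega, \tau)|$. Function names and the test signal are hypothetical.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46*cos(2*pi*n/(n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1))
            for n in range(n_w)]


def stft_frame(x, tau, n_w, n_freqs):
    """Samples of the Fourier transform of the Hamming-windowed segment x[tau : tau + n_w]."""
    w = hamming(n_w)
    seg = [w[m] * x[tau + m] for m in range(n_w)]
    return [sum(seg[m] * cmath.exp(-2j * math.pi * k * m / n_freqs)
                for m in range(n_w))
            for k in range(n_freqs)]


# Test signal: a cosine whose frequency falls on analysis bin 8 of 64.
x = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
frame = stft_frame(x, tau=32, n_w=64, n_freqs=64)
mags = [abs(v) for v in frame]
peak_bin = max(range(33), key=lambda k: mags[k])
print(peak_bin)
```

Repeating `stft_frame` at successive values of `tau` (one per frame interval) and squaring the magnitudes yields the columns of a spectrogram.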

Spectrographic Analysis of Speech

• The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

• $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
• For each window position $\tau$, one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
• This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  • Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$ and $\omega_k = (2\pi/P)k$, with $2\pi/P$ the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
• Uses a "long" window with a duration of typically at least two pitch periods.
• Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping, and that the corresponding transform side lobes are negligible, the spectrogram expression $S(\omega, \tau) = \frac{1}{P^2} \left| \sum_k \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$ reduces to the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

• Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
• The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

• The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
• Shortening the window widens its Fourier transform (recall the uncertainty principle).
• The wider window transforms at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
• From a temporal perspective, since the window length is less than a pitch period, the window essentially "sees" pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
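The energy under a short sliding window can be sketched as below. A rectangular window and a toy decaying pulse train are assumed for simplicity; the function name, window length, and hop size are illustrative choices, not values from the slides.

```python
def short_time_energy(x, n_w, hop):
    """Energy of x under a rectangular window of n_w samples, advanced by `hop` samples."""
    return [sum(v * v for v in x[t:t + n_w])
            for t in range(0, len(x) - n_w + 1, hop)]


# Toy "voiced" waveform: a decaying response restarted every 8 samples (pitch period = 8).
x = [0.5 ** (n % 8) for n in range(32)]

# Window shorter than one pitch period, so it alternately sees strong and weak regions.
energy = short_time_energy(x, n_w=4, hop=4)
print(energy)
```

The energy rises and falls once per pitch period, which is the fluctuation that scales the wideband spectrogram from frame to frame.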

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

• Shows the formants of the vocal tract in frequency.
• Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
• Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
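The two window durations quoted above can be compared with a short calculation. The 16 kHz sampling rate and 100 Hz pitch are assumed values for illustration, and the main-lobe width uses the common rule of thumb of roughly $8\pi/N_w$ rad/sample (about $4 f_s / N_w$ Hz, null to null) for a Hamming window.

```python
def describe_window(duration_ms, sample_rate_hz, pitch_hz):
    """Length, pitch periods covered, and approximate main-lobe width of a Hamming window."""
    n_w = int(round(duration_ms * 1e-3 * sample_rate_hz))
    return {
        "samples": n_w,
        "pitch_periods_covered": duration_ms * 1e-3 * pitch_hz,
        "approx_mainlobe_hz": 4.0 * sample_rate_hz / n_w,  # rule of thumb, null to null
    }


fs = 16000.0   # hypothetical sampling rate
f0 = 100.0     # hypothetical pitch (10 ms pitch period)

narrowband = describe_window(20.0, fs, f0)   # long window
wideband = describe_window(4.0, fs, f0)      # short window
print(narrowband)
print(wideband)
```

Under these assumptions the 20-ms window spans two pitch periods, while the 4-ms window spans less than one and has a main lobe several harmonics wide, consistent with the narrowband and wideband behavior described above.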

Figure 3.15

Figure 3.16

MATLAB figures: waveform (normalized magnitude vs. time [s]) and spectrogram (frequency [Hz], 0 to 8000, vs. time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone-level and word-level labels.

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
  • The place of the tongue hump along the oral tract, and
  • The degree of constriction of the hump.
  • The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

• Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
• Different languages contain different phoneme sets:
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.
• If the first two perspectives (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
• If the last two perspectives (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of:
• Vowels
• Consonants
• Diphthongs
• Affricates
• Semi-vowels

This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
• Vocal fold state: vibrating or open.
• Tongue position and height: front, central, or back along the palate.
• Constriction: partial or complete.
• Velum state: nasal sound or not a nasal sound.

Elements of a Language

• In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian.
  • 141 in the "click" language of Khoisan.
• The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
• A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
• The articulatory properties are influenced by:
  • Adjacent phonemes,
  • Speaking rate,
  • Emphasis in speaking, and
  • The time-varying nature of the articulators.
• The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
• Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

• Source: quasi-periodic.
  • Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
• System: each vowel phoneme corresponds to a different vocal tract configuration.
• Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
• Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
• Articulatory differences among speakers are one cause of allophonic variations.
• The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ among speakers; therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

• Source: quasi-periodic airflow puffs from the vibrating vocal folds.
• System:
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
• Spectrogram:
  • Dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

• Source:
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
• System: the location of the constriction made by the tongue or lips determines which sound is produced:
  • Back, center, or front of the oral tract, as well as
  • The teeth or lips.
• Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$.

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

In the simplified model, the results of the two airflow sources are assumed to add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
• $u[n]$ is modified by the oral cavity.
• $x_q[n]$ can be influenced by the back cavity.
• Sources of nonlinear effects (distributed sources due to traveling vortices).
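The simplified voiced-fricative model can be sketched numerically as follows. The pulse shape, both impulse responses, the Gaussian noise, and the seed are hypothetical choices; only the structure $x[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$ comes from the example.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, h, h_f, period, n_cycles, seed=0):
    """Return (x, x_g, x_q) for the additive periodic-plus-modulated-noise model."""
    rng = random.Random(seed)
    p = [1.0 if n % period == 0 else 0.0 for n in range(period * n_cycles)]
    u = convolve(g, p)                                   # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in u]                 # white noise source (assumed Gaussian)
    x_g = convolve(h, u)                                 # periodic component
    x_q = convolve(h_f, [qn * un for qn, un in zip(q, u)])  # noise modulated by u[n]
    n_out = max(len(x_g), len(x_q))
    x_g += [0.0] * (n_out - len(x_g))                    # zero-pad to a common length
    x_q += [0.0] * (n_out - len(x_q))
    return [a + b for a, b in zip(x_g, x_q)], x_g, x_q


g = [0.0, 0.5, 1.0, 0.5]    # toy glottal pulse (hypothetical)
h = [1.0, 0.6, 0.36]        # toy full vocal tract response (hypothetical)
h_f = [1.0, -0.5]           # toy front-cavity response (hypothetical)
x, x_g, x_q = voiced_fricative(g, h, h_f, period=16, n_cycles=3)
print(x[:8])
```

In this sketch the noise component is non-zero only where the glottal flow is non-zero, so the noise appears in bursts synchronized with the pitch period.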

Elements of a Language: Fricatives

• Spectrogram:
  • Unvoiced fricatives are characterized by a "noisy" spectrum, while
  • Voiced fricatives often show both noise and harmonics.
• Waveform:
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

• Forms a class of its own under the general category of consonants.
• Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

• System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
• Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
• With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.
• Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
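The generalized (time-varying) convolution can be sketched directly from its definition. The one-pole response whose decay factor drifts over time is a hypothetical stand-in for a changing vocal tract, and the function names and the truncation `m_max` are assumptions made to keep the sum finite.

```python
def time_varying_output(h, u, n_out, m_max):
    """x[n] = sum over m of h(n, m) * u[n - m], for n = 0 .. n_out - 1.

    h(n, m) is the response at time n to a unit sample applied at time n - m.
    """
    x = []
    for n in range(n_out):
        acc = 0.0
        for m in range(m_max + 1):
            if 0 <= n - m < len(u):
                acc += h(n, m) * u[n - m]
        x.append(acc)
    return x


def h_tv(n, m):
    """Hypothetical time-varying response: decay factor drifts from 0.5 toward 0.9."""
    a = 0.5 + 0.4 * min(n, 20) / 20
    return a ** m


burst = [1.0]                                   # idealized burst: unit impulse at n = 0
periodic = [1.0 if n % 8 == 0 else 0.0 for n in range(24)]  # toy periodic source

x_burst = time_varying_output(h_tv, burst, n_out=24, m_max=23)
x_voiced = time_varying_output(h_tv, periodic, n_out=24, m_max=23)
x = [a + b for a, b in zip(x_voiced, x_burst)]  # the two outputs are linearly combined
print(x[:5])
```

For the impulse input the sum collapses to a single term, `h(n, n)`, and when `h` does not depend on `n` the routine reduces to ordinary convolution.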

Elements of a Language: Transitional Speech Sounds

Diphthongs
• Vowel-like in nature, with vibrating vocal folds.
• Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
• Characterized by movement from one vowel target to another.
• Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds:
• Glides (/w/ as in "we" and /y/ as in "you"), and
• Liquids (/r/ as in "read" and /l/ as in "let").

Glides, compared to diphthongs, have:
• Greater constriction of the oral tract during the transition, and
• Greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that an affricate has a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, for example "horse" vs. "horseshoe" and "sweep" vs. "seep".
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29: Prosody

Figure 3.30: Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping: "Nose"
  • Spectral Shaping: "Mouse"
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation: Plosives, "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation: Fricatives, "NASA"
  • Source Generation (3)
  • Categorization of Sound by Source
  • Categorization of Sound by Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language: Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language: Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language: Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language: Fricatives (2)
  • Elements of a Language: Whisper
  • Figure 3.24: Fricatives
  • Figure 3.23
  • Elements of a Language: Plosives
  • Elements of a Language: Plosives (2)
  • Elements of a Language: Plosives (3)
  • Elements of a Language: Plosives (4)
  • Elements of a Language: Transitional Speech Sounds
  • Elements of a Language: Transitional Speech Sounds (2)
  • Figure 3.28: Liquids and Glides
  • Elements of a Language: Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody: The Melody of Speech
  • Figure 3.29: Prosody
  • Figure 3.30: Global Coarticulation
Anatomy and Physiology of Speech Production

The anatomy of speech production is shown in Figure 3.2.

Lungs
  • Inhalation and exhalation of air.
  • Connected to the vocal tract through the trachea ("windpipe") and the epiglottis; the trachea is a ~12-cm-long, ~1.5-2-cm-diameter pipe.
  • During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production:
      • The duration of exhalation becomes roughly equal to the length of a sentence or phrase.
      • Lung air pressure during this time is maintained at a constant level slightly above atmospheric pressure.

Anatomy and Physiology of Speech Production

Larynx
  • A complicated system of cartilages, flesh, muscles, and ligaments.
  • Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
  • The vocal folds are ~15 mm long in men and ~13 mm long in women.

Anatomy and Physiology of Speech Production

Three primary states of the vocal folds:
  • Breathing: the arytenoid cartilages are held outward.
  • Voiced: the arytenoid cartilages are held close together.
  • Unvoiced: the arytenoid cartilages are held outward or partially closed.

The complex motion of the vocal folds is illustrated in Figure 3.4, and the nonlinear two-mass model of Flanagan et al. in Figure 3.5.

Dictionary: arytenoid (ar·y·te·noid). Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, ladle. Date: circa 1751. (1) Relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. (2) Relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Noun form: arytenoid.

Anatomy and Physiology of Speech Production: the Flanagan et al. model

Anatomy and Physiology of Speech Production

If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:
  • Closed phase: the folds are closed and no flow occurs.
  • Open phase: the folds are open and the flow increases up to a maximum.
  • Return phase: the time interval from the maximum airflow until the glottal closure.

The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

  • The glottal airflow velocity is referred to as the glottal flow.
  • The time duration of one glottal cycle is referred to as the pitch period.
  • The reciprocal of the pitch period is referred to as the pitch, or the fundamental frequency.

Anatomy and Physiology of Speech Production

Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k)\, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of g[n], $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$, with lower surrounding side lobes.

Figure 3.7

Effect of the harmonics of the glottal waveform on the spectrum:
  • A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
  • The first harmonic is also the fundamental frequency.
  • At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k, \tau)$ weighted by $G(\omega_k)$.
  • The magnitude of the spectral shaping function, i.e. of the glottal flow transform, $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
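The periodic source model of Example 3.1 and the harmonic spacing rule can be sketched in a few lines. This is only an illustration: the one-cycle pulse `g`, the period of 8 samples, and the 8 kHz sampling rate are made-up values, and the helper names are hypothetical.

```python
def impulse_train(period, length):
    """p[n]: unit samples every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def harmonic_frequencies_hz(pitch_period_samples, fs_hz, count):
    """Harmonics at k * fs / P, the Hz equivalent of omega_k = (2*pi/P) * k."""
    f0 = fs_hz / pitch_period_samples
    return [k * f0 for k in range(1, count + 1)]


# Hypothetical one-cycle glottal pulse, repeated every P = 8 samples.
g = [0.0, 0.5, 1.0, 0.3]
P = 8
u = convolve(g, impulse_train(P, 32))  # u[n] = g[n] * p[n]

# Halving the pitch period doubles the spacing between harmonics.
long_period = harmonic_frequencies_hz(80, 8000, 3)
short_period = harmonic_frequencies_hz(40, 8000, 3)
```

Here `long_period` holds harmonics 100 Hz apart and `short_period` harmonics 200 Hz apart, which is the inverse relation between pitch period and harmonic spacing stated in the example.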

Anatomy and Physiology of Speech Production

The Fourier transform of the periodic glottal waveform is characterized by harmonics.
  • Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  • The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  • The pitch period can "randomly" vary over successive periods: pitch "jitter".
  • The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".

These variations are (perhaps) due to
  • Time-varying characteristics of the vocal tract and vocal folds,
  • Nonlinear behavior in the speech anatomy, or
  • An underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are among the components that give vowels their naturalness.
  • In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  • Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
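One simple way to picture jitter and shimmer is to perturb the spacing and the amplitude of the pulses in the ideal impulse train. The sketch below assumes uniformly distributed relative perturbations, which is an illustrative choice and not a model given in the slides; the function and parameter names are hypothetical.

```python
import random


def jittered_pulse_train(num_pulses, period, jitter, shimmer, seed=0):
    """Return pulse positions (samples) and amplitudes.

    jitter  : maximum relative deviation of each pitch period
    shimmer : maximum relative deviation of each pulse amplitude
    Both deviations are drawn uniformly (an assumption for illustration).
    """
    rng = random.Random(seed)
    positions, amplitudes = [], []
    t = 0.0
    for _ in range(num_pulses):
        positions.append(round(t))
        amplitudes.append(1.0 + rng.uniform(-shimmer, shimmer))
        t += period * (1.0 + rng.uniform(-jitter, jitter))
    return positions, amplitudes


# Ideal, machine-like source: fixed period and fixed amplitude.
ideal_pos, ideal_amp = jittered_pulse_train(4, 80, jitter=0.0, shimmer=0.0)

# More natural source: 2% jitter and 10% shimmer.
pos, amp = jittered_pulse_train(50, 80, jitter=0.02, shimmer=0.10)
```

Setting both parameters to zero recovers the strictly periodic train of Example 3.1.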

Anatomy and Physiology of Speech Production

States of the vocal folds: breathing, voicing, and unvoicing.
  • Unvoicing: turbulence at the vocal folds, called aspiration. Example: "he" (whispered sounds).
  • Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while another part is nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:
  • Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
  • Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, the result of coupling of the false vocal folds with the true vocal folds.
  • Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Examples of atypical voice types

Vocal Tract

  • Comprises the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
  • The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
  • The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
  • The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

Spectral Shaping

  • Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
  • The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
  • Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

  • The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
  • For a time-invariant, all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
      • The frequency of the formant is $\omega_0$.
      • The bandwidth depends on the distance of the pole from the unit circle (that is, on $r_0$).
  • Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
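The link between a pole and a formant can be made concrete with a small conversion routine. The frequency follows directly from the pole angle. The bandwidth formula B = -(fs/π) ln r used here is a common approximation for poles close to the unit circle; it is an assumption added for illustration and is not stated on the slides. The function names and the 8 kHz sampling rate are likewise hypothetical.

```python
import math


def pole_to_formant(r, theta, fs_hz):
    """Map a pole r * exp(j * theta) to (frequency, bandwidth) in Hz.

    Bandwidth uses the approximation B = -(fs / pi) * ln(r),
    reasonable for r close to 1 (assumption, see lead-in).
    """
    freq_hz = theta * fs_hz / (2.0 * math.pi)
    bw_hz = -math.log(r) * fs_hz / math.pi
    return freq_hz, bw_hz


def formant_to_pole(freq_hz, bw_hz, fs_hz):
    """Inverse mapping: formant frequency and bandwidth to pole radius and angle."""
    r = math.exp(-math.pi * bw_hz / fs_hz)
    theta = 2.0 * math.pi * freq_hz / fs_hz
    return r, theta


# A 500 Hz formant with an 80 Hz bandwidth at an 8 kHz sampling rate.
r0, w0 = formant_to_pole(500.0, 80.0, 8000.0)
f_back, b_back = pole_to_formant(r0, w0, 8000.0)
```

A narrower bandwidth moves `r0` closer to 1, which matches the statement that the bandwidth depends on the pole's distance from the unit circle.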

Spectral Shaping

  • Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
  • In general, the formant frequencies decrease as the vocal tract length increases:
      • Male speakers tend to have lower formants than female speakers.
      • Female speakers have lower formants than children.
  • Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
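The inverse relation between vocal tract length and formant frequencies can be illustrated with the textbook idealization of a uniform lossless tube, closed at the glottis and open at the lips, whose resonances are F_k = (2k - 1) c / (4L). This tube model and the speed of sound of 340 m/s are assumptions brought in for the example; the slides state only the qualitative trend.

```python
def uniform_tube_formants_hz(length_m, count, speed_of_sound=340.0):
    """Quarter-wavelength resonances of a uniform tube closed at one end.

    Idealized model (assumption): F_k = (2k - 1) * c / (4 * L).
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, count + 1)]


# 17 cm tract (the average adult male length quoted in this deck)
# versus a shorter, hypothetical 14 cm tract.
longer_tract = uniform_tube_formants_hz(0.17, 3)
shorter_tract = uniform_tube_formants_hz(0.14, 3)
```

Under these assumptions the 17 cm tube resonates near 500, 1500, and 2500 Hz, and every formant of the shorter tube is higher.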

Vowels


Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained:

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.

Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, consisting of glottal and vocal tract contributions (unlike Example 3.1, where the envelope consists only of the glottal contribution).

Example 3.2 (Figure 3.11)

The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  • the nature of the glottal flow waveform over a cycle, e.g. a gradual or abrupt closing, and
  • the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
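The sparse sampling of the spectral envelope by the harmonics can be demonstrated numerically. The sketch below uses a single two-pole resonator as a stand-in for the vocal tract and ignores the glottal contribution; the 500 Hz formant, 60 Hz bandwidth, 8 kHz sampling rate, and the two pitch values are all illustrative assumptions, as are the function names.

```python
import cmath
import math


def resonator_magnitude(freq_hz, formant_hz, bw_hz, fs_hz):
    """|H| of a two-pole resonator (one conjugate pole pair) at freq_hz."""
    r = math.exp(-math.pi * bw_hz / fs_hz)  # bandwidth approximation (assumption)
    pole = r * cmath.exp(1j * 2.0 * math.pi * formant_hz / fs_hz)
    z = cmath.exp(1j * 2.0 * math.pi * freq_hz / fs_hz)
    return 1.0 / abs((1 - pole / z) * (1 - pole.conjugate() / z))


def sample_envelope(f0_hz, formant_hz, bw_hz, fs_hz):
    """Sample the envelope at the harmonics k * f0 below fs / 2."""
    samples = []
    k = 1
    while k * f0_hz < fs_hz / 2:
        f = k * f0_hz
        samples.append((f, resonator_magnitude(f, formant_hz, bw_hz, fs_hz)))
        k += 1
    return samples


def strongest_harmonic(samples):
    """Frequency of the harmonic with the largest envelope sample."""
    return max(samples, key=lambda s: s[1])[0]


low_pitch = sample_envelope(100.0, 500.0, 60.0, 8000.0)
high_pitch = sample_envelope(400.0, 500.0, 60.0, 8000.0)
```

With the 100 Hz pitch a harmonic falls exactly on the 500 Hz formant, while with the 400 Hz pitch the strongest harmonic sits at 400 Hz, so the apparent peak is displaced from the true formant.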

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation. On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum.
  • When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
  • The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

  • Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
  • The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
  • Two dominant effects characterize nasalization:
      • Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
      • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

  • The previous section discussed the effect of the vocal tract shape on sound production.
  • Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive): a build-up of pressure behind the palate is followed by an abrupt release of pressure.

Source Generation: Plosives, "Drop"

[Figure: the word "drop", with the voice onset time (VOT) and the aspiration marked.]

Fricatives

Source Generation

  • Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
  • As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source).
      • There is no harmonic structure with these types of inputs.
      • The source spectrum is shaped at all frequencies by $|H(\omega)|$.
  • Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality these sources themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

  • There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source

  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source.

There is a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  • Plosives: created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these source types do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above also exist for voiced sounds. Examples:
  • Fricatives: "zebra" vs. "sheba".
  • Plosives: "bin" vs. "pin".

Categorization of Sound by Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

  • A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
  • A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

  • Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.
  • The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum. Example, the Hamming window:

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n - \tau \le N_w - 1$$

  • The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
  • The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window center at time $\tau$.
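A minimal sketch of the sliding-window analysis follows: a Hamming window is applied at successive frame positions and the squared magnitude of each frame's discrete Fourier transform is kept. The DFT is evaluated only at the bin frequencies 2πk/N_w, the test tone and all parameter values are made up for the example, and the function names are hypothetical.

```python
import cmath
import math


def hamming(num):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (num - 1)), 0 <= n <= num - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num - 1))
            for n in range(num)]


def spectrogram(x, window, hop):
    """Squared-magnitude short-time spectra, one list of bins per frame.

    Frames start every `hop` samples (the frame interval); each frame
    is windowed and transformed with a direct DFT (bins 0 .. Nw/2).
    """
    nw = len(window)
    frames = []
    for start in range(0, len(x) - nw + 1, hop):
        seg = [window[n] * x[start + n] for n in range(nw)]
        spectrum = []
        for k in range(nw // 2 + 1):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / nw)
                      for n in range(nw))
            spectrum.append(abs(acc) ** 2)
        frames.append(spectrum)
    return frames


# 1 kHz tone sampled at 8 kHz; 64-sample window, 32-sample frame interval.
fs = 8000
tone = [math.cos(2.0 * math.pi * 1000.0 * n / fs) for n in range(256)]
S = spectrogram(tone, hamming(64), hop=32)
peak_bins = [frame.index(max(frame)) for frame in S]
```

With a 64-sample window, bin k corresponds to k * 8000 / 64 = 125k Hz, so the 1 kHz tone should peak at bin 8 in every frame.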

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

  • $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position $\tau$, one could plot $S(\omega, \tau)$ as a function of $\omega$.
  • A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
      • Narrowband: gives good spectral resolution, e.g. a good view of the frequency content of sine waves with closely spaced frequencies.
      • Wideband: gives good temporal resolution, e.g. a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.

Narrowband spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms centered at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
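The two window durations quoted for Figure 3.15 can be checked against the rules of thumb given earlier (at least two pitch periods for narrowband, less than one pitch period for wideband). The sketch below assumes a 100 Hz pitch and a 10 kHz sampling rate, neither of which is stated on the slides, and uses the standard estimate that a Hamming window of duration T has a null-to-null main-lobe width of about 4/T Hz. The function name is hypothetical.

```python
def window_summary(duration_s, fs_hz, pitch_hz):
    """Classify an analysis window by how many pitch periods it spans.

    Main-lobe width uses the Hamming-window estimate 4 / T Hz
    (null to null), an assumption noted in the lead-in.
    """
    pitch_periods = duration_s * pitch_hz
    if pitch_periods >= 2:
        kind = "narrowband"
    elif pitch_periods < 1:
        kind = "wideband"
    else:
        kind = "intermediate"
    return {
        "length_samples": round(duration_s * fs_hz),
        "pitch_periods": pitch_periods,
        "mainlobe_hz": 4.0 / duration_s,
        "kind": kind,
    }


narrow = window_summary(0.020, 10000, 100.0)  # 20-ms window
wide = window_summary(0.004, 10000, 100.0)    # 4-ms window
```

Under these assumptions the 20-ms window spans two pitch periods with a main lobe of roughly 200 Hz, enough to separate 100 Hz harmonics, while the 4-ms window spans less than half a period with a main lobe of roughly 1000 Hz, which smears the harmonics together.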

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: normalized-magnitude waveform and spectrogram of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels. Axes: Time [s], 0 to 3; Normalized Magnitude; Frequency [Hz], 0 to 8000.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract. Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract, i.e. the place and manner of articulation: the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

  • Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics. If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open.
  • Tongue position and height: front, central, or back along the palate.
  • Constriction: partial or complete.
  • Velum state: nasal sound or not a nasal sound.

Elements of a Language

  • In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
  • The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
  • The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
  • Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram: dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract; the constriction is narrower than with vowels.

System: the location of the constriction formed by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component is modeled as the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4 (cont.)

In the simplified model, we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:

$u[n]$ is modified by the oral cavity.

$x_q[n]$ can be influenced by the back cavity.

Sources of nonlinear effects (distributed sources due to traveling vortices).
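The additive voiced-fricative model of Example 3.4 can be explored numerically. The following is only a minimal sketch under stated assumptions: the raised-cosine glottal pulse, the decaying-cosine sequences standing in for $h[n]$ and $h_f[n]$, and the function names are illustrative choices, not taken from the slides.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(n_samples=400, period=80, seed=0):
    """Return (u, xq, x) for x[n] = h*u + hf*(q u); all shapes are assumed."""
    rng = random.Random(seed)
    # Hypothetical one-cycle glottal pulse g[n]: a 40-sample raised cosine.
    g = [0.5 * (1 - math.cos(2 * math.pi * k / 39)) for k in range(40)]
    # Impulse train p[n] with pitch period P = `period` samples.
    p = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    u = convolve(g, p)[:n_samples]                       # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # white noise q[n]
    # Hypothetical impulse responses for the full tract and the front cavity.
    h = [0.9 ** k * math.cos(0.3 * k) for k in range(60)]
    hf = [0.6 ** k * math.cos(2.0 * k) for k in range(20)]
    xg = convolve(h, u)[:n_samples]          # periodic component x_g[n]
    qu = [qn * un for qn, un in zip(q, u)]   # noise modulated by glottal flow
    xq = convolve(hf, qu)[:n_samples]        # noise component x_q[n]
    x = [a + b for a, b in zip(xg, xq)]      # the two components add
    return u, xq, x
```

In this sketch the noise component vanishes wherever the glottal flow has been zero for longer than the front-cavity response lasts, so the noise arrives in pitch-synchronous bursts, which is the modulation effect the model is meant to capture.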

Elements of a Language: Fricatives

Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps. While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and the delay between the burst and the voicing of the vowel onset is much shorter.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (waveform)

Elements of a Language: Plosives (spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator, where we have assumed that the two outputs can be linearly combined:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\,u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\,\delta[n - m]$$
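The generalized convolution of Example 3.5 can be sketched directly from its definition. This is an illustration only: the function names and the particular drifting-decay response `decaying_tract` are hypothetical, and the sum is truncated to $0 \le m \le n$ on the assumption that the input is zero before $n = 0$ and the system is causal.

```python
def time_varying_output(u, h):
    """x[n] = sum_m h(n, m) * u[n - m].

    h(n, m) is the response at time n to a unit sample applied m samples
    earlier. The input is assumed zero for n < 0 and h causal, so the sum
    runs over m = 0 .. n.
    """
    return [
        sum(h(n, m) * u[n - m] for m in range(n + 1))
        for n in range(len(u))
    ]


def decaying_tract(n, m):
    """Hypothetical time-varying response whose decay rate drifts with n."""
    a = 0.5 + 0.4 * min(n, 50) / 50   # moves from 0.5 toward 0.9
    return a ** m


# Idealized burst source delta[n] exciting the time-varying response.
burst = [1.0] + [0.0] * 9
x_burst = time_varying_output(burst, decaying_tract)
```

When `h(n, m)` does not depend on `n`, the same function reduces to ordinary convolution, which is a convenient sanity check on the definition.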

Elements of a Language: Transitional Speech Sounds

Diphthongs:

Have a vowel-like nature, with vibrating vocal folds.

Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.

Are characterized by movement from one vowel target to another.

Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds, glides (w as in "we" and y as in "you") and liquids (r as in "read" and l as in "let").

Glides: greater constriction of the oral tract during the transition, and greater speed of the oral tract movement, compared to diphthongs.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).

Coarticulation

The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").

Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that define changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping: "Nose"
  • Spectral Shaping: "Mouse"
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation: Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation: Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language: Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language: Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language: Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language: Fricatives (2)
  • Elements of a Language: Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language: Plosives
  • Elements of a Language: Plosives (2)
  • Elements of a Language: Plosives (3)
  • Elements of a Language: Plosives (4)
  • Elements of a Language: Transitional Speech Sounds
  • Elements of a Language: Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language: Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody: The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation

Lungs

Inhalation and exhalation of air.

Connected to the vocal tract through the trachea ("windpipe"), a pipe ~12 cm long and ~1.5-2 cm in diameter, and the epiglottis.

During speaking, the rhythmical and synchronized cycle of inhalation and exhalation changes to accommodate speech production. The duration of exhalation becomes roughly equal to the length of the sentence or phrase. Lung air pressure during this time is maintained at a constant level slightly above atmospheric pressure.

Anatomy and Physiology of Speech Production: Larynx

Complicated system of cartilages, flesh, muscles, and ligaments.

Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3. The vocal folds are ~15 mm long in men and ~13 mm long in women.

Larynx

Anatomy and Physiology of Speech Production

Three primary states of the vocal folds:

Breathing: the arytenoid cartilages are held outward.

Voiced: the arytenoid cartilages are held close together.

Unvoiced: the arytenoid cartilages are held outward or partially closed.

Complex motion of the vocal folds is illustrated in Figure 3.4.

Nonlinear two-mass model of Flanagan et al. (Figure 3.5).

Dictionary: arytenoid (ar·y·te·noid)
Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid
Function: adjective
Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, ladle
Date: circa 1751
1. relating to or being either of two small laryngeal cartilages to which the vocal cords are attached
2. relating to or being either of a pair of small muscles or an unpaired muscle of the larynx
Derived form: arytenoid, noun

Anatomy and Physiology of Speech Production: Flanagan et al. model

Anatomy and Physiology of Speech Production

If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:

Closed phase: the folds are closed and no flow occurs.

Open phase: the folds are open and the flow increases up to a maximum.

Return phase: the time interval from the maximum airflow until the glottal closure.

The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

Glottal airflow is referred to as glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.

Anatomy and Physiology of Speech Production

Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$ as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain, with $\circledast$ denoting convolution in frequency:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega - \omega_k) \right]$$

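Example 3.1 (cont.)

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k)\,\delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum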

Figure 3.7

Example 3.1 (cont.)

A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$. The first harmonic is also the fundamental frequency.

At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.

The magnitude of the spectral shaping function of the glottal flow, $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
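The harmonic structure described in Example 3.1 can be checked with a direct DFT of a windowed, periodically repeated pulse. This is a rough sketch under assumptions that are not in the slides: the glottal pulse is an 8-sample raised cosine, the window is a Hamming window covering the whole segment, and the function name is hypothetical.

```python
import cmath
import math


def windowed_glottal_spectrum(n_total=256, period=16):
    """DFT magnitude (bins 0 .. n_total/2) of a windowed periodic pulse train."""
    # Hypothetical one-cycle glottal pulse g[n]: an 8-sample raised cosine.
    g = [0.5 * (1 - math.cos(2 * math.pi * k / 8)) for k in range(8)]
    # u[n] = g[n] * p[n]: the pulse repeated every `period` samples.
    u = [g[n % period] if n % period < len(g) else 0.0 for n in range(n_total)]
    # Hamming analysis window spanning the whole segment.
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_total - 1))
         for n in range(n_total)]
    seg = [wn * un for wn, un in zip(w, u)]
    mags = []
    for k in range(n_total // 2 + 1):
        acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_total)
                  for n in range(n_total))
        mags.append(abs(acc))
    return mags


# With N = 256 samples and P = 16, harmonics fall every N / P = 16 bins.
mags_p16 = windowed_glottal_spectrum(256, 16)
# Doubling the pitch period to P = 32 halves the spacing to 8 bins.
mags_p32 = windowed_glottal_spectrum(256, 32)
```

Each spectral peak is a copy of the window transform placed at a harmonic and scaled by the pulse spectrum at that harmonic, so the peaks trace out the envelope $|G(\omega)|$.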

Anatomy and Physiology of Speech Production

The Fourier transform of a periodic glottal waveform is characterized by harmonics. Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time. The pitch period can vary "randomly" over successive periods (pitch "jitter"), and the amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods (amplitude "shimmer").

Those variations are (perhaps) due to time-varying characteristics of the vocal tract and vocal folds, or to nonlinear behavior in the speech anatomy, or they may appear random while being the result of an underlying deterministic (chaotic) system.

Jitter and shimmer are one factor that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., hoarse voice).
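Jitter and shimmer can be mimicked by perturbing the spacing and amplitude of the idealized impulse train $p[n]$. The sketch below is illustrative only: it draws independent uniform perturbations, which is an assumption made for simplicity (as noted above, the real variations may come from a deterministic system), and the function and parameter names are hypothetical.

```python
import random


def glottal_impulse_train(n_periods, period, jitter=0.0, shimmer=0.0, seed=0):
    """Return (times, amps) of glottal pulses with optional jitter and shimmer.

    jitter  : maximum fractional deviation of each pitch period
    shimmer : maximum fractional deviation of each pulse amplitude
    """
    rng = random.Random(seed)
    times, amps = [], []
    t = 0.0
    for _ in range(n_periods):
        times.append(round(t))                               # pulse position in samples
        amps.append(1.0 + shimmer * rng.uniform(-1.0, 1.0))  # amplitude shimmer
        t += period * (1.0 + jitter * rng.uniform(-1.0, 1.0))  # period jitter
    return times, amps


# Monotone, "machine-like" excitation: fixed period and amplitude.
ideal_times, ideal_amps = glottal_impulse_train(5, 80)
# Perturbed excitation: 2% jitter and 10% shimmer.
nat_times, nat_amps = glottal_impulse_train(50, 80, jitter=0.02, shimmer=0.1, seed=1)
```

Setting both parameters to zero recovers the strictly periodic train of Example 3.1.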

Anatomy and Physiology of Speech Production

States of the vocal folds: breathing, voicing, and unvoicing.

Unvoicing: turbulence at the vocal folds (aspiration). Example: "he"; whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:

Creaky voice: very tense vocal folds, with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.

Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.

Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract

Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.

The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

Spectral Shaping

Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants (resonance frequencies) change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants. For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:

The frequency of the formant is $\omega_0$.

The bandwidth depends on the distance of the pole from the unit circle ($r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
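The link between a pole location and a formant can be made concrete for a single complex-conjugate pole pair. The sketch below is an illustration, not part of the slides: the sampling rate `fs`, the function names, and in particular the bandwidth formula $B \approx -(f_s/\pi)\ln r_0$ are assumptions, the latter being a commonly used approximation that holds for poles close to the unit circle.

```python
import cmath
import math


def pole_to_formant(r0, omega0, fs):
    """Map a pole r0 * exp(j*omega0) to (frequency, bandwidth) in Hz.

    The bandwidth uses the near-unit-circle approximation
    B ~= -(fs / pi) * ln(r0), which is an assumption of this sketch.
    """
    freq_hz = omega0 * fs / (2 * math.pi)
    bandwidth_hz = -math.log(r0) * fs / math.pi
    return freq_hz, bandwidth_hz


def resonator_gain(r0, omega0, omega):
    """|H(e^{j omega})| for one conjugate pole pair with unit numerator."""
    z_inv = cmath.exp(-1j * omega)
    c = r0 * cmath.exp(1j * omega0)
    return abs(1.0 / ((1 - c * z_inv) * (1 - c.conjugate() * z_inv)))


# Hypothetical formant at 500 Hz with an assumed 8 kHz sampling rate.
w0 = 2 * math.pi * 500 / 8000
freq, bw = pole_to_formant(0.95, w0, 8000)
```

Moving the pole toward the unit circle (larger `r0`) narrows the bandwidth and raises the gain at the resonance, which can be seen by comparing `resonator_gain(0.98, w0, w0)` with `resonator_gain(0.9, w0, w0)`.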

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases: male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.
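The dependence of formants on tract length can be illustrated with the textbook idealization of a uniform lossless tube, closed at the glottis and open at the lips, whose resonances are $F_k = (2k - 1)c / (4L)$. This model and the speed of sound of 340 m/s are assumptions brought in for illustration; a real vocal tract is not uniform, so the numbers are only indicative.

```python
def uniform_tube_formants(length_m, n_formants=3, speed_of_sound=340.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Quarter-wavelength model: F_k = (2k - 1) * c / (4 * L). Both the model
    and the default speed of sound are assumptions of this sketch.
    """
    return [
        (2 * k - 1) * speed_of_sound / (4.0 * length_m)
        for k in range(1, n_formants + 1)
    ]


# 17 cm tract (the average adult male length) versus a shorter 14 cm tract.
long_tract = uniform_tube_formants(0.17)
short_tract = uniform_tube_formants(0.14)
```

For the 17 cm tube the model places the first three resonances near 500, 1500, and 2500 Hz, and every resonance moves up when the tube is shortened.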

Under the assumption of vocal tract linearity and time-invariance, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained.

Example 3.2 (cont.)

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.

Figure 3.11 (next slide) illustrates that the spectral shaping of the window transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).

Example 3.2 (Figure 3.11)

Example 3.2 (cont.)

The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ only sparsely. This is especially the case for high-pitched speech.
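The sparse-sampling effect can be quantified with simple arithmetic on the harmonic frequencies $k F_0$. The snippet below is a small illustration; the function names and the example values (a 550 Hz formant, a 3 kHz band, pitches of 100 Hz and 400 Hz) are hypothetical and not taken from the slides.

```python
def harmonics_below(f0_hz, fmax_hz):
    """Harmonic frequencies k * f0 that fall at or below fmax."""
    return [k * f0_hz for k in range(1, int(fmax_hz // f0_hz) + 1)]


def nearest_harmonic_offset(f0_hz, formant_hz):
    """Distance in Hz from a formant to the closest source harmonic."""
    k = max(1, round(formant_hz / f0_hz))
    return abs(k * f0_hz - formant_hz)


# Low-pitched voice: many samples of the envelope below 3 kHz.
low_pitch = harmonics_below(100, 3000)
# High-pitched voice: only a handful of samples in the same band.
high_pitch = harmonics_below(400, 3000)
```

At a 100 Hz pitch the envelope is sampled 30 times below 3 kHz and a harmonic lies within 50 Hz of the example formant, whereas at 400 Hz there are only 7 samples and the nearest harmonic is 150 Hz away, so the formant peak is easily missed.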

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:

A formant corresponds to a vocal tract pole (resonant frequency).

Harmonics arise due to the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:

Broadening of the formant bandwidth of the oral tract, because of the loss of energy through the nasal passage.

Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

In the previous section, the effect of the vocal tract shape on sound production was discussed.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive): pressure builds up behind the closure and is then abruptly released.

Source Generation: Plosives "Drop" (figure annotations: VOT, aspiration)

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source). There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:

Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".

Plosives: created with an impulsive source within the oral tract. Example: "top".

Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these source types do not relate exclusively to unvoiced sounds: the vocal folds can be vibrating simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives).

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Take the word "to" as an example: a single Fourier transform of the entire acoustic signal of the word cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2, presented earlier, a sliding (analysis) window concept was introduced. This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example (Hamming window):

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n - \tau \le N_w - 1$$

The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal. The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where $x[n, \tau] = w[n, \tau]\,x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
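The Hamming window and the short-time Fourier transform above can be sketched with the standard library alone. This is a minimal illustration rather than an efficient implementation: the function names, the uniform frequency grid on $[0, \pi)$, and the convention that $\tau$ indexes the first sample under the window are assumptions made for the sketch.

```python
import cmath
import math


def hamming(nw):
    """Hamming window of length nw: 0.54 - 0.46*cos(2*pi*m/(nw - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (nw - 1)) for m in range(nw)]


def stft_frame(x, tau, nw, omegas):
    """X(omega, tau) = sum_n w[n, tau] x[n] exp(-j omega n) for one frame.

    tau is taken as the first sample under the window (an assumption);
    the caller must ensure tau + nw <= len(x).
    """
    w = hamming(nw)
    return [
        sum(w[m] * x[tau + m] * cmath.exp(-1j * om * (tau + m)) for m in range(nw))
        for om in omegas
    ]


def spectrogram(x, nw, hop, n_freq):
    """S(omega, tau) = |X(omega, tau)|^2 on n_freq frequencies in [0, pi)."""
    omegas = [math.pi * i / n_freq for i in range(n_freq)]
    return [
        [abs(v) ** 2 for v in stft_frame(x, tau, nw, omegas)]
        for tau in range(0, len(x) - nw + 1, hop)
    ]


# A sinusoid at omega = pi/4 should peak at the same frequency bin in every frame.
tone = [math.cos(math.pi / 4 * n) for n in range(256)]
S = spectrogram(tone, nw=64, hop=32, n_freq=64)
```

The `hop` argument plays the role of the frame interval: the window advances by `hop` samples between successive frames instead of one sample at a time.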

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a two-dimensional representation of the "energy density" of the signal.

For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:

Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.

Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and the wideband spectrogram is the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram

Uses a "long" window with a duration of typically at least two pitch periods. Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation follows from the equation in the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech
Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where β is a constant scale factor and where E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
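How the energy under the sliding window behaves can be illustrated with a small sketch. The pulse shape, the pitch period of 80 samples, and the rectangular window are illustrative assumptions, not values taken from the slides.

```python
import math

def short_time_energy(x, start, length):
    """Energy of x under a rectangular window of `length` samples starting at `start`."""
    return sum(v * v for v in x[start:start + length])

# Hypothetical steady-state voiced sound: one decaying resonance repeated every P samples.
P = 80                                   # assumed pitch period in samples
pulse = [0.95 ** n * math.cos(0.3 * n) for n in range(P)]
x = [pulse[n % P] for n in range(10 * P)]

def energy_ratio(length):
    """Max/min window energy as the window slides across two pitch periods."""
    energies = [short_time_energy(x, s, length) for s in range(2 * P, 4 * P)]
    return max(energies) / min(energies)

short_ratio = energy_ratio(P // 4)       # window shorter than a pitch period
long_ratio = energy_ratio(2 * P)         # window spanning two pitch periods
```

With the short window the energy rises and falls once per pitch period, while with the two-period window it stays essentially constant as the window slides.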

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
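The effect of the two window lengths can be checked with a rough calculation. The sketch below uses the textbook approximation that an N-point Hamming window has a null-to-null main-lobe width of about 8π/N rad/sample, i.e. about 4/T Hz for a window of T seconds. The 125 Hz pitch and the resolution criterion (each harmonic peak lies at or beyond the first null of its neighbor) are assumptions made for illustration only.

```python
def hamming_mainlobe_width_hz(duration_s):
    """Approximate null-to-null main-lobe width of a Hamming window, in Hz.

    Uses the approximation 8*pi/N rad/sample for an N-point window,
    which is 4/T Hz for a window lasting T seconds.
    """
    return 4.0 / duration_s

def harmonics_resolved(duration_s, f0_hz):
    """Loose criterion: half the main-lobe width does not exceed the harmonic spacing."""
    return hamming_mainlobe_width_hz(duration_s) / 2.0 <= f0_hz

f0 = 125.0                               # assumed pitch in Hz (8-ms pitch period)
narrow = harmonics_resolved(0.020, f0)   # 20-ms window: about 200 Hz main lobe
wide = harmonics_resolved(0.004, f0)     # 4-ms window: about 1000 Hz main lobe
```

Under these assumptions the 20-ms window separates the harmonics, whereas the 4-ms window merges several harmonics into one lobe and so follows the spectral envelope instead.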

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram (frequency in Hz, 0 to 8000, versus time in s, 0 to 3) and normalized-magnitude waveform of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.

If the first two descriptors are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open.

Tongue position and height: front, central, or back along the palate.

Constriction: partial or complete.

Velum state: nasal sound or not a nasal sound.

Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
Articulatory differences among speakers are one cause of allophonic variations.
The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.

The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.

Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.

Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model of the voiced fricative, the output at the lips due to the periodic source is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is modeled as the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity that has impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4 (cont.)
We assume in this simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
u[n] is modified by the oral cavity.
x_q[n] can be influenced by the back cavity.
Sources of non-linear effects (distributed sources due to traveling vortices).
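The simplified two-source model of Example 3.4 can be sketched numerically as follows. The glottal pulse, the two impulse responses, and the Gaussian noise generator are toy assumptions chosen only to make the sketch run; they are not values from the slides.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def voiced_fricative(g, P, n_periods, h, h_f, seed=0):
    """x[n] = h[n]*u[n] + h_f[n]*(q[n]u[n]), with u[n] = g[n]*p[n] and white noise q[n]."""
    rng = random.Random(seed)
    p = [1.0 if n % P == 0 else 0.0 for n in range(n_periods * P)]
    u = convolve(g, p)[:n_periods * P]             # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in u]           # white-noise source at the constriction
    modulated = [qn * un for qn, un in zip(q, u)]  # glottal flow modulates the noise
    x_g = convolve(h, u)                           # periodic component at the lips
    x_q = convolve(h_f, modulated)                 # noise component at the lips
    n = min(len(x_g), len(x_q))
    return u, modulated, [x_g[i] + x_q[i] for i in range(n)]

# Hypothetical toy values.
g = [0.0, 0.5, 1.0, 0.5]          # glottal pulse; the flow is zero for the rest of the cycle
h = [1.0, 0.6, 0.36]              # vocal tract impulse response
h_f = [1.0, -0.5]                 # front-cavity impulse response
u, modulated, x = voiced_fricative(g, P=10, n_periods=5, h=h, h_f=h_f)
```

Because the noise is multiplied by u[n], it is switched off whenever the glottal flow is zero, so the noise in x[n] arrives in bursts at the pitch rate.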

Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.

Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration. There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives
Waveform

Elements of a Language: Plosives
Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond the constriction, denoted by h_f[n, m].

h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
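A minimal sketch of the generalized convolution in Example 3.5 is given below. The two time-varying responses are hypothetical functions invented for illustration (a decay rate that drifts with n, and a fixed front-cavity response), and the lag sum is truncated at `max_lag`, which is also an assumption of the sketch.

```python
def time_varying_output(h, u, n_samples, max_lag):
    """x[n] = sum_m h(n, m) * u[n - m], where h(n, m) is the response at time n
    to a unit sample applied m samples earlier."""
    x = []
    for n in range(n_samples):
        acc = 0.0
        for m in range(max_lag + 1):
            if 0 <= n - m < len(u):
                acc += h(n, m) * u[n - m]
        x.append(acc)
    return x

def plosive_output(h, h_f, u, n_samples, max_lag):
    """Periodic-source term plus burst term (burst idealized as a unit impulse at n = 0)."""
    burst = [1.0] + [0.0] * (n_samples - 1)
    periodic_part = time_varying_output(h, u, n_samples, max_lag)
    burst_part = time_varying_output(h_f, burst, n_samples, max_lag)
    return [a + b for a, b in zip(periodic_part, burst_part)]

# Hypothetical responses and source.
h = lambda n, m: (0.5 + 0.004 * n) ** m        # vocal tract response drifting with time n
h_f = lambda n, m: (-0.6) ** m                 # front-cavity response, fixed in this toy case
u = [1.0 if n % 10 == 0 else 0.0 for n in range(50)]
x = plosive_output(h, h_f, u, n_samples=50, max_lag=20)

# Sanity check: a response that does not depend on n reduces to ordinary convolution.
fixed = time_varying_output(lambda n, m: 0.5 ** m, [1.0, 0.0, 0.0], 3, 5)
```

When h(n, m) does not depend on n, the sum collapses to the usual convolution, which is why the time-invariant models of the earlier examples are a special case.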

Elements of a Language: Transitional Speech Sounds
Diphthongs
Vowel-like in nature, with vibrating vocal folds.

Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.

Characterized by movement from one vowel target to another.

Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

Elements of a Language: Transitional Speech Sounds
Semi-vowels
Two categories of vowel-like sounds: glides (w as in "we" and y as in "you") and liquids (r as in "read" and l as in "let").

Glides: greater constriction of the oral tract during the transition, and greater speed of the oral tract movement, compared to diphthongs.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).

Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, and so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g., "horse" vs. "horseshoe", "sweep" vs. "seep".

Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch).
Amplitude/energy (loudness).
Timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Anatomy and Physiology of Speech Production: Larynx
A complicated system of cartilages, flesh, muscles, and ligaments.

Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3. The vocal folds are about 15 mm long in men and about 13 mm long in women.

Larynx

Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:

Breathing: arytenoid cartilages are held outward.

Voiced: arytenoid cartilages are held close together.

Unvoiced: arytenoid cartilages are held outward or partially closed.

The complex motion of the vocal folds is illustrated in Figure 3.4.

Nonlinear two-mass model of Flanagan et al. (Figure 3.5).

Dictionary
Arytenoid (ar·y·te·noid). Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally, ladle-shaped, from arytaina, ladle. Date: circa 1751. 1: relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. 2: relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Also: arytenoid, noun.

Anatomy and Physiology of Speech Production
Flanagan et al. model

Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6.
Closed phase: the folds are closed and no flow occurs.
Open phase: the folds are open and the flow increases up to a maximum.
Return phase: the time interval from the maximum airflow until the glottal closure.
The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

Glottal airflow is referred to as glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, also as the fundamental frequency.

Anatomy and Physiology of Speech Production

Example 3.1
Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by w[n, τ], is centered at time τ, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain, where ⊛ denotes convolution in frequency:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k)\right]$$

Example 3.1 (cont.)

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega_k)\, \delta(\omega - \omega_k)\right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], G(ω) is the Fourier transform of g[n], ω_k = (2π/P)k, and 2π/P is the fundamental frequency, or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0 with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum:

Figure 3.7

Example 3.1 (cont.)
A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform W(ω - ω_k) weighted by G(ω_k).

The magnitude of the spectral shaping function, i.e., of the glottal flow transform |G(ω)|, is referred to as the spectral envelope of the harmonics.
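The harmonic-sum expression of Example 3.1 can be checked numerically. In the sketch below the one-cycle pulse g[n], the pitch period of 50 samples, and the 200-point Hamming window starting at n = 0 are all illustrative assumptions. The direct transform of the windowed segment is compared with (1/P) Σ_k G(ω_k) W(ω - ω_k), summing k over one period of harmonics.

```python
import cmath
import math

def dtft(x, omega):
    """DTFT of a finite sequence x[0..len-1] evaluated at omega (rad/sample)."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(x))

P = 50                                   # assumed pitch period in samples
g = [0.8 ** n for n in range(20)]        # hypothetical one-cycle glottal pulse (shorter than P)
L = 200                                  # window length: four pitch periods
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)) for n in range(L)]  # Hamming window

# Windowed segment of the periodic waveform u[n] = g[n] * p[n].
u = [g[n % P] if n % P < len(g) else 0.0 for n in range(L)]
segment = [wn * un for wn, un in zip(w, u)]

def harmonic_sum(omega):
    """(1/P) * sum_k G(w_k) W(omega - w_k), with k running over one period of harmonics."""
    total = 0j
    for k in range(P):
        w_k = 2 * math.pi * k / P
        total += dtft(g, w_k) * dtft(w, omega - w_k)
    return total / P

omega_1 = 2 * math.pi / P                # first harmonic (fundamental frequency)
between = 1.5 * omega_1                  # midway between the first two harmonics
direct_peak = abs(dtft(segment, omega_1))
direct_valley = abs(dtft(segment, between))
mismatch = abs(dtft(segment, 0.3) - harmonic_sum(0.3))
```

Under these assumptions the two computations agree to rounding error, and the magnitude at the first harmonic is much larger than midway between harmonics, which is the harmonic line structure described above.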

Anatomy and Physiology of Speech Production
The Fourier transform of a periodic glottal waveform is characterized by harmonics.
Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
The pitch period can vary "randomly" over successive periods: pitch "jitter".
The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".

These variations are (perhaps) due to:
Time-varying characteristics of the vocal tract and vocal folds.
Nonlinear behavior in the speech anatomy.
They may also appear random while being the result of an underlying deterministic (chaotic) system.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., hoarse voice).
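Jitter and shimmer can be quantified in several ways; the slides do not give a formula. The sketch below uses one common choice, assumed here for illustration: the mean absolute difference between consecutive cycles, divided by the mean value. The period and amplitude measurements are made-up numbers.

```python
def local_jitter(periods):
    """Mean absolute difference between consecutive pitch periods, relative to the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods[1:], periods[:-1])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """Same perturbation measure applied to the peak amplitude of each glottal cycle."""
    diffs = [abs(a - b) for a, b in zip(amplitudes[1:], amplitudes[:-1])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

# Hypothetical measurements for five consecutive glottal cycles.
periods_ms = [10.0, 10.2, 9.9, 10.1, 9.8]
peak_amps = [1.00, 0.96, 1.02, 0.98, 1.04]

jitter = local_jitter(periods_ms)        # relative period perturbation
shimmer = local_shimmer(peak_amps)       # relative amplitude perturbation
monotone = local_jitter([10.0] * 5)      # fixed pitch period: the "machine-like" case
```

A perfectly regular pulse train gives zero for both measures, which corresponds to the monotone, machine-like case mentioned above.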

Anatomy and Physiology of Speech Production
States of the vocal folds: breathing, voicing, and unvoicing.

Unvoicing: turbulence at the vocal folds (aspiration). Example: "he"; whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:

Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.

Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of the coupling of the false vocal folds with the true vocal folds.

Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract
Comprises the oral cavity from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

Spectral Shaping
Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

Consider a time-invariant all-pole linear system model of the vocal tract with a pole at z_0 = r_0 e^{jω_0} that corresponds approximately to a vocal tract formant:
The frequency of the formant is ω_0.
The bandwidth depends on the distance of the pole from the unit circle (r_0).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
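The link between a pole and a formant can be illustrated for a single conjugate pole pair. The sketch below assumes the common approximation r_0 = exp(-πB/f_s) between pole radius and bandwidth B; that relation, the 8 kHz sampling rate, and the 500 Hz / 80 Hz formant are assumptions of the example, not values from the slides.

```python
import cmath
import math

def pole_from_formant(freq_hz, bandwidth_hz, fs_hz):
    """Pole z0 = r0 * exp(j*w0) for a resonance at freq_hz with the given bandwidth.

    Uses the approximation r0 = exp(-pi * B / fs), assumed for this sketch.
    """
    r0 = math.exp(-math.pi * bandwidth_hz / fs_hz)
    w0 = 2 * math.pi * freq_hz / fs_hz
    return cmath.rect(r0, w0)

def magnitude_response(poles, freq_hz, fs_hz):
    """|H| at freq_hz for H(z) = 1 / prod_k (1 - c_k z^-1)(1 - conj(c_k) z^-1)."""
    z_inv = cmath.exp(-1j * 2 * math.pi * freq_hz / fs_hz)
    denom = 1.0 + 0j
    for c in poles:
        denom *= (1 - c * z_inv) * (1 - c.conjugate() * z_inv)
    return 1.0 / abs(denom)

fs = 8000.0                                   # assumed sampling rate in Hz
pole = pole_from_formant(500.0, 80.0, fs)     # hypothetical formant: 500 Hz, 80 Hz wide
# Search a 1-Hz grid from 0 to fs/2 for the spectral peak.
peak_hz = max(range(0, 4001), key=lambda f: magnitude_response([pole], f, fs))
```

The spectral peak falls very close to the pole angle, and moving the pole toward the unit circle (a larger r_0) narrows the resonance.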

Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
Male speakers tend to have lower formants than female speakers.
Female speakers tend to have lower formants than children.

Under the vocal tract linearity and time-invariance assumption, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.

Vowels

Example 3.2
Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained.

Example 3.2 (cont.)

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k)\right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency, or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the window transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).

Example 3.2

Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)| because of the sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:

A formant corresponds to a vocal tract pole (resonant frequency).

Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
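The benefit of moving F1 onto the first harmonic can be illustrated with a single two-pole resonance. All numbers below (a 700 Hz fundamental, F1 at 400 Hz versus 700 Hz, pole radius 0.97, 16 kHz sampling rate) are hypothetical and chosen only to show the trend.

```python
import cmath
import math

def resonator_gain(formant_hz, freq_hz, fs_hz=16000.0, r=0.97):
    """|H| at freq_hz for a single two-pole resonance centred near formant_hz."""
    c = cmath.rect(r, 2 * math.pi * formant_hz / fs_hz)
    z_inv = cmath.exp(-1j * 2 * math.pi * freq_hz / fs_hz)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

f0 = 700.0                                   # hypothetical soprano fundamental in Hz
gain_low_f1 = resonator_gain(400.0, f0)      # F1 well below the first harmonic
gain_tuned_f1 = resonator_gain(700.0, f0)    # F1 raised to match the first harmonic
```

With these values the first harmonic is amplified several times more strongly once F1 is tuned to it, which is consistent with the louder sound described in the example.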

Figure 3.12

Nasal Sounds

Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping Because the nasal cavity (unlike the oral tract) is

essentially constant characteristics of nasal sounds may be particularly useful in speaker identification

Velum can be lowered even when the vocal tract is open When this coupling occurs the resulting sound is said to be

nasalized (eg nasalized vowel) There are two dominant effects that characterize

nasalization Broadening of the formant bandwidth of oral tract because

of loss of energy through nasal passage Introduction of anti-resonances (ie zeros in the vocal tract

transfer function) due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation
In the previous section, the effect of vocal tract shape on sound production was discussed.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  Build-up of pressure behind the palate, and
  Abrupt release of pressure

Source Generation: Plosives, “Drop”
[Figure annotations: VOT (voice onset time), Aspiration]

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (without completely blocking the airflow), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise spectrum was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, “NASA”

Source Generation
There is another class of source generated within the vocal tract; however, it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source
Unvoiced: speech sounds not generated with a periodic glottal source
There are a variety of unvoiced sounds:
  Fricatives – sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”
  Plosives – created with an impulsive source within the oral tract. Example: “top”
  Whispers – a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”

However, these unvoiced categories do not relate exclusively to the sound source. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
  “zebra” vs. “sheba” – fricatives
  “bin” vs. “pin” – plosives


Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word “to”. A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
In Examples 3.1 and 3.2 presented earlier, the concept of a sliding (analysis) window was introduced.
  This window, $w[n,\tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: Hamming window
  $$w[n,\tau] = 0.54 - 0.46\cos\!\left[\frac{2\pi(n-\tau)}{N_w-1}\right] \quad \text{for } 0 \le n \le N_w-1$$

The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is
$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\,e^{-j\omega n}$$
where
$$x[n,\tau] = w[n,\tau]\,x[n]$$
represents the windowed speech segments as a function of the window center at time $\tau$.
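As a minimal sketch of the sliding-window analysis above, the following pure-Python code builds a Hamming window and computes a DFT for each windowed frame. The function names, the hop size, and the test sinusoid are illustrative assumptions; for simplicity each frame is indexed from its own first sample rather than from an explicit window center $\tau$.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n / (N_w - 1)), 0 <= n <= N_w - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft_frame(x, start, window, n_freq):
    """n_freq-point DFT of the windowed segment of x that begins at sample `start`."""
    seg = [window[i] * x[start + i] for i in range(len(window))]
    return [sum(s * cmath.exp(-2j * math.pi * k * i / n_freq) for i, s in enumerate(seg))
            for k in range(n_freq)]

def stft(x, n_w, hop, n_freq):
    """Move the window by `hop` samples (the frame interval) and transform each frame."""
    window = hamming(n_w)
    return [stft_frame(x, start, window, n_freq)
            for start in range(0, len(x) - n_w + 1, hop)]

# Toy input (an assumption for illustration): a sinusoid falling on bin 8 of a 64-point DFT.
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft(x, n_w=64, hop=32, n_freq=64)

# Strongest bin in the first frame, searched over the non-negative frequencies.
peak_bin = max(range(33), key=lambda k: abs(frames[0][k]))
```

The spectrogram value for a frame is then the squared magnitude of each entry of that frame's transform.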

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega,\tau) = |X(\omega,\tau)|^2$$
  $S(\omega,\tau)$ is a 2-D (two-dimensional) representation of the “energy density” of the signal.
  For each window position $\tau$ one could plot $S(\omega,\tau)$.
  A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband – gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies
  Wideband – gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty}\delta[n-kP]$ (where $*$ denotes convolution):
$$x[n,\tau] = w[n,\tau]\,\{(p[n] * g[n]) * h[n]\}$$
$$x[n,\tau] = w[n,\tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty}\hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2$$
where $\hat{H}(\omega) = H(\omega)G(\omega)$, and where $\omega_k = \frac{2\pi}{P}k$ and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n,\tau]$.

Narrowband Spectrogram
  Uses a “long” window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side lobes are negligible,
  the equation in the previous slide leads to the following approximation (Exercise 3.8):
$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty}|\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k,\tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
  The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  From a temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega,\tau) \approx \beta\,|\hat{H}(\omega)|^2\,E[\tau]$$
  where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty}|x[n,\tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
    The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
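The narrowband/wideband trade-off can be checked on a toy signal. The sketch below assumes an 8 kHz sampling rate, so that the 20-ms and 4-ms windows become 160 and 32 samples, and uses an ideal impulse train with a 40-sample pitch period in place of real speech; all names and values are illustrative assumptions.

```python
import cmath
import math

def hamming(n_w):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def windowed_dtft_mag(x, start, n_w, omega):
    """|X(omega, tau)| for a Hamming-windowed frame of x starting at sample `start`."""
    w = hamming(n_w)
    return abs(sum(w[i] * x[start + i] * cmath.exp(-1j * omega * i) for i in range(n_w)))

def frame_energy(x, start, n_w):
    """E[tau]: energy of the Hamming-windowed frame starting at sample `start`."""
    w = hamming(n_w)
    return sum((w[i] * x[start + i]) ** 2 for i in range(n_w))

P = 40                                    # pitch period in samples (200 Hz at 8 kHz)
x = [1.0 if n % P == 0 else 0.0 for n in range(400)]
w_harm = 2 * math.pi * 3 / P              # third harmonic
w_mid = 2 * math.pi * 3.5 / P             # halfway between the third and fourth harmonics

# Narrowband: the 160-sample window spans four pitch periods, so harmonics stand out.
nb_contrast = windowed_dtft_mag(x, 0, 160, w_harm) / windowed_dtft_mag(x, 0, 160, w_mid)

# Wideband: the 32-sample window holds at most one pulse, so the spectrum is smooth ...
wb_contrast = windowed_dtft_mag(x, 24, 32, w_harm) / windowed_dtft_mag(x, 24, 32, w_mid)

# ... while the frame energy rises and falls as the window slides across each pulse.
wb_energy = [frame_energy(x, s, 32) for s in range(0, 80, 4)]
```

A large `nb_contrast` corresponds to resolved horizontal harmonic lines; a `wb_contrast` near one together with a fluctuating `wb_energy` corresponds to the vertical striations of the wideband display.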

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram; axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000)]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above its spectrogram (Frequency [Hz] vs. Time [s]) for the utterance “She had your dark suit in greasy wash water all year”, annotated with word and phone labels]

MATLAB
[Figure: spectrogram; axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000)]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above its spectrogram (Frequency [Hz] vs. Time [s]) for the utterance “She had your dark suit in greasy wash water all year”, annotated with word and phone labels]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
2. The shape of the vocal tract (place and manner of articulation):
     the place of the tongue hump along the oral tract, and
     the degree of the constriction of the hump.
     The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output
4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language
Phoneme – a fundamental distinctive unit of a language
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.

If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of
  Vowels,
  Consonants,
  Diphthongs,
  Affricates, and
  Semi-vowels.

This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the “click” language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: “butter”, “but”, and “to”, where the t in each word is somewhat different.
Motor theory of perception – uses articulatory features derived from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
Source
  Quasi-periodic
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System
  Each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram
  The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform
  Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences between speakers are one cause of allophonic variations:
    the place and degree of constriction of the tongue hump, and
    the cross-section and length of the vocal tract.
  Therefore the vocal tract formants will vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source
  Quasi-periodic airflow puffs from the vibrating vocal folds
System
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram
  Is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System
  The location of the constriction by the tongue or lips determines which sound is produced:
    the back, center, or front of the oral tract, as well as
    the teeth or lips.
Spectrogram
  Noise-like
  Energy is concentrated in the higher frequencies

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as (where $*$ denotes convolution)
$$u[n] = g[n] * p[n]$$
  $g[n]$ is the glottal flow over one cycle
  $p[n]$ is an impulse train with pitch period $P$

In a simplified model of the voiced fricative, the periodic component of the output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
  $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$

The noise source component is modeled as the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4
In the simplified model we assume that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  $u[n]$ is modified by the oral cavity
  $x_q[n]$ can be influenced by the back cavity
  Sources of nonlinear effects (distributed sources due to traveling vortices)
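The simplified model of Example 3.4 can be written out directly. The sketch below is only an illustration under stated assumptions: the one-cycle glottal pulse `g`, the vocal tract response `h`, and the front-cavity response `h_f` are made-up short sequences, and the white noise `q[n]` is drawn from a seeded Gaussian generator.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def voiced_fricative(g, P, n_periods, h, h_f, seed=0):
    """x[n] = h[n]*u[n] + h_f[n]*(q[n]u[n]), with u[n] = g[n]*p[n] and white noise q[n]."""
    rng = random.Random(seed)
    n = P * n_periods
    p = [1.0 if k % P == 0 else 0.0 for k in range(n)]   # impulse train, period P
    u = convolve(g, p)[:n]                                # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(n)]           # white noise at the constriction
    x_g = convolve(h, u)[:n]                              # periodic component
    x_q = convolve(h_f, [qi * ui for qi, ui in zip(q, u)])[:n]  # noise modulated by u[n]
    x = [a + b for a, b in zip(x_g, x_q)]                 # the two components add
    return u, x_g, x_q, x

g = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]   # hypothetical glottal pulse (closed phase = zeros)
h = [1.0, 0.6, 0.36, 0.216]                     # hypothetical vocal tract impulse response
h_f = [1.0, -0.5]                               # hypothetical front-cavity impulse response
u, x_g, x_q, x = voiced_fricative(g, P=8, n_periods=4, h=h, h_f=h_f)
```

Because the noise is multiplied by `u[n]`, the noise component vanishes wherever the glottal flow is zero, so in this model the frication noise is pulsed at the pitch rate.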

Elements of a Language: Fricatives
Spectrogram
  Unvoiced fricatives are characterized by a “noisy” spectrum, while
  voiced fricatives often show both noise and harmonics.
Waveform
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  Whisper forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 – Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System
  The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives – Waveform

Elements of a Language: Plosives – Spectrogram

Elements of a Language: Plosives
Example 3.5 A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by (where $*$ denotes convolution)
$$u[n] = g[n] * p[n]$$
  $g[n]$ is the glottal flow over one cycle
  $p[n]$ is an impulse train with pitch period $P$

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to the following steady vowel.
  This implies that the vocal tract output cannot be obtained with the (time-invariant) convolution operator.
  The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n,m]$, while the burst excites a time-varying front cavity beyond the constriction, denoted by $h_f[n,m]$.
  $h[n,m]$ and $h_f[n,m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n-m$.

The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
We have assumed that the two outputs can be linearly combined.
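The generalized convolution of Example 3.5 can be sketched as follows. This is a minimal illustration, assuming causal responses truncated to `n_taps` samples; the particular responses `h` and `h_f` (a one-pole decay whose coefficient drifts with time, and a fixed decay) are hypothetical and chosen only to make the time variation visible.

```python
def time_varying_output(u, h, n_taps):
    """x[n] = sum_m h(n, m) u[n - m], where h(n, m) is the response at time n
    to a unit sample applied m samples earlier (causal, truncated to n_taps)."""
    return [sum(h(n, m) * u[n - m] for m in range(min(n_taps, n + 1)))
            for n in range(len(u))]

def plosive_output(u, h, h_f, n_taps):
    """Sum of the periodic-source path and the burst path of Example 3.5."""
    burst = [1.0] + [0.0] * (len(u) - 1)          # delta[n]: burst at n = 0
    x_u = time_varying_output(u, h, n_taps)       # glottal flow through h[n, m]
    x_b = time_varying_output(burst, h_f, n_taps) # burst through h_f[n, m]
    return [a + b for a, b in zip(x_u, x_b)]

def h(n, m):
    """Hypothetical vocal tract: one-pole decay whose coefficient changes with time n."""
    a = 0.5 + 0.004 * min(n, 100)
    return a ** m

def h_f(n, m):
    """Hypothetical front cavity, kept fixed in time for simplicity."""
    return 0.8 ** m

P = 20
u = [1.0 if n % P == 0 else 0.0 for n in range(60)]   # g[n] taken as a unit impulse
x = plosive_output(u, h, h_f, n_taps=20)

# With a response that does not depend on n, the sum reduces to ordinary convolution.
check = time_varying_output([1.0, 2.0, 3.0], lambda n, m: 0.5 ** m, 3)
```

The last line shows the consistency check one would expect: when $h[n,m]$ does not depend on $n$, the generalized sum is the usual time-invariant convolution.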

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: “hide” (Y), “out” (W), “boy” (O), “new” (JU)

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (w as in “we” and y as in “you”), and
  Liquids (r as in “read” and l as in “let”)

Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.

Figure 3.28 – Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  Examples:
    tS as in “chew” – unvoiced
    J as in “just” – voiced

Coarticulation
The vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local – the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: “horse” vs. “horseshoe”, “sweep” vs. “seep”
  Global – the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)

These rules are followed to convey different
  meaning,
  stress, and
  emotion.

Figure 3.29 – Prosody

Figure 3.30 – Global Coarticulation

Page 12: Speech Processing

Larynx

Anatomy and Physiology of Speech Production
Three primary states of the vocal folds:
  Breathing – arytenoid cartilages are held outward
  Voiced – arytenoid cartilages are held close together
  Unvoiced – arytenoid cartilages are held outward or partially closed
The complex motion of the vocal folds is illustrated in Figure 3.4.
Nonlinear two-mass model of Flanagan et al. (Figure 3.5)

Dictionary: arytenoid (ar·y·te·noid)
  Pronunciation: ˌa-rə-ˈtē-ˌnoid, ə-ˈri-tən-ˌoid
  Function: adjective
  Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally “ladle-shaped”, from arytaina (ladle)
  Date: circa 1751
  1. relating to or being either of two small laryngeal cartilages to which the vocal cords are attached
  2. relating to or being either of a pair of small muscles or an unpaired muscle of the larynx
  Noun form: arytenoid

Anatomy and Physiology of Speech Production: Flanagan et al. model

Anatomy and Physiology of Speech Production
If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6.
  Closed phase: the folds are closed and no flow occurs.
  Open phase: the folds are open and the flow increases up to a maximum.
  Return phase: the time interval from the maximum airflow until the glottal closure.
The specific flow shape can change with
  the speaker,
  the speaking style, and
  the specific speech sound.

Glottal airflow is referred to as glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.

Anatomy and Physiology of Speech Production

Example 3.1
Consider a glottal flow waveform model of the form (where $*$ denotes convolution)
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:
$$p[n] = \sum_{k=-\infty}^{\infty}\delta[n-kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n,\tau]$, is centered at time $\tau$, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as
$$u[n,\tau] = w[n,\tau]\,(g[n] * p[n])$$
Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain (where $\circledast$ denotes convolution in frequency):
$$U(\omega,\tau) = \frac{1}{P}\,W(\omega,\tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega-\omega_k)\right]$$

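Example 3.1
$$U(\omega,\tau) = \frac{1}{P}\sum_{k=-\infty}^{\infty} G(\omega_k)\left[W(\omega,\tau) \circledast \delta(\omega-\omega_k)\right]$$
$$U(\omega,\tau) = \frac{1}{P}\sum_{k=-\infty}^{\infty} G(\omega_k)\,W(\omega-\omega_k,\tau)$$
where $W(\omega,\tau)$ is the Fourier transform of $w[n,\tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = \frac{2\pi}{P}k$, where $\frac{2\pi}{P}$ is the fundamental frequency or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum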

Figure 3.7

Example 3.1
A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = \frac{2\pi}{P}k$.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform $W(\omega-\omega_k)$, weighted by $G(\omega_k)$.

The magnitude of the spectral shaping function, i.e., of the glottal flow transform $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
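The harmonic structure of Example 3.1 can be verified numerically. The sketch below assumes a made-up one-cycle pulse `g`, a pitch period of 16 samples, and a 128-sample Hamming window (eight pitch periods); it compares the DFT magnitude at the harmonic bins with the envelope value $|G(\omega_k)|\,W(0)/P$, where $W(0)$ is the sum of the window samples.

```python
import cmath
import math

def dft_mag(x):
    """Magnitudes of the DFT of x for bins 0 .. N/2."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N // 2 + 1)]

P = 16                                    # pitch period in samples
g = [0.0, 0.4, 0.9, 1.0, 0.6, 0.2]        # hypothetical one-cycle glottal pulse
N = 128                                   # analysis segment: eight pitch periods

u = [g[n % P] if n % P < len(g) else 0.0 for n in range(N)]               # u[n] = g[n] * p[n]
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)] # Hamming window
U = dft_mag([wn * un for wn, un in zip(w, u)])

# Harmonic k lies at omega_k = (2*pi/P) k, i.e. at DFT bin k*N/P.
harmonic_bins = [k * N // P for k in range(1, 4)]

def G_mag(omega):
    """|G(omega)|: magnitude of the Fourier transform of the single pulse g[n]."""
    return abs(sum(gn * cmath.exp(-1j * omega * n) for n, gn in enumerate(g)))

# Envelope prediction at each harmonic: |G(omega_k)| * W(0) / P, with W(0) = sum(w).
predicted = [G_mag(2 * math.pi * k / P) * sum(w) / P for k in range(1, 4)]
```

The spectrum peaks at the harmonic bins, is very small halfway between them, and the peak heights follow the spectral envelope set by the single-cycle pulse.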

Anatomy and Physiology of Speech Production
The Fourier transform of a periodic glottal waveform is characterized by harmonics.
  Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
  The pitch period can vary “randomly” over successive periods: pitch “jitter”.
  The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude “shimmer”.

These variations are (perhaps) due to
  time-varying characteristics of the vocal tract and vocal folds,
  nonlinear behavior in the speech anatomy, or
  an underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are one component that gives vowels their naturalness.
  In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  Voice character is determined in part by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
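Jitter and shimmer can be illustrated by perturbing the ideal impulse train of Example 3.1. The sketch below assumes independent, uniformly distributed perturbations expressed as fractions of the nominal period and amplitude; this is a convenient toy assumption, not a model proposed in the slides.

```python
import random

def glottal_impulses(n_periods, nominal_period, jitter=0.0, shimmer=0.0, seed=0):
    """Return (positions, amplitudes) of glottal pulses whose spacing and height
    are randomly perturbed by up to `jitter` and `shimmer` (fractions of nominal)."""
    rng = random.Random(seed)
    positions, amplitudes = [], []
    t = 0.0
    for _ in range(n_periods):
        positions.append(round(t))
        amplitudes.append(1.0 + shimmer * rng.uniform(-1.0, 1.0))
        t += nominal_period * (1.0 + jitter * rng.uniform(-1.0, 1.0))
    return positions, amplitudes

# Ideal model of Example 3.1: fixed period, fixed amplitude.
ideal_pos, ideal_amp = glottal_impulses(10, 80)

# Perturbed model: 2% jitter and 5% shimmer (illustrative values).
pos, amp = glottal_impulses(10, 80, jitter=0.02, shimmer=0.05)
periods = [b - a for a, b in zip(pos, pos[1:])]
```

Convolving such a perturbed pulse sequence with a glottal pulse shape gives a source whose periods and amplitudes vary from cycle to cycle, unlike the strictly periodic ideal model.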

Anatomy and Physiology of Speech Production
States of the vocal folds: breathing, voicing, and unvoicing.
  Unvoicing: turbulence at the vocal folds produces aspiration. Examples: “he”, whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice).
  Part of the vocal folds vibrates while another part is nearly fixed.

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  Creaky voice – very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
  Vocal fry – the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch.
    Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse.
    Results from coupling of the false vocal folds with the true vocal folds.
  Diplophonic voice – secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract
Comprises the oral cavity
  from the larynx
  to the lips,
including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to
  spectrally “color” the source, and
  generate new sources for sound production.

Spectral Shaping
Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants (resonance frequencies) change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  The frequency of the formant is $\omega_0$
  The bandwidth depends on the distance of the pole from the unit circle ($r_0$)
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
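To make the pole-to-formant relation concrete, the following sketch converts a pole $z_0 = r_0 e^{j\omega_0}$ into a formant frequency and bandwidth in Hz. The sampling rate `fs` and the bandwidth formula $B \approx -\ln(r_0)\, f_s / \pi$ (a common approximation for poles close to the unit circle) are assumptions of this example and do not appear in the slides.

```python
import cmath
import math

def pole_to_formant(pole, fs):
    """Map a z-plane pole to (center frequency in Hz, approximate bandwidth in Hz)."""
    r0 = abs(pole)                   # distance from the origin
    w0 = abs(cmath.phase(pole))      # pole angle in radians/sample
    freq = w0 * fs / (2 * math.pi)
    bandwidth = -math.log(r0) * fs / math.pi   # narrow-resonance approximation
    return freq, bandwidth

def formant_to_pole(freq, bandwidth, fs):
    """Inverse mapping: place a pole for a desired formant (same approximation)."""
    r0 = math.exp(-math.pi * bandwidth / fs)
    w0 = 2 * math.pi * freq / fs
    return cmath.rect(r0, w0)

# Hypothetical formant: 500 Hz center, 60 Hz bandwidth, 10 kHz sampling rate
pole = formant_to_pole(500.0, 60.0, 10000.0)
print(abs(pole), pole_to_formant(pole, 10000.0))
```

The closer $r_0$ is to 1, the narrower the resonance, which matches the statement that bandwidth depends on the pole's distance from the unit circle.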

Spectral Shaping
Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases
  Male speakers tend to have lower formants than female speakers
  Female speakers tend to have lower formants than children
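A quick way to see why formants fall as the tract gets longer is the textbook idealization of the vocal tract as a uniform lossless tube, closed at the glottis and open at the lips, whose resonances are $F_k = (2k-1)c/(4L)$. This tube model and the sound speed of 340 m/s are assumptions of the sketch below; the slides only state the qualitative trend and the 17 cm average length.

```python
def uniform_tube_formants(length_m, n_formants=3, c=340.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    length_m : tube length in meters
    c        : speed of sound in m/s (assumed value)
    """
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n_formants + 1)]

# 17 cm tract (adult male average quoted earlier) vs. a shorter 14 cm tract
print(uniform_tube_formants(0.17))  # roughly 500, 1500, 2500 Hz
print(uniform_tube_formants(0.14))  # every resonance moves up
```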

Under the assumptions that
  The vocal tract is linear and time-invariant, and
  The sound source occurs at the glottis
the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.

Vowels

Example 3.2
Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n,\tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n,\tau] = w[n,\tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained:

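Example 3.2

$$X(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) \circledast \left[ \sum_{k} H(\omega_k)\, G(\omega_k)\, \delta(\omega - \omega_k) \right]$$

$$X(\omega,\tau) = \frac{1}{P} \sum_{k} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega,\tau)$ is the Fourier transform of $w[n,\tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency, or pitch.
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).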
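The idea that the harmonics $\omega_k = 2\pi k/P$ only sample the spectral envelope can be checked numerically. The sketch below uses a single two-pole resonance as a stand-in envelope and lists its value at each harmonic for a low and a high pitch. The resonator, the 8 kHz sampling rate, and the pole radius 0.97 are illustrative assumptions, not values from the slides.

```python
import cmath
import math

def resonator_gain(w, w0, r0):
    """|H(w)| for an all-pole section with poles at r0*exp(+j*w0) and r0*exp(-j*w0)."""
    z1 = cmath.exp(-1j * w)  # z^{-1} evaluated on the unit circle
    return 1.0 / abs((1 - r0 * cmath.exp(1j * w0) * z1) *
                     (1 - r0 * cmath.exp(-1j * w0) * z1))

def harmonic_samples(P, w0, r0):
    """Envelope values (k, w_k, gain) at the harmonics w_k = 2*pi*k/P in (0, pi]."""
    return [(k, 2 * math.pi * k / P, resonator_gain(2 * math.pi * k / P, w0, r0))
            for k in range(1, P // 2 + 1)]

# Assumed setup: fs = 8000 Hz, one formant at 500 Hz (w0 = pi/8 rad/sample)
w0 = math.pi / 8
low_pitch = harmonic_samples(80, w0, 0.97)    # P = 80 samples  -> 100 Hz pitch
high_pitch = harmonic_samples(20, w0, 0.97)   # P = 20 samples  -> 400 Hz pitch

print(len(low_pitch), "harmonics sample the envelope at low pitch")
print(len(high_pitch), "harmonics sample the envelope at high pitch")
print("largest sampled gain:", max(g for _, _, g in low_pitch),
      "vs", max(g for _, _, g in high_pitch))
```

At the low pitch a harmonic falls on the formant peak, while at the high pitch the nearest harmonics straddle it, so the sampled spectrum understates the resonance.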

Example 3.2: Figure 3.11

Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
  The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and by
  The manner in which formant tails add
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega,\tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping
The previous example is important because it illustrates the difference between
  A formant (resonance frequency of the vocal tract) and
  A harmonic frequency
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds.
  Examples: "nose" and "mouse"

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation
The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  Build-up of pressure behind the palate and
  Abrupt release of pressure

Source Generation: Plosives, "Drop"
[Figure annotations: VOT, Aspiration]

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (the airflow is constricted but not completely impeded), which generates turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs
  The source spectrum is shaped at all frequencies by $|H(\omega)|$
Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract
  A vortex can be thought of as a tiny rotational airflow in the oral tract
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds

Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source
Unvoiced: speech sounds not generated with a periodic glottal source
There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin"
  Plosives: created with an impulsive source within the oral tract. Example: "top"
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"
However, these sound sources are not exclusive to unvoiced sounds: the vocal folds can be vibrating simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds.
  Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives)

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window, $w[n,\tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: Hamming window

$$w[n,\tau] = 0.54 - 0.46\cos\left[\frac{2\pi(n-\tau)}{N_w-1}\right] \quad \text{for } 0 \le n \le N_w-1$$

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is

$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n}$$

where

$$x[n,\tau] = w[n,\tau]\, x[n]$$

represents the windowed speech segments as a function of the window center at time $\tau$.

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as

$$S(\omega,\tau) = |X(\omega,\tau)|^2$$

$S(\omega,\tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega,\tau)$.
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies
  Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time
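The definitions of the short-time Fourier transform and the spectrogram can be sketched directly in code. The version below is a deliberately plain (slow) implementation using only the Python standard library; the function names, the choice of taking $\tau$ as the first sample of the window, and the uniform frequency grid are assumptions made for this illustration.

```python
import cmath
import math

def hamming(Nw):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(Nw-1)), n = 0..Nw-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]

def stft_frame(x, tau, Nw, n_freq):
    """X(w_i, tau) at w_i = 2*pi*i/n_freq for the windowed segment starting at tau."""
    w = hamming(Nw)
    seg = [w[n] * x[tau + n] for n in range(Nw)]          # x[n, tau] = w[n, tau] x[n]
    return [sum(seg[n] * cmath.exp(-2j * math.pi * i * (tau + n) / n_freq)
                for n in range(Nw))
            for i in range(n_freq)]

def spectrogram(x, Nw, hop, n_freq):
    """S(w, tau) = |X(w, tau)|^2, one row per window position tau."""
    return [[abs(X) ** 2 for X in stft_frame(x, tau, Nw, n_freq)]
            for tau in range(0, len(x) - Nw + 1, hop)]

# Test signal: a cosine that completes 8 cycles every 64 samples
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
S = spectrogram(x, Nw=64, hop=32, n_freq=64)
peak_bins = [max(range(33), key=lambda i: frame[i]) for frame in S]
print(len(S), "frames; peak bin per frame:", peak_bins)
```

The `hop` argument plays the role of the frame interval: the window moves `hop` samples between successive transforms rather than one sample at a time.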

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$:

$$x[n,\tau] = w[n,\tau]\,\{(p[n] * g[n]) * h[n]\}$$

$$x[n,\tau] = w[n,\tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega,\tau) = \frac{1}{P^2} \left| \sum_{k} \hat{H}(\omega_k)\, W(\omega-\omega_k,\tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n,\tau]$.
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods
  Under the conditions that
    The main lobes of the shifted window Fourier transforms are non-overlapping and that
    The corresponding transform side lobes are negligible
  the equation in the previous slide gives the following approximation (Exercise 3.8):

$$S(\omega,\tau) \approx \frac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega-\omega_k,\tau)|^2$$

Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive that is closely followed by a voiced sound is poorly represented)

Spectrographic Analysis of Speech
Wideband Spectrogram
  The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14)
  Shortening the window widens its Fourier transform (recall the uncertainty principle)
  Widening the window transform causes the window transforms of neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions
  From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega,\tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

  where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2$$
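The energy term $E[\tau]$ is easy to compute on its own, and doing so shows why a short window produces a pattern that repeats every pitch period. In the sketch below the test signal (a decaying exponential restarted every 40 samples) and the rectangular 10-sample window are assumptions chosen only to make the effect visible.

```python
def short_time_energy(x, window, hop=1):
    """E[tau] = sum over n of |w[n] x[tau + n]|^2 for each window position tau."""
    Nw = len(window)
    return [sum((window[n] * x[tau + n]) ** 2 for n in range(Nw))
            for tau in range(0, len(x) - Nw + 1, hop)]

# Crude stand-in for a voiced waveform: a decaying pulse response every P = 40 samples
P = 40
x = [0.8 ** (n % P) for n in range(200)]

# Window shorter than one pitch period (10 samples, rectangular for simplicity)
E = short_time_energy(x, [1.0] * 10)
print("energy at a pulse onset :", E[0])
print("energy late in the cycle:", E[30])
print("repeats every pitch period:", abs(E[0] - E[P]) < 1e-12)
```

The energy is large when the window sits on a pulse onset and small late in the cycle, which is the fluctuation that appears as vertical striations in a wideband spectrogram.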

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  Shows the formants of the vocal tract in frequency, and also
  Gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram
    Vertical striations arise because the short window is sliding through fluctuating energy regions of the speech waveform
Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
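The 20-ms and 4-ms window lengths can be related to frequency resolution with a back-of-the-envelope calculation. The sketch below uses the common approximation that a Hamming window of $N_w$ samples has a null-to-null main-lobe width of about $8\pi/N_w$ radians, i.e. $4 f_s/N_w$ Hz. That approximation, the 10 kHz sampling rate, the 100 Hz pitch, and the "first null reaches no further than the neighboring harmonic" rule of thumb are all assumptions of this example.

```python
def hamming_mainlobe_hz(window_ms, fs):
    """Approximate null-to-null main-lobe width (Hz) of a Hamming window."""
    Nw = round(window_ms * 1e-3 * fs)   # window length in samples
    return 4.0 * fs / Nw

def harmonics_resolved(window_ms, f0_hz, fs):
    """Rough rule of thumb: half the main-lobe width must not exceed the pitch."""
    return hamming_mainlobe_hz(window_ms, fs) / 2.0 <= f0_hz

fs = 10000.0   # assumed sampling rate
f0 = 100.0     # assumed pitch, i.e. a 10 ms pitch period
for window_ms in (20, 4):
    print(window_ms, "ms window:",
          hamming_mainlobe_hz(window_ms, fs), "Hz main lobe,",
          "harmonics resolved" if harmonics_resolved(window_ms, f0, fs)
          else "harmonics smeared")
```

With these numbers the 20-ms window spans two pitch periods and just separates the harmonics, while the 4-ms window is shorter than a pitch period and smears them into the envelope.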

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000) for the utterance "She had your dark suit in greasy wash water all year", annotated with phone-level and word-level labels]

MATLAB
[Figure: spectrogram, same axes as above]

MATLAB
[Figure: waveform and spectrogram with phone-level and word-level labels, same utterance and axes as above]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Classification of speech sounds can also be done from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
  2. The shape of the vocal tract (place and manner of articulation)
     The place of the tongue hump along the oral tract and
     The degree of the constriction of the hump
     The shape is also determined by a possible connection to the nasal passage by way of the velum
  3. The time-domain waveform, which gives the pressure change with time at the lips output
  4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language
Phoneme: a fundamental distinctive unit of a language
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme
  Different languages contain different phoneme sets
    Syllables contain one or more phonemes
    Words are formed from one or more syllables
    Phrases are concatenations of words
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of
  Vowels
  Consonants
  Diphthongs
  Affricates and
  Semi-vowels
This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words
A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
The articulatory properties are influenced by
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking and
  The time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the t in each word is somewhat different
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels
Vowels
  Source: quasi-periodic
    Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch
  System: each vowel phoneme corresponds to a different vocal tract configuration
  Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram)
  Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next)
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variations
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, vary with the speaker, and therefore so do the vocal tract formants

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Nasals
  Source: quasi-periodic airflow puffs from the vibrating vocal folds
  System
    The velum is lowered and the air flows mainly through the nasal cavity
    Because the oral tract is constricted, the sound is radiated at the nostrils
    Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20)
  Spectrogram
    Is dominated by the low resonance of the large volume of the nasal cavity
    The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue
    These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract
    Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract
    Consequently, nasals have very low energy in the high-frequency range

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
  Source
    For unvoiced fricatives, the vocal folds are relaxed and not vibrating
    For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction
    Noise is generated by turbulent airflow at some point of constriction along the oral tract
    The constriction is narrower than with vowels
  System: the location of the constriction made by the tongue or lips determines which sound is produced
    Back, center, or front of the oral tract, as well as
    The teeth or lips
  Spectrogram
    Noise-like
    Energy is concentrated in the higher frequencies

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

  $g[n]$ is the glottal flow over one cycle
  $p[n]$ is an impulse train with pitch period $P$
In a simplified model of a voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

  $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$
The noise source component of the turbulent airflow velocity at the constriction is denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

Example 3.4
We assume in this simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.
Issues that have been ignored:
  $u[n]$ is modified by the oral cavity
  $x_q[n]$ can be influenced by the back cavity
  Sources of nonlinear effects (distributed sources due to traveling vortices)
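The simplified voiced-fricative model of Example 3.4 can be sketched in a few lines. The code below builds $u[n]$, modulates Gaussian noise by it, and adds the two filtered components. The helper names, the Gaussian choice for the white noise $q[n]$, and the toy impulse responses are assumptions made for this illustration.

```python
import random

def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def voiced_fricative(g, P, n_periods, h, hf, seed=0):
    """Return (x, u) with x[n] = h*u + hf*(q.u) and u = g*p, truncated to P*n_periods samples."""
    rng = random.Random(seed)
    N = P * n_periods
    p = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # impulse train, period P
    u = convolve(g, p)[:N]                               # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(N)]          # white noise (assumed Gaussian)
    xg = convolve(h, u)[:N]                              # periodic component
    xq = convolve(hf, [qn * un for qn, un in zip(q, u)])[:N]  # noise modulated by u
    return [a + b for a, b in zip(xg, xq)], u

# Toy example: 2-sample glottal pulse, pitch period 10, short decaying responses
x, u = voiced_fricative(g=[1.0, 0.5], P=10, n_periods=3,
                        h=[1.0, 0.6, 0.36], hf=[1.0, -0.5])
print(x[:12])
```

Because the noise is multiplied by $u[n]$ before filtering, the noisy component is present only while the glottis is open, which is the modulation the example describes.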

Elements of a Language: Fricatives
  Spectrogram
    Unvoiced fricatives are characterized by a "noisy" spectrum, while
    Voiced fricatives often show both noise and harmonics
  Waveform
    An unvoiced fricative contains only noise
    A voiced fricative contains noise superimposed on a quasi-periodic signal

Elements of a Language: Whisper
Whisper
  Forms a class of its own under the general category of consonants
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
  System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24)
  Sequence of events
    1. Complete closure of the oral tract and buildup of air pressure
    2. Release of air pressure and generation of turbulence over a very short time duration
    3. Generation of aspiration due to turbulence at the open vocal folds
    4. Onset of the following vowel about 40-50 ms after the burst
  With voiced plosives, the vocal folds vibrate for the duration of all 4 steps
    While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar"
    After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration
    There is a much shorter delay between the burst and the voicing of the vowel onset
Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n=0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

  $g[n]$ is the glottal flow over one cycle
  $p[n]$ is an impulse train with pitch period $P$
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
  This implies that the vocal tract output cannot be obtained by the convolution operator
  The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n,m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n,m]$.
  $h[n,m]$ and $h_f[n,m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n-m$
The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\, u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\, \delta[n-m]$$

We have assumed that the two outputs can be linearly combined.
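The generalized convolution of Example 3.5 can be evaluated directly once $h[n,m]$ and $h_f[n,m]$ are given as functions. In the sketch below they are passed in as Python callables; the causal restriction ($0 \le m \le n$) and the toy responses used in the example are assumptions of this illustration. With the burst at $n = 0$, the second sum reduces to $h_f[n,n]$.

```python
def time_varying_output(u, h, hf):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) delta[n-m], for n = 0..len(u)-1.

    h(n, m), hf(n, m): response at time n to a unit sample applied m samples earlier.
    Only causal terms (0 <= m <= n) are summed; this is an assumption of the sketch.
    """
    x = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m] for m in range(n + 1))
        burst = hf(n, n)   # delta[n - m] selects the single term m = n
        x.append(periodic + burst)
    return x

# Toy example: the decay rate of the tract response drifts with time n,
# mimicking a vocal tract that changes shape after the burst.
u = [1.0 if n % 8 == 0 else 0.0 for n in range(24)]          # idealized periodic source
h = lambda n, m: (0.5 + 0.01 * n) ** m                        # slowly changing response
hf = lambda n, m: 0.3 ** m                                    # front-cavity response
print(time_varying_output(u, h, hf)[:10])
```

When `h(n, m)` does not depend on `n`, the first sum collapses to ordinary convolution, which is a convenient sanity check.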

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds
  Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations
  Characterized by movement from one vowel target to another
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new"

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (/w/ as in "we" and /y/ as in "you") and
  Liquids (/r/ as in "read" and /l/ as in "let")
Glides, compared to diphthongs, have
  Greater constriction of the oral tract during the transition and
  Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have
  A fricative portion preceded by a complete constriction of the oral cavity
  The constriction formed at the same place as for the plosive
Examples
  /tS/ as in "chew" (unvoiced)
  /J/ as in "just" (voiced)

Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, and so the articulators at any time instant are influenced by where they have been and where they are going
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep"
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different
  Meaning
  Stress and
  Emotion

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 13: Speech Processing

Anatomy and Physiology of Speech Production

Three primary states of the vocal folds:
  • Breathing: the arytenoid cartilages are held outward.
  • Voiced: the arytenoid cartilages are held close together.
  • Unvoiced: the arytenoid cartilages are held outward or partially closed.

The complex motion of the vocal folds is illustrated in Figure 3.4.

Nonlinear two-mass model of Flanagan et al. (Figure 3.5).

Dictionary: arytenoid (ar·y·te·noid), pronounced ˌa-rə-ˈtē-ˌnoid or ə-ˈri-tən-ˌoid; adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle-shaped", from arytaina, "ladle"; circa 1751. (1) Relating to or being either of two small laryngeal cartilages to which the vocal cords are attached. (2) Relating to or being either of a pair of small muscles or an unpaired muscle of the larynx. Also used as a noun.

Anatomy and Physiology of Speech Production: Flanagan et al. model

Anatomy and Physiology of Speech Production

If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be approximately similar to that of Figure 3.6:
  • Closed phase: the folds are closed and no flow occurs.
  • Open phase: the folds are open and the flow increases up to a maximum.
  • Return phase: the time interval from the maximum airflow until the glottal closure.

The specific flow shape can change with:
  • the speaker,
  • the speaking style, and
  • the specific speech sound.

The glottal airflow velocity is referred to as the glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
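As a small worked illustration of the last two definitions, the sketch below converts a pitch period measured in samples into a fundamental frequency in Hz. The 8 kHz sampling rate, the 80-sample period, and the function name are assumptions chosen for the example; they do not come from the slides.

```python
def pitch_hz(period_samples, fs_hz):
    """Fundamental frequency in Hz as the reciprocal of the pitch period."""
    period_seconds = period_samples / fs_hz
    return 1.0 / period_seconds


# Assumed example values: 8 kHz sampling and a pitch period of 80 samples (10 ms).
f0 = pitch_hz(80, 8000)
print(f0)
```

A shorter pitch period gives a higher pitch: halving the period to 40 samples doubles the result.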

Anatomy and Physiology of Speech Production


Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow waveform over a single cycle and $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$ is an impulse train with spacing $P$.

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \right]$$

$$U(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = (2\pi/P)k$, with $2\pi/P$ the fundamental frequency, or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.

Figure 3.7 also shows the effect of the harmonics of the glottal waveform on the spectrum.

Figure 3.7

Example 3.1 (continued)

  • A decrease in the pitch period causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
  • The first harmonic is also the fundamental frequency.
  • At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.

The magnitude of the spectral shaping function, i.e. of the glottal flow, $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
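The harmonic structure described in Example 3.1 can be checked numerically. The sketch below is a minimal illustration under assumed values: a toy decaying pulse stands in for $g[n]$, the pitch period is 20 samples, and the Hamming window spans 10 periods. None of these numbers come from the slides. It computes a plain DFT of the windowed segment and compares the harmonic peaks with $|G(\omega_k)|$ scaled by the window gain.

```python
import cmath
import math


def hamming(n_w):
    """Symmetric Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def dft_magnitude(x, bins):
    """Magnitude of the DFT of x at the requested bin indices."""
    n_total = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * m * n / n_total)
                    for n in range(n_total))) for m in bins]


def dtft(x, omega):
    """Fourier transform of a finite sequence at radian frequency omega."""
    return sum(x[n] * cmath.exp(-1j * omega * n) for n in range(len(x)))


P = 20                                # assumed pitch period in samples
N = 200                               # assumed window length: 10 pitch periods
g = [0.8 ** m for m in range(P)]      # toy single-cycle pulse (an assumption)
u = [g[n % P] for n in range(N)]      # g[n] * p[n] when g is no longer than P
w = hamming(N)
segment = [w[n] * u[n] for n in range(N)]

mag = dft_magnitude(segment, range(N // 2))
spacing = N // P                      # harmonic spacing in DFT bins (2*pi/P rad)

for k in range(5):
    predicted = abs(dtft(g, 2 * math.pi * k / P)) * sum(w) / P
    print(k, round(mag[k * spacing], 3), round(predicted, 3))
```

Each printed row pairs the measured magnitude at harmonic $k$ with the value predicted from the envelope, so the two columns should track each other closely. Shortening `P` moves the harmonics further apart in `mag`.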

Anatomy and Physiology of Speech Production

The Fourier transform of the periodic glottal waveform is characterized by harmonics.
  • Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
  • The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is idealized: even for sustained voicing, a fixed pitch period is almost never maintained in time.
  • The pitch period can vary "randomly" over successive periods: pitch "jitter".
  • The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".

These variations are due (perhaps) to:
  • time-varying characteristics of the vocal tract and vocal folds,
  • nonlinear behavior in the speech anatomy, or
  • an underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are one component that gives vowels their naturalness.
  • In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  • Voice character depends on the extent of jitter and shimmer in the voice (e.g. a hoarse voice).
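One way to put numbers on jitter and shimmer is sketched below. The measure used here, the mean absolute difference between consecutive cycles divided by the mean value, is a common convention but is an assumption of this example: the slides do not define a specific formula. The period and amplitude lists are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


periods = [100, 102, 98, 100]        # hypothetical pitch periods in samples
amplitudes = [1.0, 0.9, 1.1, 1.0]    # hypothetical per-cycle peak amplitudes

jitter = relative_perturbation(periods)
shimmer = relative_perturbation(amplitudes)
print(jitter, shimmer)
```

A perfectly periodic source, such as the idealized model of Example 3.1, gives zero for both quantities.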

Anatomy and Physiology of Speech Production

States of the vocal folds:
  • Breathing
  • Voicing
  • Unvoicing: turbulence at the vocal folds, called aspiration. Examples: "he"; whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:
  • Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
  • Vocal fry: the vocal folds are massive and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, the result of coupling of the false vocal folds with the true vocal folds.
  • Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract

The vocal tract comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators:
  • tongue
  • teeth
  • lips
  • jaw

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to:
  • spectrally "color" the source, and
  • generate new sources for sound production.

Spectral Shaping

Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
  • The frequency of the formant is $\omega_0$.
  • The bandwidth depends on the distance of the pole from the unit circle (i.e. on $r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
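To make the pole-to-formant relation concrete, the sketch below takes one complex-conjugate pole pair and reports the corresponding formant frequency and an approximate bandwidth. The bandwidth formula $B \approx -\ln(r_0)\, f_s / \pi$ is a commonly used approximation for poles near the unit circle and is an assumption here, as are the sampling rate and pole values; the slides state only that bandwidth depends on the distance from the unit circle.

```python
import cmath
import math


def formant_from_pole(r0, omega0, fs_hz):
    """Formant frequency and approximate 3 dB bandwidth (both in Hz) of a pole."""
    freq_hz = omega0 * fs_hz / (2 * math.pi)
    bandwidth_hz = -math.log(r0) * fs_hz / math.pi   # approximation, valid for r0 near 1
    return freq_hz, bandwidth_hz


def magnitude_response(r0, omega0, omega):
    """|H| on the unit circle for a single conjugate pole pair with gain A = 1."""
    c = r0 * cmath.exp(1j * omega0)
    z_inv = cmath.exp(-1j * omega)
    return abs(1.0 / ((1 - c * z_inv) * (1 - c.conjugate() * z_inv)))


fs = 8000.0                                 # assumed sampling rate in Hz
r0, omega0 = 0.97, 2 * math.pi * 500 / fs   # assumed pole: radius 0.97, angle at 500 Hz

freq, bandwidth = formant_from_pole(r0, omega0, fs)

# Locate the spectral peak on a 1 Hz grid; it should sit close to the pole angle.
peak_hz = max(range(0, 4001),
              key=lambda f: magnitude_response(r0, omega0, 2 * math.pi * f / fs))
print(freq, bandwidth, peak_hz)
```

Moving `r0` closer to 1 narrows the bandwidth and sharpens the peak, which matches the qualitative statement above.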

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
  • Male speakers tend to have lower formants than female speakers.
  • Female speakers have lower formants than children.

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
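The convolution relation can be sketched in a few lines. Everything numeric below is an assumption made for illustration: a raised-sine toy pulse for the glottal flow, a 40-sample pitch period, and a truncated two-pole resonator standing in for the vocal tract impulse response.

```python
import math


def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


P = 40                                            # assumed pitch period in samples
p = [1.0 if n % P == 0 else 0.0 for n in range(5 * P)]     # impulse train
g = [math.sin(math.pi * m / 16) ** 2 for m in range(16)]   # toy glottal pulse

# Truncated impulse response of a two-pole resonator (a stand-in vocal tract).
r0, omega0 = 0.95, 2 * math.pi * 0.1
h = [r0 ** n * math.sin(omega0 * (n + 1)) / math.sin(omega0) for n in range(60)]

u = convolve(g, p)      # glottal flow input: g[n] * p[n]
x = convolve(h, u)      # vocal tract output: h[n] * u[n]
print(len(u), len(x))
```

Because convolution is associative, filtering the impulse train with the combined response `convolve(h, g)` gives the same output.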

Vowels


Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained:

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} H(\omega)\, G(\omega)\, \delta(\omega - \omega_k) \right]$$

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency, or pitch.

Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega) G(\omega)|$, which consists of both glottal and vocal tract contributions (unlike Example 3.1, which has only the glottal contribution).

Example 3.2 (continued)

The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
  • the nature of the glottal flow waveform over a cycle, e.g. a gradual or abrupt closing, and
  • the manner in which the formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega) G(\omega)|$ only sparsely. This is especially the case for high-pitched speech.
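The sparse-sampling effect is easy to quantify. The sketch below counts how many harmonics fall below an assumed 4 kHz band edge and how far the nearest harmonic lies from an assumed formant at 500 Hz, for a low and a high pitch. All numbers and function names are illustrative assumptions.

```python
def harmonics_below(f0_hz, f_max_hz):
    """Harmonic frequencies k * f0 that lie at or below f_max_hz."""
    return [k * f0_hz for k in range(1, int(f_max_hz // f0_hz) + 1)]


def nearest_harmonic_gap(f0_hz, formant_hz, f_max_hz):
    """Distance in Hz from a formant to the closest source harmonic."""
    return min(abs(h - formant_hz) for h in harmonics_below(f0_hz, f_max_hz))


formant = 500      # assumed first-formant frequency in Hz
band_edge = 4000   # assumed analysis bandwidth in Hz

for f0 in (100, 350):
    count = len(harmonics_below(f0, band_edge))
    gap = nearest_harmonic_gap(f0, formant, band_edge)
    print(f0, count, gap)
```

At the lower pitch a harmonic lands on the formant and the envelope is sampled densely. At the higher pitch far fewer samples are available and none falls on the formant peak, so the peak location is harder to read from $|X(\omega, \tau)|$.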

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum.
  • When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and they are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).

Two dominant effects characterize nasalization:
  • broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage, and
  • introduction of anti-resonances (i.e. zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of the vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  • build-up of pressure behind the palate, and
  • abrupt release of the pressure.

Source Generation: Plosives, "Drop" (figure labels: VOT, aspiration)

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (so that the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g. fricatives).

As with a periodic glottal source, spectral shaping also occurs for either of these input types (i.e. an impulse or a noise source):
  • There is no harmonic structure with these types of input.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source.

There are a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  • Plosives: created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
  • Fricatives: "zebra" (voiced) vs. "sheba" (unvoiced)
  • Plosives: "bin" (voiced) vs. "pin" (unvoiced)

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  • This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example: the Hamming window, $w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n \le N_w - 1$.

The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
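A minimal sketch of the windowed transform follows, written in plain Python rather than with a signal-processing library. The frame length, hop size, and test tone are assumptions for illustration. The code indexes each frame from its own start instead of from absolute time, which changes only the phase of each transform and not the squared magnitude used for a spectrogram.

```python
import cmath
import math


def hamming(n_w):
    """Symmetric Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft(x, n_w, hop, n_bins):
    """Short-time Fourier transform: one DFT per window position."""
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        seg = [w[n] * x[tau + n] for n in range(n_w)]   # windowed segment
        frames.append([sum(seg[n] * cmath.exp(-2j * math.pi * m * n / n_w)
                           for n in range(n_w)) for m in range(n_bins)])
    return frames


def spectrogram(frames):
    """Squared magnitude of each short-time transform."""
    return [[abs(v) ** 2 for v in frame] for frame in frames]


# Assumed test signal: a cosine centred exactly on DFT bin 8 of a 64-point frame.
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
S = spectrogram(stft(x, n_w=64, hop=32, n_bins=33))
peak_bins = [max(range(33), key=lambda m: frame[m]) for frame in S]
print(len(S), peak_bins)
```

For this stationary tone every frame peaks at the same bin. A signal whose frequency content changes over time would show different peak bins from frame to frame, which is the variability a single long transform cannot reveal.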

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.

For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrogram:
  • Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation holds for the equation on the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The wider window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
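The two window durations quoted for Figure 3.15 can be compared with a short calculation. The sketch assumes a 100 Hz pitch, which is not stated on the slide, and uses the standard result that a Hamming window of $N_w$ samples has a main lobe about $8\pi/N_w$ rad wide, which is $4/T$ Hz for a window lasting $T$ seconds.

```python
def hamming_main_lobe_width_hz(duration_s):
    """Approximate null-to-null main-lobe width of a Hamming window, in Hz."""
    return 4.0 / duration_s


def pitch_periods_under_window(duration_s, f0_hz):
    """How many pitch periods the window spans."""
    return duration_s * f0_hz


f0 = 100.0   # assumed pitch in Hz (pitch period of 10 ms)

for label, duration in (("narrowband", 0.020), ("wideband", 0.004)):
    width = hamming_main_lobe_width_hz(duration)
    periods = pitch_periods_under_window(duration, f0)
    print(label, width, periods, width / f0)
```

Under this assumption the 20-ms window spans two pitch periods and its main lobe covers about two harmonic spacings. The 4-ms window spans less than half a period and its main lobe covers about ten harmonic spacings, so neighboring harmonics merge into the envelope.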

Figure 3.15

Figure 3.16

MATLAB

[Figure: SPECTROGRAM. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s]) above its SPECTROGRAM (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr).]

MATLAB

[Figure: SPECTROGRAM, same axes as above.]

MATLAB

[Figure: waveform and labeled SPECTROGRAM of the same utterance, same axes and labels as above.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract, i.e. the place and manner of articulation: the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two factors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of:
  • vowels
  • consonants
  • diphthongs
  • affricates
  • semi-vowels

This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open.
  • Tongue position and height: front, central, or back along the palate.
  • Constriction: partial or complete.
  • Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the "click" language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words.
  • In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by:
  • adjacent phonemes,
  • speaking rate,
  • emphasis in speaking, and
  • the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  • Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features inferred from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

  • Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  • Articulatory differences between speakers are one cause of allophonic variations.
  • The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, vary with the speaker; therefore the vocal tract formants also vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.

The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract.
  • Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

  • Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. The noise is generated by turbulent airflow at some point of constriction along the oral tract; the constriction is narrower than with vowels.
  • System: the location of the constriction made by the tongue or lips determines which sound is produced: at the back, center, or front of the oral tract, or at the teeth or lips.
  • Spectrogram: noise-like, with energy concentrated at higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

A simplified model of the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component, the turbulent airflow velocity at the constriction, is denoted by $q[n]$ and is assumed to be white noise. The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

In the simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of nonlinear effects (distributed sources due to traveling vortices).
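The two-source model of Example 3.4 can be sketched directly from its equations. The pulse shape, pitch period, impulse responses, and the Gaussian noise generator below are all assumptions chosen to keep the example short; only the structure $x[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$ comes from the slides.

```python
import random


def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def periodic_glottal_flow(g, period, n_samples):
    """u[n] = g[n] * p[n]: one copy of the pulse g at every pitch period."""
    u = [0.0] * n_samples
    for start in range(0, n_samples, period):
        for m, gm in enumerate(g):
            if start + m < n_samples:
                u[start + m] += gm
    return u


def voiced_fricative(h, h_f, u, q):
    """Return (x, x_g, x_q), each truncated to the length of u."""
    x_g = convolve(h, u)[:len(u)]                       # periodic component
    modulated = [qn * un for qn, un in zip(q, u)]       # glottal flow gates the noise
    x_q = convolve(h_f, modulated)[:len(u)]             # noise component
    return [a + b for a, b in zip(x_g, x_q)], x_g, x_q


random.seed(0)
g = [0.0, 0.5, 1.0, 0.5]                          # toy glottal pulse
u = periodic_glottal_flow(g, period=10, n_samples=50)
q = [random.gauss(0.0, 1.0) for _ in range(50)]   # white-noise stand-in
h = [0.9 ** n for n in range(20)]                 # toy vocal tract response
h_f = [(-0.5) ** n for n in range(8)]             # toy front-cavity response

x, x_g, x_q = voiced_fricative(h, h_f, u, q)
print(len(x), x[:5])
```

Because the noise is multiplied by $u[n]$, it is present only while the glottis passes flow. Setting the noise to zero reduces the model to the periodic component alone.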

Elements of a Language: Fricatives

  • Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  • Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Whisper forms a class of its own under the general category of consonants.
  • The turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24: Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
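The generalized convolution of Example 3.5 can be tried numerically. The Python sketch below is illustrative only: the function name `time_varying_response`, the toy one-pole responses, and all numeric values are assumptions rather than part of the original example, and g[n] is taken to be a unit sample so that u[n] reduces to the impulse train.

```python
import numpy as np

def time_varying_response(h, u):
    """Return x[n] = sum_m h[n, m] * u[n - m] for a causal input u.

    h is an (N, M) array: row n is the impulse response seen at output time n,
    column m is the lag (the unit sample was applied m samples earlier).
    """
    N, M = h.shape
    x = np.zeros(N)
    for n in range(N):
        for m in range(min(M, n + 1)):   # u[n - m] is zero for n - m < 0
            x[n] += h[n, m] * u[n - m]
    return x

N, M, P = 64, 16, 8
lags = np.arange(M)

# Hypothetical vocal tract: a one-pole decay whose pole drifts from 0.5 to 0.9
# as the tract moves from the burst toward the vowel.
a = np.linspace(0.5, 0.9, N)
h = a[:, None] ** lags[None, :]                 # h[n, m] = a[n] ** m

# Hypothetical front-cavity response, kept fixed over time for simplicity.
hf = np.tile(0.3 * 0.6 ** lags, (N, 1))         # hf[n, m]

u = np.zeros(N)
u[::P] = 1.0                                    # periodic source (g[n] = unit sample)
burst = np.zeros(N)
burst[0] = 1.0                                  # burst idealized as delta[n]

# Linear combination of the two outputs, as assumed in the example.
x = time_varying_response(h, u) + time_varying_response(hf, burst)
```

When every row of `h` is identical, the function reduces to ordinary convolution, which is a convenient sanity check for the time-invariant case.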

Elements of a Language: Transitional Speech Sounds

Diphthongs:
- Vowel-like in nature, with vibrating vocal folds.
- Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
- Characterized by movement from one vowel target to another.

Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds:
- Glides (/w/ as in "we" and /y/ as in "you")
- Liquids (/r/ as in "read" and /l/ as in "let")

Compared to diphthongs, glides have:
- Greater constriction of the oral tract during the transition
- Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).

Coarticulation

- Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
- Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
- Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
- Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe"; "sweep" vs. "seep".
- Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
- Intonation (change in pitch)
- Amplitude/energy (loudness)
- Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Anatomy and Physiology of Speech Production: Flanagan et al. model

Anatomy and Physiology of Speech Production

If one were to measure the airflow velocity at the glottis as a function of time, the obtained waveform would be approximately similar to that of Figure 3.6:
- Closed phase: the folds are closed and no flow occurs.
- Open phase: the folds are open and the flow increases up to a maximum.
- Return phase: the time interval from the maximum airflow until the glottal closure.

The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

The glottal airflow velocity is referred to as the glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.

Anatomy and Physiology of Speech Production

Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain, with $\omega_k = (2\pi/P)k$:

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega - \omega_k)\right]$$

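Example 3.1

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega - \omega_k)\right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency, or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$, with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum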

Figure 3.7

Example 3.1

- A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
- The first harmonic is also the fundamental frequency.
- At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
- The magnitude of the spectral shaping function, i.e., of the glottal flow, $|G(\omega_k)|$, is referred to as the spectral envelope of the harmonics.
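The relation between pitch period and harmonic spacing in Example 3.1 can be checked numerically. This is a minimal sketch under stated assumptions: the sampling rate, the Hanning-shaped stand-in for one glottal pulse g[n], the four-period Hamming window, and the helper name `first_harmonic_hz` are all illustrative choices, not taken from the slides.

```python
import numpy as np

def first_harmonic_hz(P, fs=8000, nfft=8192):
    """Locate the first harmonic of a windowed pulse train u[n] = g[n] * p[n]."""
    g = np.hanning(41)                      # assumed shape of one glottal pulse
    p = np.zeros(10 * P)
    p[::P] = 1.0                            # impulse train with spacing P
    u = np.convolve(g, p)[: len(p)]         # periodic glottal flow

    Nw = 4 * P                              # window spanning four pitch periods
    seg = u[2 * P : 2 * P + Nw] * np.hamming(Nw)   # windowed segment u[n, tau]

    U = np.abs(np.fft.rfft(seg, nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)

    f0 = fs / P                             # expected pitch in Hz
    band = (freqs > 0.5 * f0) & (freqs < 1.5 * f0)
    return freqs[band][np.argmax(U[band])]

f_long = first_harmonic_hz(80)    # pitch period of 80 samples
f_short = first_harmonic_hz(40)   # halved pitch period
print(f"P=80: {f_long:.1f} Hz, P=40: {f_short:.1f} Hz")
```

Halving the pitch period should roughly double the frequency of the first harmonic, and with it the spacing between all harmonics.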

Anatomy and Physiology of Speech Production

The Fourier transform of a periodic glottal waveform is characterized by harmonics.
- Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
- The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
- The pitch period can "randomly" vary over successive periods: pitch "jitter".
- The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".

Those variations are (perhaps) due to:
- Time-varying characteristics of the vocal tract and vocal folds
- Nonlinear behavior in the speech anatomy
- An underlying deterministic (chaotic) system whose output appears random

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
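Jitter and shimmer can be illustrated by perturbing the ideal impulse train of Example 3.1. The sketch below is an assumption-laden toy: the function name `pulse_train`, the Gaussian perturbation model, and the 2% and 5% values are hypothetical and are not measurements from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def pulse_train(n_periods, P, jitter=0.0, shimmer=0.0):
    """Impulse train whose period and amplitude vary from cycle to cycle.

    jitter and shimmer are relative standard deviations of the pitch period
    and pulse amplitude (a Gaussian model is assumed here).
    """
    periods = np.round(P * (1 + jitter * rng.standard_normal(n_periods)))
    periods = np.maximum(1, periods).astype(int)
    amps = 1 + shimmer * rng.standard_normal(n_periods)
    onsets = np.concatenate(([0], np.cumsum(periods)[:-1]))
    p = np.zeros(onsets[-1] + periods[-1])
    p[onsets] = amps
    return p, periods, amps

# Ideal (machine-like) source: fixed period and fixed amplitude.
ideal, per0, amp0 = pulse_train(20, 80)

# Perturbed source: small random variation in period and amplitude.
natural, per1, amp1 = pulse_train(20, 80, jitter=0.02, shimmer=0.05)
```

Convolving either train with a glottal pulse g[n] gives the corresponding glottal flow; only the second exhibits cycle-to-cycle variation.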

Anatomy and Physiology of Speech Production

States of the vocal folds:
- Breathing
- Voicing
- Unvoicing: turbulence at the vocal folds, called aspiration. Example: "he"; whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while another part is nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:
- Creaky voice: very tense vocal folds, with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract

- Comprised of the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
- The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
- The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to:
- Spectrally "color" the source
- Generate new sources for sound production

Spectral Shaping

Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants. For a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
- The frequency of the formant is $\omega_0$.
- The bandwidth depends on the distance of the pole from the unit circle ($r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
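The link between a pole location and the resulting formant can be illustrated with a single conjugate pole pair. This is a minimal sketch: the sampling rate, the 500 Hz formant, the pole radius 0.97, and the bandwidth formula (a common approximation for poles near the unit circle) are assumptions introduced for illustration.

```python
import numpy as np

fs = 8000.0                          # sampling rate in Hz (assumed)
F0, r0 = 500.0, 0.97                 # hypothetical formant frequency and pole radius
w0 = 2 * np.pi * F0 / fs
c = r0 * np.exp(1j * w0)             # pole c_k; its conjugate keeps the filter real

# Denominator (1 - c z^-1)(1 - c* z^-1) = 1 - 2 r0 cos(w0) z^-1 + r0^2 z^-2
a = np.array([1.0, -2 * r0 * np.cos(w0), r0 ** 2])

# Evaluate H(z) on the upper half of the unit circle.
w = np.linspace(0, np.pi, 4096)
z_inv = np.exp(-1j * w)
H = 1.0 / (a[0] + a[1] * z_inv + a[2] * z_inv ** 2)

peak_hz = w[np.argmax(np.abs(H))] * fs / (2 * np.pi)

# Approximate 3 dB bandwidth for a pole close to the unit circle.
bandwidth_hz = -np.log(r0) * fs / np.pi

print(f"spectral peak near {peak_hz:.0f} Hz, bandwidth about {bandwidth_hz:.0f} Hz")
```

Moving `r0` closer to 1 narrows the resonance, while changing `w0` moves the peak, which mirrors the two bullet points about formant frequency and bandwidth.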

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: $F_1$, $F_2$, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
- Male speakers tend to have lower formants than female speakers.
- Female speakers tend to have lower formants than children.

Under the assumption of vocal tract linearity and time-invariance, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
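The dependence of formants on vocal tract length can be illustrated with the classical idealization of the tract as a lossless uniform tube, closed at the glottis and open at the lips. That tube model, the speed of sound value, and the 14 cm comparison length are assumptions added here for illustration; only the 17 cm adult male length comes from the slides.

```python
SPEED_OF_SOUND = 340.0  # m/s, approximate value in air (assumed)

def uniform_tube_formants(length_m, count=3):
    """Resonances of a lossless uniform tube closed at one end and open at the
    other (quarter-wavelength resonator): F_k = (2k - 1) * c / (4 L)."""
    return [(2 * k - 1) * SPEED_OF_SOUND / (4 * length_m)
            for k in range(1, count + 1)]

adult_male = uniform_tube_formants(0.17)   # 17 cm tract
shorter = uniform_tube_formants(0.14)      # hypothetical shorter tract

print([round(f) for f in adult_male])
print([round(f) for f in shorter])
```

Under this idealization every formant scales inversely with tract length, consistent with the ordering of male, female, and child formants stated above.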

Vowels


Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained.

Example 3.2

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[H(\omega)\,G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k)\right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\,G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency, or pitch.

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only the glottal contribution).

Example 3.2 (Figure 3.11)

Example 3.2

The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
- The nature of the glottal flow waveform over a cycle, e.g., a gradual or abrupt closing
- The manner in which formant tails add

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
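The sparse sampling of the spectral envelope by the harmonics comes down to simple arithmetic, sketched below. The 4 kHz band, the two pitch values, and the 550 Hz formant are hypothetical numbers chosen for illustration, and the helper names are not from the slides.

```python
import numpy as np

def harmonics(f0_hz, fmax_hz):
    """Harmonic frequencies k * f0 up to fmax."""
    return f0_hz * np.arange(1, int(fmax_hz // f0_hz) + 1)

def nearest_harmonic_offset(f0_hz, formant_hz, fmax_hz=4000.0):
    """Distance in Hz from a formant to the closest source harmonic."""
    return float(np.min(np.abs(harmonics(f0_hz, fmax_hz) - formant_hz)))

low = harmonics(100.0, 4000.0)     # low-pitched voice: many envelope samples
high = harmonics(400.0, 4000.0)    # high-pitched voice: few envelope samples

# Hypothetical formant at 550 Hz: how far away is the nearest harmonic?
offset_low = nearest_harmonic_offset(100.0, 550.0)
offset_high = nearest_harmonic_offset(400.0, 550.0)
print(len(low), len(high), offset_low, offset_high)
```

With the higher pitch, far fewer harmonics fall in the band and the nearest one lies much farther from the formant, so the formant peak is poorly represented in the spectrum.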

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
- A formant corresponds to a vocal tract pole (resonant frequency).
- Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency), $\omega_1$, is much higher than the first formant frequency ($F_1$) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:
- Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage
- Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation

The previous section discussed the effect of the vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
- Build-up of pressure behind the palate
- Abrupt release of the pressure

Source Generation: Plosives, "Drop"

[Figure annotations: VOT, Aspiration]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
- There is no harmonic structure with these types of inputs.
- The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
- This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
- Fricatives: sounds generated from the friction of moving air against an oral tract constriction. Example: "thin".
- Plosives: created with an impulsive source within the oral tract. Example: "top".
- Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these sound categories do not relate exclusively to the unvoiced source state. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
- Fricatives: "zebra" (voiced) vs. "sheba" (unvoiced)
- Plosives: "bin" (voiced) vs. "pin" (unvoiced)

Categorization of Sound by Source

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
- A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
- In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window. This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: the Hamming window

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window position at time $\tau$.
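The sliding-window analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the MATLAB code used for the course figures: the function name `stft`, the 1 kHz test tone, the 8 kHz sampling rate, and the 20 ms window with a 10 ms frame interval are all assumptions.

```python
import numpy as np

def stft(x, Nw, hop, nfft=None):
    """Short-time Fourier transform with a Hamming window of Nw samples
    that advances by `hop` samples (the frame interval)."""
    nfft = nfft or Nw
    n = np.arange(Nw)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (Nw - 1))   # Hamming window
    starts = range(0, len(x) - Nw + 1, hop)
    # One Fourier transform per window position tau.
    return np.array([np.fft.rfft(w * x[s:s + Nw], nfft) for s in starts])

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone (assumed input)

X = stft(x, Nw=160, hop=80, nfft=256)     # 20 ms window, 10 ms frame interval
S = np.abs(X) ** 2                        # squared magnitude of each frame
```

Each row of `X` is the transform for one window position, and each column corresponds to one frequency bin of width fs / nfft.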

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
- For each window position $\tau$, one could plot $S(\omega, \tau)$.
- A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
- Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
- Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
- Uses a "long" window with a duration of typically at least two pitch periods.
- Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation in the previous slide leads to the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

- The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- Widening the window Fourier transform causes neighboring window transforms to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 (next slide) compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
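The trade-off between the 20-ms and 4-ms windows can be made concrete with a small numerical sketch. Everything here is an illustrative assumption: the 8 kHz sampling rate, the 100 Hz pitch, the decaying 700 Hz resonance standing in for the combined glottal and vocal tract response, the rule of thumb that a Hamming main lobe is about 4 / duration wide, and the rectangular window used for the energy track.

```python
import numpy as np

fs = 8000                      # sampling rate in Hz (assumed)
P = 80                         # pitch period in samples, i.e. 100 Hz pitch

def hamming_mainlobe_hz(duration_s):
    """Approximate null-to-null main-lobe width of a Hamming window."""
    return 4.0 / duration_s

narrow_width = hamming_mainlobe_hz(0.020)   # about 200 Hz
wide_width = hamming_mainlobe_hz(0.004)     # about 1000 Hz, spans many harmonics

# Hypothetical voiced signal: impulse train through a decaying 700 Hz resonance.
n = np.arange(60)
h = 0.95 ** n * np.cos(2 * np.pi * 700 * n / fs)
p = np.zeros(fs)
p[::P] = 1.0
x = np.convolve(p, h)[: len(p)]

def energy_track(x, Nw, hop):
    """Energy under a sliding rectangular window of Nw samples."""
    return np.array([np.sum(x[s:s + Nw] ** 2)
                     for s in range(0, len(x) - Nw, hop)])

E = energy_track(x, Nw=round(0.004 * fs), hop=8)   # 4 ms window, 1 ms hop
print(narrow_width, wide_width, E.max() / E.min())
```

The short window's energy rises and falls once per pitch period, which is the origin of the vertical striations, while its wide main lobe covers several 100 Hz-spaced harmonics and so cannot separate them.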

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram, Frequency [Hz] (0 to 8000) versus Time [s] (0 to 3)]

[Figure: waveform (normalized magnitude) and spectrogram (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Classification of speech sounds can also be done from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of the constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
- To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
- Different languages contain different phoneme sets.
- Syllables contain one or more phonemes.
- Words are formed from one or more syllables.
- Phrases are concatenations of words.

If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of:
- Vowels
- Consonants
- Diphthongs
- Affricates
- Semi-vowels

This classification is illustrated in Figure 3.17 (next slide).

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
- Vocal fold state: vibrating or open
- Tongue position and height: front, central, or back along the palate
- Constriction: partial or complete
- Velum state: nasal sound or not a nasal sound

Elements of a Language

- In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
- The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
- A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
- The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
- The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
- Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic.
- Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.

System:
- Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram:
- The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform:
- Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variations:
- The place and degree of constriction of the tongue hump
- The cross-section and length of the vocal tract
Therefore, the vocal tract formants will vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System:
- The velum is lowered and the air flows mainly through the nasal cavity.
- Because the oral tract is constricted, the sound is radiated at the nostrils.
- Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
- It is dominated by the low resonance of the large volume of the nasal cavity.
- The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
- These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
- Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
- For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
- For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
- Noise is generated by turbulent airflow at some point of constriction along the oral tract.
- The constriction is narrower than with vowels.

System: the location of the constriction by the tongue or lips determines which sound is produced:
- back, center, or front of the oral tract, as well as
- the teeth or lips.

Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

xg[n] = h[n] * (g[n] * p[n])

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response hf[n]:

xq[n] = hf[n] * (q[n] u[n])

In the simplified model we assume that the results of the two airflow sources add:

x[n] = xg[n] + xq[n] = h[n] * u[n] + hf[n] * (q[n] u[n])

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
- u[n] is modified by the oral cavity,
- xq[n] can be influenced by the back cavity,
- sources of non-linear effects (distributed sources due to traveling vortices).
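The simplified voiced-fricative model of Example 3.4 can be sketched numerically. The following is only an illustration under stated assumptions: the pitch period, the shape of the glottal pulse g[n], and the two impulse responses h[n] and hf[n] are made-up placeholders, not values from the slides or from measured speech.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 80, 800                          # pitch period and length in samples (assumed)

p = np.zeros(N)
p[::P] = 1.0                            # impulse train p[n] with spacing P
k = np.arange(60)
g = (k / 10.0) * np.exp(-k / 10.0)      # made-up single-cycle glottal pulse g[n]
u = np.convolve(p, g)[:N]               # u[n] = g[n] * p[n] (convolution)

q = rng.standard_normal(N)              # white-noise turbulence source q[n]

# Placeholder impulse responses (decaying resonances), not measured data
m = np.arange(100)
h = 0.95 ** m * np.cos(2 * np.pi * 0.06 * m)    # whole vocal tract, h[n]
hf = 0.80 ** m * np.cos(2 * np.pi * 0.30 * m)   # front oral cavity, hf[n]

xg = np.convolve(h, u)[:N]              # periodic component: h[n] * u[n]
xq = np.convolve(hf, q * u)[:N]         # noise modulated by u[n], then filtered by hf[n]
x = xg + xq                             # the two components are assumed to add
```

Because the noise is multiplied by u[n] before filtering, the noise excitation vanishes wherever the glottal flow is zero, which is the modulation effect the model is meant to capture.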

Elements of a Language: Fricatives

Spectrogram:
- Unvoiced fricatives are characterized by a "noisy" spectrum, while
- voiced fricatives often show both noise and harmonics.

Waveform:
- An unvoiced fricative contains only noise.
- A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

- Whisper forms a class of its own under the general category of consonants.
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all 4 steps:
- While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
- After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration.
- The delay between the burst and the onset of voicing of the vowel is much shorter.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives

Example 3.5: a time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during its transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by hf[n, m].

h[n, m] and hf[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$

We have assumed that the two outputs can be linearly combined.
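As a rough sketch of the generalized convolution in Example 3.5, the double sum can be evaluated directly. Everything specific here is an assumption made for illustration: the functions `h` and `hf` are hypothetical responses (a resonance whose frequency glides over time, and a fixed front-cavity resonance), the responses are truncated to M samples, and the glottal pulse shape is made up.

```python
import numpy as np

N, M, P = 400, 64, 80                   # signal length, response length, pitch period (assumed)

p = np.zeros(N)
p[::P] = 1.0                            # impulse train p[n]
k = np.arange(40)
g = (k / 8.0) * np.exp(-k / 8.0)        # made-up glottal pulse g[n]
u = np.convolve(p, g)[:N]               # periodic source u[n] = g[n] * p[n]

def h(n, m):
    """Hypothetical response at time n to a unit sample applied m samples
    earlier: a resonance whose frequency glides as n advances."""
    w = 2 * np.pi * (0.05 + 0.05 * n / N)
    return 0.9 ** m * np.cos(w * m)

def hf(n, m):
    """Hypothetical front-cavity response (kept time-invariant for brevity)."""
    return 0.7 ** m * np.cos(2 * np.pi * 0.3 * m)

x = np.zeros(N)
for n in range(N):
    # First term: sum over m of h[n, m] u[n - m], truncated to M samples
    for m in range(min(n, M - 1) + 1):
        x[n] += h(n, m) * u[n - m]
    # Second term: delta[n - m] is nonzero only at m = n, so the sum is hf[n, n]
    if n < M:
        x[n] += hf(n, n)
```

If `h(n, m)` did not depend on n, the first loop would reduce to an ordinary convolution of h with u, which is a convenient sanity check for any concrete choice of response.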

Elements of a Language: Transitional Speech Sounds

Diphthongs:
- Have a vowel-like nature, with vibrating vocal folds.
- Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
- Are characterized by movement from one vowel target to another.

Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds:
- glides (w as in "we" and y as in "you"), and
- liquids (r as in "read" and l as in "let").

Glides, compared to diphthongs, have:
- greater constriction of the oral tract during the transition, and
- greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have:
- a fricative portion preceded by a complete constriction of the oral cavity,
- formed at the same place as for the plosive.

Examples:
- tS as in "chew" - unvoiced
- J as in "just" - voiced

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
- Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
- Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
- Local - the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs "horseshoe", "sweep" vs "seep".
- Global - articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
- intonation (change in pitch),
- amplitude/energy (loudness),
- timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Anatomy and Physiology of Speech Production

If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6:
- Closed phase: the folds are closed and no flow occurs.
- Open phase: the folds are open and the flow increases up to a maximum.
- Return phase: the time interval from the maximum airflow until the glottal closure.

The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

The glottal airflow is referred to as the glottal flow.

The time duration of one glottal cycle is referred to as the pitch period.

The reciprocal of the pitch period is referred to as the pitch, or fundamental frequency.

Anatomy and Physiology of Speech Production

Example 3.1

Consider a glottal flow waveform model of the form

u[n] = g[n] * p[n]

where g[n] is the glottal flow waveform over a single cycle, * denotes convolution, and p[n] is an impulse train with spacing P:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by w[n, τ], is centered at time τ, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as

u[n, τ] = w[n, τ] (g[n] * p[n])

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain (⊛ denotes convolution in frequency):

$$U(\omega,\tau) = \frac{1}{P}\,W(\omega,\tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega-\omega_k)\right]$$

Example 3.1

$$U(\omega,\tau) = \frac{1}{P}\,W(\omega,\tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega-\omega_k)\right]$$

$$U(\omega,\tau) = \frac{1}{P}\sum_{k=-\infty}^{\infty} G(\omega_k)\,W(\omega-\omega_k,\tau)$$

where W(ω) is the Fourier transform of w[n], G(ω) is the Fourier transform of g[n], ω_k = (2π/P)k, and 2π/P is the fundamental frequency or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0, with lower surrounding side lobes.

The figure also shows the effect of the harmonics of the glottal waveform on the spectrum.

Figure 3.7

Example 3.1

- A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k.
- The first harmonic is also the fundamental frequency.
- At each harmonic frequency there is a translated window Fourier transform W(ω - ω_k), weighted by G(ω_k).
- The magnitude of the spectral shaping function, i.e. of the glottal flow spectrum |G(ω)|, is referred to as the spectral envelope of the harmonics.
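A small numerical sketch of Example 3.1 follows: a windowed periodic glottal flow has spectral peaks at multiples of 2π/P. The pitch period, the segment length and in particular the glottal pulse shape are assumptions chosen for illustration; they are not taken from the slides.

```python
import numpy as np

P = 80                          # pitch period in samples (assumed)
n_periods = 8
N = P * n_periods               # segment length: a whole number of periods

p = np.zeros(N)
p[::P] = 1.0                    # p[n]: impulse train with spacing P

k = np.arange(60)
g = (k / 10.0) * np.exp(-k / 10.0)   # made-up single-cycle glottal pulse g[n]

u = np.convolve(p, g)[:N]       # u[n] = g[n] * p[n] (convolution)

w = np.hamming(N)               # analysis window
U = np.abs(np.fft.rfft(w * u))  # magnitude of the windowed spectrum

# Harmonic k sits at omega_k = (2*pi/P)*k, which is FFT bin k*N/P = k*n_periods
harmonic_bins = np.arange(1, 6) * n_periods
between_bins = harmonic_bins + n_periods // 2   # midway to the next harmonic
print(U[harmonic_bins] > U[between_bins])
```

Halving P in this sketch doubles the bin spacing between peaks, and the heights of the peaks follow |G(ω)| evaluated at the harmonics, which is the spectral envelope described above.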

Anatomy and Physiology of Speech Production

The Fourier transform of a periodic glottal waveform is characterized by harmonics.
- Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
- The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:
- The pitch period can vary "randomly" over successive periods: pitch "jitter".
- The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".

These variations are (perhaps) due to:
- time-varying characteristics of the vocal tract and vocal folds,
- nonlinear behavior in the speech anatomy, or
- an underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
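To make jitter and shimmer concrete, here is a minimal sketch that perturbs the ideal impulse train of Example 3.1. The 2% and 5% perturbation sizes and the use of independent Gaussian perturbations are assumptions for illustration only; as noted above, the real variations may come from a deterministic system rather than a random one.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 80               # nominal pitch period in samples (assumed)
n_pulses = 50
jitter = 0.02        # assumed: period varies by about 2% (standard deviation)
shimmer = 0.05       # assumed: pulse amplitude varies by about 5%

# Jitter: each glottal cycle gets its own slightly different period
periods = np.round(P * (1 + jitter * rng.standard_normal(n_pulses))).astype(int)
periods = np.maximum(periods, 1)         # guard against non-positive periods
onsets = np.cumsum(periods)              # sample index of each glottal pulse

# Shimmer: each pulse gets its own slightly different amplitude
amps = 1 + shimmer * rng.standard_normal(n_pulses)

p = np.zeros(onsets[-1] + 1)
p[onsets] = amps                         # perturbed impulse train
```

Convolving `p` with a single-cycle glottal pulse would give a source that is no longer exactly periodic; setting `jitter` and `shimmer` to zero recovers the ideal, machine-like impulse train.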

Anatomy and Physiology of Speech Production

States of the vocal folds:
- Breathing
- Voicing
- Unvoicing: turbulence at the vocal folds (aspiration). Example: "he"; whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part remains nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:
- Creaky voice - very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
- Vocal fry - the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of the coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice - secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract

The vocal tract comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to:
- spectrally "color" the source, and
- generate new sources for sound production.

Spectral Shaping

Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants (resonance frequencies) change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Consider a time-invariant all-pole linear system model of the vocal tract with a pole at z0 = r0 e^{jω0} that corresponds approximately to a vocal tract formant:
- The frequency of the formant is ω0.
- The bandwidth depends on the distance of the pole from the unit circle (set by r0).
- Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
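The link between a pole and a formant can be checked numerically for a single conjugate pole pair. This is a sketch under assumed values: the sampling rate, formant frequency and pole radius below are illustrative, and the bandwidth formula B ≈ -ln(r0)·fs/π is a common approximation for poles close to the unit circle that does not appear on the slide.

```python
import numpy as np

fs = 8000.0              # sampling rate in Hz (assumed)
formant_hz = 500.0       # formant frequency in Hz (assumed)
r0 = 0.97                # pole radius, inside the unit circle (assumed)
w0 = 2 * np.pi * formant_hz / fs

# One section of the product form: H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1))
c = r0 * np.exp(1j * w0)
w = np.linspace(0, np.pi, 4096)
z_inv = np.exp(-1j * w)
H = 1.0 / ((1 - c * z_inv) * (1 - np.conj(c) * z_inv))

# The spectral peak lies close to the pole angle w0
peak_hz = w[np.argmax(np.abs(H))] * fs / (2 * np.pi)

# Approximate 3-dB bandwidth for a pole near the unit circle
bandwidth_hz = -np.log(r0) * fs / np.pi
```

Moving `r0` toward 1 narrows the resonance, and moving it away from the unit circle broadens it, which is the dependence on pole radius described above.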

Spectral Shaping

The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
- Male speakers tend to have lower formants than female speakers.
- Female speakers have lower formants than children.

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
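The dependence of formants on tract length can be illustrated with the textbook uniform-tube approximation, which is not part of the slide and is used here purely as an assumption: a lossless tube closed at the glottis and open at the lips resonates at F_k = (2k - 1)c / (4L). The speed of sound and the shorter comparison length are likewise assumed values.

```python
c = 340.0   # assumed speed of sound in air, in m/s

def tube_formants(length_m, count=3):
    """Resonances in Hz of a uniform tube closed at one end and open at the
    other (quarter-wavelength resonator); an idealization of the vocal tract."""
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, count + 1)]

adult_male = tube_formants(0.17)   # 17 cm, the average adult male length quoted earlier
shorter = tube_formants(0.14)      # a shorter tract (assumed value for comparison)
```

With these numbers the 17 cm tube gives resonances near 500, 1500 and 2500 Hz, and every resonance of the shorter tube is higher, consistent with the ordering of male, female and child formants given above.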

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

u[n] = g[n] * p[n]

where g[n] is the airflow over one glottal cycle, p[n] is the unit sample train with spacing P, and * denotes convolution. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

x[n] = h[n] * (g[n] * p[n])

A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment

x[n, τ] = w[n, τ] {h[n] * (g[n] * p[n])}

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e. its frequency-domain representation, is obtained.

Example 3.2

$$X(\omega,\tau) = \frac{1}{P}\,W(\omega,\tau) \circledast \left[H(\omega)\,G(\omega)\sum_{k=-\infty}^{\infty} \delta(\omega-\omega_k)\right]$$

$$X(\omega,\tau) = \frac{1}{P}\sum_{k=-\infty}^{\infty} H(\omega_k)\,G(\omega_k)\,W(\omega-\omega_k,\tau)$$

where W(ω) is the Fourier transform of w[n], ω_k = (2π/P)k, and 2π/P is the fundamental frequency or pitch.

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).

Example 3.2

Example 3.2

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
- the nature of the glottal flow waveform over a cycle, e.g. a gradual or abrupt closing, and
- the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because of the sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
- A formant corresponds to a vocal tract pole (resonant frequency).
- Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

There are two dominant effects that characterize nasalization:
- broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage, and
- introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
- build-up of pressure behind the palate, and
- abrupt release of the pressure.

Source Generation: Plosives - "Drop" (figure; VOT and aspiration are marked)

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
- There is no harmonic structure with these types of inputs.
- The source spectrum is shaped at all frequencies by |H(ω)|.

Note that the noise spectrum was idealized by assuming it to be flat. In reality these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives - "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
- This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
- Fricatives - sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
- Plosives - created with an impulsive source within the oral tract. Example: "top".
- Whispers - a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these subcategories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
- "zebra" vs "sheba" (fricatives)
- "bin" vs "pin" (plosives)

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.

Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced. This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example - Hamming window:

w[n, τ] = 0.54 - 0.46 cos[2π(n - τ)/(Nw - 1)] for 0 ≤ n ≤ Nw - 1

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is then

$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\,e^{-j\omega n}$$

where x[n, τ] = w[n, τ] x[n] represents the windowed speech segments as a function of the window center at time τ.
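The windowing and transform steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the helper name `stft`, the 8 kHz sampling rate, the 20 ms window, the 10 ms frame interval and the 1 kHz test tone are all assumptions, and frames are indexed by their start sample rather than by the window center τ.

```python
import numpy as np

def stft(x, win_len, hop):
    """Short-time Fourier transform with a Hamming window (hypothetical helper).
    Returns an array of shape (num_frames, win_len // 2 + 1)."""
    w = np.hamming(win_len)     # 0.54 - 0.46*cos(2*pi*n/(win_len - 1))
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.array([w * x[s:s + win_len] for s in starts])
    return np.fft.rfft(frames, axis=1)

fs = 8000                               # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                  # one second of samples
x = np.sin(2 * np.pi * 1000 * t)        # test signal: a 1 kHz tone

X = stft(x, win_len=160, hop=80)        # 20 ms window, 10 ms frame interval
S = np.abs(X) ** 2                      # squared magnitude of each frame
peak_bin = S.mean(axis=0).argmax()
peak_hz = peak_bin * fs / 160           # bin spacing is fs / win_len = 50 Hz
```

The frame interval (`hop`) sets how densely the time axis is sampled, and the window length sets the frequency resolution of each frame.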

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

S(ω, τ) = |X(ω, τ)|²

S(ω, τ) is a 2-D (two-dimensional) representation of the "energy density" of the signal.

For each window position τ one could plot S(ω, τ). A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
- Narrowband - gives good spectral resolution: a good view of the frequency content of sine waves with closely spaced frequencies.
- Wideband - gives good temporal resolution: a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n - kP]:

x[n, τ] = w[n, τ] {(p[n] * g[n]) * h[n]}

x[n, τ] = w[n, τ] (p[n] * ĥ[n])

where the glottal waveform over a cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2$$

where Ĥ(ω) = H(ω)G(ω), ω_k = (2π/P)k, and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window w[n, τ].

Narrowband spectrogram:
- Uses a "long" window, with a duration of typically at least two pitch periods.
- Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation in the previous slide leads to the following approximation (Exercise 3.8):

$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} \left|\hat{H}(\omega_k)\right|^2 \left|W(\omega-\omega_k,\tau)\right|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

- Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that closely precede a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- Widening the window transform causes the neighboring window transforms at adjacent harmonics to overlap and add, thus smearing out the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.

From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega,\tau) \approx \beta\,\left|\hat{H}(\omega)\right|^2 E[\tau]$$

where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} \left|x[n,\tau]\right|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

- Shows the formants of the vocal tract in frequency.
- Also gives vertical striations in time every pitch period, rather than the harmonic horizontal striations of the narrowband spectrogram.
- The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 (next slide) compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
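The contrast between the two window lengths can be reproduced on an idealized source. In this sketch the signal is a bare impulse train with an assumed 5 ms pitch period (200 Hz) at an assumed 8 kHz sampling rate; the helper `avg_spectrogram`, the hop size and the FFT size are illustrative choices, and only the 20 ms and 4 ms window durations come from the comparison above.

```python
import numpy as np

fs = 8000                        # sampling rate in Hz (assumed)
P = 40                           # pitch period: 5 ms, i.e. 200 Hz pitch (assumed)
x = np.zeros(fs)
x[::P] = 1.0                     # idealized periodic source: impulse train

def avg_spectrogram(x, win_len, hop=8, nfft=320):
    """Time-averaged spectrogram |X|^2 using a Hamming window (hypothetical helper)."""
    w = np.hamming(win_len)
    frames = np.array([w * x[s:s + win_len]
                       for s in range(0, len(x) - win_len + 1, hop)])
    return (np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2).mean(axis=0)

narrow = avg_spectrogram(x, win_len=160)   # 20 ms window: four pitch periods
wide = avg_spectrogram(x, win_len=32)      # 4 ms window: less than one pitch period

# With nfft = 320 the bin spacing is 25 Hz: bin 40 is the 1000 Hz harmonic and
# bin 44 is 1100 Hz, midway between the 1000 Hz and 1200 Hz harmonics.
narrow_ratio = narrow[40] / narrow[44]
wide_ratio = wide[40] / wide[44]
```

The long window yields a large ratio between a harmonic and the midpoint between harmonics (resolved harmonic lines), while the short window never contains more than one impulse, so its spectrum is flat across frequency and the pitch shows up only as frame-to-frame energy fluctuation.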

Figure 3.15

Figure 3.16

MATLAB (slides 68 to 71; each plot appears twice)
  • Spectrogram: frequency [Hz] (0 to 8000) versus time [s] (0 to 3).
  • Normalized waveform magnitude above a spectrogram (frequency [Hz] versus time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word labels and phone labels (sh iy, hv ae dcl jh, ax-h, dcl d aa r kcl k, s ux tcl, ax-h n, gcl g r iy s ix, w aa sh, w ao dx axr, ao l, y ih axr).

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
     • the place of the tongue hump along the oral tract,
     • the degree of constriction of the hump,
     • the possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
  • Phoneme: a fundamental distinctive unit of a language.
     • To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
     • Different languages contain different phoneme sets.
     • Syllables contain one or more phonemes.
     • Words are formed from one or more syllables.
     • Phrases are concatenations of words.
  • Studying speech sounds through the first two descriptors of the previous slide (source and vocal tract shape) is referred to as articulatory phonetics.
  • Studying speech sounds through the last two descriptors (waveform and spectrogram) is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of:
  • Vowels
  • Consonants
  • Diphthongs
  • Affricates
  • Semi-vowels

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language
  • In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
  • The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
  • The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
  • Motor theory of perception: uses articulatory features, obtained from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
  • Source: quasi-periodic. Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences among speakers are one cause of allophonic variations:
  • the place and degree of constriction of the tongue hump, and
  • the cross-section and length of the vocal tract,
so the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System:
     • The velum is lowered and the air flows mainly through the nasal cavity.
     • Because the oral tract is constricted, the sound is radiated at the nostrils.
     • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
     • Dominated by the low resonance of the large volume of the nasal cavity.
     • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
     • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
  • Source:
     • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
     • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
     • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
  • System: the location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
  • Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model, the periodic component of the voiced fricative output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise component arises from the turbulent airflow velocity source at the constriction, denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates this noise q[n], and the product in turn excites the front oral cavity, which has impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n] \, u[n])$$

Example 3.4 (cont.)
In the simplified model, the outputs of the two airflow sources are assumed to add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] \, u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  • u[n] is modified by the oral cavity.
  • x_q[n] can be influenced by the back cavity.
  • Sources of nonlinear effects (distributed sources due to traveling vortices).
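The simplified voiced fricative model of Example 3.4 can be sketched in a few lines of code. This is only an illustration under stated assumptions: the sampling rate, pitch period, glottal pulse shape, and the two decaying-cosine impulse responses standing in for h[n] and h_f[n] are hypothetical choices, not values from the slides.

```python
import math
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def resonator_ir(freq_hz, decay, fs, length):
    """Hypothetical decaying-cosine impulse response standing in for a cavity."""
    return [decay ** n * math.cos(2 * math.pi * freq_hz * n / fs) for n in range(length)]

fs = 8000         # assumed sampling rate in Hz
P = 80            # assumed pitch period in samples (100 Hz pitch)
n_periods = 10

# g[n]: one glottal cycle, open for 40 samples and closed (zero) for the rest.
g = [0.5 * (1 - math.cos(2 * math.pi * n / 40)) for n in range(40)]
# p[n]: impulse train with spacing P.
p = [1.0 if n % P == 0 else 0.0 for n in range(P * n_periods)]
# u[n] = g[n] * p[n], truncated to the length of p[n].
u = convolve(g, p)[:len(p)]

rng = random.Random(0)
q = [rng.gauss(0.0, 0.1) for _ in u]             # white noise q[n]
modulated = [qn * un for qn, un in zip(q, u)]    # q[n] u[n]

h = resonator_ir(500.0, 0.97, fs, 200)     # whole vocal tract h[n]
h_f = resonator_ir(3000.0, 0.90, fs, 60)   # front cavity h_f[n]

x_g = convolve(h, u)[:len(u)]              # periodic component
x_q = convolve(h_f, modulated)[:len(u)]    # noise component
x = [a + b for a, b in zip(x_g, x_q)]      # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by u[n], it is switched off whenever the glottis is closed, so in this sketch the noise component comes in bursts at the pitch rate.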

Elements of a Language: Fricatives
  • Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  • Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  • Whisper forms a class of its own under the general category of consonants.
  • The turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24: Fricatives

Figure 3.23

Elements of a Language: Plosives
  • Plosives form a class of sounds where the constriction is complete, however brief, and is followed by a burst of flow.
  • As with fricatives, plosives can be voiced or unvoiced.
  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  • Sequence of events:
     1. Complete closure of the oral tract and buildup of air pressure.
     2. Release of air pressure and generation of turbulence over a very short time duration.
     3. Generation of aspiration due to turbulence at the open vocal folds.
     4. Onset of the following vowel about 40-50 ms after the burst.
  • With voiced plosives, the vocal folds vibrate for the duration of all four steps:
     • While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
     • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
     • There is a much shorter delay between the burst and the voicing of the vowel onset.
  • Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (waveform)

Elements of a Language: Plosives (spectrogram)

Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive.
  • A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
  • Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n].
  • The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

    where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
  • Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
     • This implies that the vocal tract output cannot be obtained with the convolution operator.
     • The vocal tract output must therefore be computed using the time-varying impulse response concept introduced in Chapter 2.
  • In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by h_f[n, m].
  • h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
  • The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m] \, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \, \delta[n - m]$$

  • We have assumed that the two outputs can be linearly combined.
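The generalized convolution of Example 3.5 can be sketched as follows. This is an illustrative sketch under assumptions: the impulse responses are passed as hypothetical Python functions h(n, m), the inputs are taken to be zero before n = 0, and the one-pole responses used in the demonstration are toy choices rather than a real vocal tract model.

```python
def time_varying_output(h, u):
    """x[n] = sum_m h(n, m) * u[n - m], where h(n, m) is the response at time n
    to a unit sample applied m samples earlier. u[n] is assumed zero for n < 0."""
    return [sum(h(n, m) * u[n - m] for m in range(n + 1)) for n in range(len(u))]

def plosive_output(h, h_f, u, burst):
    """Linear combination of the periodic-source and burst-source outputs."""
    periodic_part = time_varying_output(h, u)
    burst_part = time_varying_output(h_f, burst)
    return [a + b for a, b in zip(periodic_part, burst_part)]

# Sanity check: if h(n, m) does not depend on n, the sum is ordinary convolution.
u = [1.0, 0.5, 0.0, 0.25]
x_fixed = time_varying_output(lambda n, m: 0.8 ** m, u)

# Toy time-varying front cavity: a one-pole decay whose rate drifts with n.
def h_f(n, m):
    return (0.5 + 0.1 * n) ** m

burst = [1.0, 0.0, 0.0, 0.0]                 # idealized burst, delta[n]
x_burst = time_varying_output(h_f, burst)    # equals h_f(n, n) at each n

x_total = plosive_output(lambda n, m: 0.8 ** m, h_f, u, burst)
```

With the burst idealized as δ[n], the second sum collapses to h_f[n, n], which is why x_burst simply reads the time-varying response along its diagonal.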

Elements of a Language: Transitional Speech Sounds
Diphthongs:
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds:
  • Glides (w as in "we" and y as in "you")
  • Liquids (r as in "read" and l as in "let")

Glides, compared to diphthongs, have:
  • greater constriction of the oral tract during the transition, and
  • greater speed of the oral tract movement.

Figure 3.28: Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: tS as in "chew" (unvoiced) and J as in "just" (voiced).

Coarticulation
  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  • Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
     • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
     • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
  • The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
     • intonation (change in pitch),
     • amplitude/energy (loudness),
     • timing (articulation rate or rhythm).
  • These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29: Prosody

Figure 3.30: Global Coarticulation

Page 16: Speech Processing

Anatomy and Physiology of Speech Production


Example 3.1
Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow waveform over a single cycle and p[n] is an impulse train with spacing P:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by w[n, τ], is centered at time τ, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as

$$u[n, \tau] = w[n, \tau] \, (g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following frequency-domain expression is obtained:

$$U(\omega, \tau) = \frac{1}{P} \, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega) \, \delta(\omega - \omega_k) \right]$$

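Example 3.1 (cont.)

$$U(\omega, \tau) = \frac{1}{P} \, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} G(\omega_k) \, \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k) \, W(\omega - \omega_k, \tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], G(ω) is the Fourier transform of g[n], and ω_k = (2π/P)k, where 2π/P is the fundamental frequency or pitch.
  • As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0, with lower surrounding side lobes.
  • Figure 3.7 also shows the effect of the harmonics of the glottal waveform on the spectrum.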

Figure 3.7

Example 3.1 (cont.)
  • A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k.
  • The first harmonic is also the fundamental frequency.
  • At each harmonic frequency there is a translated window Fourier transform W(ω - ω_k) weighted by G(ω_k).
  • The magnitude of the spectral shaping function, i.e. the glottal flow |G(ω)|, is referred to as the spectral envelope of the harmonics.
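The harmonic structure described in Example 3.1 can be checked numerically. The sketch below is illustrative only: the pitch period, the raised-cosine glottal pulse, and the eight-period Hamming window are assumed values. It evaluates the Fourier transform of the windowed segment at the harmonics ω_k = (2π/P)k and halfway between them.

```python
import cmath
import math

def dtft(x, omega):
    """Discrete-time Fourier transform of a finite sequence at frequency omega."""
    return sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x))

P = 50           # assumed pitch period in samples
n_periods = 8    # window spans eight pitch periods

# g[n]: raised-cosine glottal pulse of 20 samples; the rest of each cycle is zero.
g = [0.5 * (1 - math.cos(2 * math.pi * n / 20)) for n in range(20)]
# u[n] = g[n] * p[n]: the pulse repeated every P samples.
u = [(g[n % P] if n % P < len(g) else 0.0) for n in range(P * n_periods)]

# Windowed segment: Hamming window over the whole stretch.
N = len(u)
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
segment = [wn * un for wn, un in zip(w, u)]

def magnitude_at(k):
    """|U(omega)| at omega = (2*pi/P) * k; k may be fractional."""
    return abs(dtft(segment, 2 * math.pi * k / P))

harmonic_mags = {k: magnitude_at(k) for k in (1, 2, 3)}
between_mags = {k: magnitude_at(k + 0.5) for k in (1, 2)}

for k in (1, 2):
    print(f"harmonic {k}: {harmonic_mags[k]:.2f}, halfway to next: {between_mags[k]:.4f}")
```

The spectrum is concentrated at the harmonics, and the harmonic peaks fall off with k as the envelope |G(ω)| of the assumed pulse decreases.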

Anatomy and Physiology of Speech Production
  • The Fourier transform of the periodic glottal waveform is characterized by harmonics.
  • Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
     • The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
  • The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
     • The pitch period can vary "randomly" over successive periods: pitch "jitter".
     • The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
  • These variations are (perhaps) due to:
     • time-varying characteristics of the vocal tract and vocal folds,
     • nonlinear behavior in the speech anatomy, or
     • an underlying deterministic (chaotic) system whose output only appears random.
  • Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  • Voice character is determined by the extent of jitter and shimmer in the voice (e.g. a hoarse voice).

Anatomy and Physiology of Speech Production
States of the vocal folds:
  • Breathing
  • Voicing
  • Unvoicing
     • Turbulence at the vocal folds: aspiration. Example: "he"; whispered sounds.
     • Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part remains nearly fixed.

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  • Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
  • Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch.
     • Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse.
     • A result of coupling of the false vocal folds with the true vocal folds.
  • Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract
  • Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
  • The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
  • The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
  • The purpose of the vocal tract is to:
     • spectrally "color" the source, and
     • generate new sources for sound production.

Spectral Shaping
  • Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
  • The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
  • Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping
  • The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
  • For a time-invariant all-pole linear system model of the vocal tract, a pole at z_0 = r_0 e^{jω_0} corresponds approximately to a vocal tract formant:
     • The frequency of the formant is ω_0.
     • The bandwidth depends on the distance of the pole from the unit circle (r_0).
  • Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
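The relation between a pole and its formant can be sketched numerically for a single conjugate pole pair. The mapping from bandwidth to pole radius used here, r = exp(-πB/fs), is the standard narrow-bandwidth approximation and is an addition to the slides, as are the 8 kHz sampling rate and the 500 Hz formant with 50 Hz bandwidth.

```python
import cmath
import math

def pole_from_formant(freq_hz, bandwidth_hz, fs):
    """Pole z0 = r0 * exp(j*w0) for a formant, using r0 = exp(-pi * B / fs)."""
    r0 = math.exp(-math.pi * bandwidth_hz / fs)
    w0 = 2 * math.pi * freq_hz / fs
    return r0 * cmath.exp(1j * w0)

def resonator_gain(pole, freq_hz, fs):
    """|H| at freq_hz for H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1))."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))

fs = 8000   # assumed sampling rate in Hz
pole = pole_from_formant(500.0, 50.0, fs)

# Frequency response magnitude on a 1 Hz grid from 0 to fs/2.
gains = [resonator_gain(pole, f, fs) for f in range(0, fs // 2 + 1)]
peak_hz = max(range(len(gains)), key=gains.__getitem__)

# Measured half-power (-3 dB) bandwidth around the peak.
half_power = [f for f, gval in enumerate(gains) if gval >= gains[peak_hz] / math.sqrt(2)]
measured_bw_hz = half_power[-1] - half_power[0]

print(f"pole radius {abs(pole):.4f}, peak at {peak_hz} Hz, bandwidth about {measured_bw_hz} Hz")
```

Moving the pole closer to the unit circle (a smaller bandwidth argument) sharpens the peak, while changing the pole angle moves the formant frequency.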

Spectral Shaping
  • Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
  • In general, the formant frequencies decrease as the vocal tract length increases:
     • Male speakers tend to have lower formants than female speakers.
     • Female speakers tend to have lower formants than children.
  • Under the vocal tract's linearity and time-invariance assumption, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.

Vowels

Example 3.2
Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window w[n, τ], centered at time τ, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau] \, \{ h[n] * (g[n] * p[n]) \}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained:

Example 3.2 (cont.)

$$X(\omega, \tau) = \frac{1}{P} \, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k) \, W(\omega - \omega_k, \tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency or pitch.
  • Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, …, ω_N is determined by the spectral envelope |H(ω)G(ω)|, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only the glottal contribution).

Example 3.2 (Figure 3.11)

Example 3.2 (cont.)
  • The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
     • the nature of the glottal flow waveform over a cycle (e.g. a gradual or abrupt closing), and
     • the manner in which the formant tails add.
  • Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because the source harmonics sample the spectral envelope |H(ω)G(ω)| sparsely. This is especially the case for high-pitched speech.

Spectral Shaping
  • The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
     • A formant corresponds to a vocal tract pole (resonant frequency).
     • Harmonics arise from the periodicity of the glottal source (pitch).
  • In signal processing algorithms that require formants, the scarcity of spectral information can be detrimental to formant estimation.
  • On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency ω_1) is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
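The effect described in Example 3.3 can be illustrated with a single-formant toy model. The numbers below (a 700 Hz sung fundamental, a first formant moved from 350 Hz to 700 Hz, an 80 Hz bandwidth, a 16 kHz sampling rate) are assumptions chosen for illustration, and a real vocal tract has several formants; the sketch only shows how much more a resonance passes the first harmonic once it is tuned to it.

```python
import cmath
import math

def resonator_gain(formant_hz, bandwidth_hz, freq_hz, fs):
    """Gain at freq_hz of a two-pole resonator with the given formant and bandwidth."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    pole = r * cmath.exp(2j * math.pi * formant_hz / fs)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return 1.0 / abs((1 - pole * z_inv) * (1 - pole.conjugate() * z_inv))

fs = 16000    # assumed sampling rate in Hz
f0 = 700.0    # assumed sung fundamental (first harmonic) in Hz

gain_low_f1 = resonator_gain(350.0, 80.0, f0, fs)     # F1 well below the first harmonic
gain_tuned_f1 = resonator_gain(700.0, 80.0, f0, fs)   # F1 raised to the first harmonic

gain_db = 20 * math.log10(gain_tuned_f1 / gain_low_f1)
print(f"first-harmonic gain increases by about {gain_db:.1f} dB")
```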

Figure 3.12

Nasal Sounds

Spectral Shaping
  • The nasal and oral components of the vocal tract are coupled by the velum.
  • When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
  • The resulting sounds have a spectrum dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping
  • Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
  • The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
  • Two dominant effects characterize nasalization:
     • broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage, and
     • introduction of anti-resonances (i.e. zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation
  • The previous section discussed the effect of the vocal tract shape on sound production.
  • Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
     • build-up of pressure behind the palate, and
     • abrupt release of pressure.

Source Generation: Plosives, "Drop"
(figure labels: VOT, aspiration)

Fricatives

Source Generation
  • Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g. fricatives).
  • As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e. impulse or noise source):
     • There is no harmonic structure with these types of inputs.
     • The source spectrum is shaped at all frequencies by |H(ω)|.
  • Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation
  • Another class of source is generated within the vocal tract, but it is less understood than the noisy and impulsive sources at oral tract constrictions.
     • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
     • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source
  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
     • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
     • Plosives: created with an impulsive source within the oral tract. Example: "top".
     • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
  • However, fricatives and plosives are not exclusively unvoiced: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
     • "zebra" vs. "sheba" (fricatives)
     • "bin" vs. "pin" (plosives)

Categorization of Sound by Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
  • A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
  • A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
  • Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.
  • The window w[n, τ] is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum. Example, the Hamming window:

$$w[n, \tau] = 0.54 - 0.46 \cos\left[ \frac{2\pi (n - \tau)}{N_w - 1} \right], \quad 0 \le n \le N_w - 1$$

  • The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
  • The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] \, e^{-j\omega n}$$

    where

$$x[n, \tau] = w[n, \tau] \, x[n]$$

    represents the windowed speech segments as a function of the window center at time τ.

Spectrographic Analysis of Speech
  • The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

  • S(ω, τ) is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position τ, one could plot S(ω, τ).
  • A better and more compact time-frequency display places the spectral magnitude measurements vertically, either in a three-dimensional mesh or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
     • Narrowband: gives good spectral resolution, e.g. a good view of the frequency content of sine waves with closely spaced frequencies.
     • Wideband: gives good temporal resolution, e.g. a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$:

$$x[n,\tau] = w[n,\tau]\,\{(p[n] * g[n]) * h[n]\}$$

$$x[n,\tau] = w[n,\tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega-\omega_k,\tau)\right|^2$$

where $\hat{H}(\omega) = H(\omega)G(\omega)$, and where $\omega_k = \frac{2\pi}{P}k$ and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n,\tau]$.

Narrowband spectrogram: uses a "long" window with a duration of typically at least two pitch periods. Under the conditions that

- the main lobes of the shifted window Fourier transforms are non-overlapping, and
- the corresponding transform side lobes are negligible,

the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega-\omega_k,\tau)|^2$$

Spectrographic Analysis of Speech

Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech

Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).

Shortening the window widens its Fourier transform (recall the uncertainty principle).

The widened window transforms at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)

For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega,\tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2$$
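To illustrate why the energy under a short window rises and falls as the window slides, the sketch below computes the short-time energy of a hypothetical periodic waveform using a Hamming window shorter than the pitch period. The test signal (a decaying pulse repeated every 40 samples), the 16-sample window, and the function names are assumptions chosen only for illustration.

```python
import math


def hamming(n_w):
    """Hamming window of length n_w, indexed from the start of the frame."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n_w - 1)) for m in range(n_w)]


def short_time_energy(x, n_w, hop=1):
    """Energy of the windowed segment at each window position."""
    w = hamming(n_w)
    return [
        sum((w[m] * x[start + m]) ** 2 for m in range(n_w))
        for start in range(0, len(x) - n_w + 1, hop)
    ]


# Hypothetical "voiced" waveform: a decaying pulse every 40 samples.
pitch_period = 40
x = [0.8 ** (n % pitch_period) for n in range(200)]

# The 16-sample window is shorter than one pitch period.
energy = short_time_energy(x, n_w=16)
print("max/min energy ratio:", max(energy) / min(energy))
```

The energy sequence repeats once per pitch period, which is the behaviour that produces vertical striations when the spectrogram is scaled by this term.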

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency.

Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
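The two window durations quoted above can be related to frequency resolution with a short calculation. The sketch below assumes a 16 kHz sampling rate and a 100 Hz pitch (neither is stated in the slides) and uses the common textbook approximation that the null-to-null main-lobe width of a Hamming window of $N_w$ samples is about $4 f_s / N_w$ Hz. The function name is hypothetical.

```python
def window_summary(duration_s, fs_hz, f0_hz):
    """Return (window length in samples, approx. main-lobe width in Hz,
    window length in pitch periods) for a Hamming window."""
    n_w = round(duration_s * fs_hz)
    mainlobe_hz = 4.0 * fs_hz / n_w      # approximate null-to-null width
    pitch_periods = duration_s * f0_hz   # window duration / pitch period
    return n_w, mainlobe_hz, pitch_periods


FS_HZ = 16000   # assumed sampling rate
F0_HZ = 100     # assumed pitch

for label, duration in (("narrowband", 0.020), ("wideband", 0.004)):
    n_w, width, periods = window_summary(duration, FS_HZ, F0_HZ)
    print(label, n_w, "samples,", width, "Hz main lobe,", periods, "pitch periods")
```

Under these assumptions the 20-ms window spans about two pitch periods, while the 4-ms window spans less than one and has a main lobe several harmonics wide.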

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]), annotated with word and phone labels for the utterance "She had your dark suit in greasy wash water all year".]

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]), annotated with word and phone labels for the utterance "She had your dark suit in greasy wash water all year".]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of the constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.

If the first two of the four descriptors above (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:

- Vocal fold state: vibrating or open
- Tongue position and height: front, central, or back along the palate
- Constriction: partial or complete
- Velum state: nasal sound or not a nasal sound

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, on the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ, and therefore the vocal tract formants vary with speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
- Is dominated by the low resonance of the large volume of the nasal cavity.
- The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
- Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
- For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
- For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
- Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction made by the tongue (at the back, center, or front of the oral tract) or by the teeth or lips determines which sound is produced.

Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model of a voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4

In the simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
- u[n] is modified by the oral cavity.
- $x_q[n]$ can be influenced by the back cavity.
- Sources of nonlinear effects (distributed sources due to traveling vortices).
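The simplified voiced-fricative model of Example 3.4 can be sketched numerically. The code below is an illustration under stated assumptions rather than the textbook's implementation: the glottal pulse shape, the impulse responses, the Gaussian white-noise generator, and all function names are hypothetical choices. It builds the periodic glottal flow, modulates the noise by it, filters each branch, and adds the results.

```python
import random


def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def voiced_fricative(g, P, n_pulses, h, h_f, seed=0):
    """Return (x, u) with x = h*u + h_f*(q.u) and u = g*p (* is convolution)."""
    p = [0.0] * (P * n_pulses)
    for k in range(n_pulses):
        p[k * P] = 1.0                       # impulse train with period P
    u = convolve(g, p)[:len(p)]              # periodic glottal flow
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in u]     # white noise source (assumed Gaussian)
    x_g = convolve(h, u)                     # periodic branch
    x_q = convolve(h_f, [q_n * u_n for q_n, u_n in zip(q, u)])  # modulated-noise branch
    n = min(len(x_g), len(x_q))
    return [x_g[i] + x_q[i] for i in range(n)], u


# Hypothetical shapes: a short triangular glottal pulse and two decaying responses.
g = [0.0, 0.5, 1.0, 0.5]
h = [0.9 ** n for n in range(30)]
h_f = [(-0.7) ** n for n in range(10)]
x, u = voiced_fricative(g, P=20, n_pulses=5, h=h, h_f=h_f)
print(len(x), "output samples")
```

Because the noise is multiplied by u[n], the noise branch contributes only while the glottal flow is nonzero, so the noise bursts are synchronized with the pitch periods.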

Elements of a Language: Fricatives

Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
- During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
- After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
- There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: a time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$. Here h[n, m] and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
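The generalized convolution of Example 3.5 can be sketched as a direct superposition sum. In the code below the time-varying impulse responses are passed as functions of (n, m); the causal truncation to 0 <= m <= m_max, the particular response functions, and all names are assumptions made only for illustration.

```python
def superposition_sum(h, s, n_out, m_max):
    """x[n] = sum over m = 0..m_max of h(n, m) * s[n - m].

    h(n, m) is the response at time n to a unit sample applied m samples
    earlier; s is treated as zero outside its index range.
    """
    return [
        sum(h(n, m) * s[n - m] for m in range(m_max + 1) if 0 <= n - m < len(s))
        for n in range(n_out)
    ]


def voiced_plosive(h, h_f, u, n_out, m_max):
    """Add the periodic branch (u through h) and the burst branch (impulse through h_f)."""
    burst = [1.0] + [0.0] * (n_out - 1)      # idealized burst: unit impulse at n = 0
    x_u = superposition_sum(h, u, n_out, m_max)
    x_b = superposition_sum(h_f, burst, n_out, m_max)
    return [a + b for a, b in zip(x_u, x_b)]


# Hypothetical responses whose decay rate drifts with time n.
def h(n, m):
    return (0.5 + 0.004 * n) ** m


def h_f(n, m):
    return (-0.6) ** m


u = [1.0 if n % 20 == 0 else 0.0 for n in range(60)]   # crude periodic source
x = voiced_plosive(h, h_f, u, n_out=60, m_max=30)
print(x[:5])
```

If h(n, m) does not depend on n, the sum reduces to ordinary convolution, which is a convenient check on the indexing.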

Elements of a Language: Transitional Speech Sounds

Diphthongs:
- Vowel-like in nature, with vibrating vocal folds.
- Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
- Characterized by movement from one vowel target to another.

Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds,
- Glides (w as in "we" and y as in "you"), and
- Liquids (r as in "read" and l as in "let").

Glides, compared to diphthongs, have
- greater constriction of the oral tract during the transition, and
- greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference from fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: tS as in "chew" (unvoiced); J as in "just" (voiced).

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.

Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

- Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
- Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
- Intonation (change in pitch)
- Amplitude/energy (loudness)
- Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 17: Speech Processing

Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow waveform over a single cycle and $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$ is an impulse train with spacing P.

Because the waveform is infinitely long, a segment is extracted by multiplying u[n] by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n,\tau]$, is centered at time $\tau$, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as

$$u[n,\tau] = w[n,\tau]\,(g[n] * p[n])$$

Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain:

$$U(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega-\omega_k)\right]$$

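Example 3.1

$$U(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega_k)\,\delta(\omega-\omega_k)\right]$$

$$U(\omega,\tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega-\omega_k,\tau)$$

where $W(\omega,\tau)$ is the Fourier transform of $w[n,\tau]$, $G(\omega)$ is the Fourier transform of g[n], and $\omega_k = \frac{2\pi}{P}k$, where $\frac{2\pi}{P}$ is the fundamental frequency, or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$, with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum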

Figure 3.7

Example 3.1

A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = \frac{2\pi}{P}k$.

The first harmonic is also the fundamental frequency.

At each harmonic frequency there is a translated window Fourier transform $W(\omega-\omega_k,\tau)$ weighted by $G(\omega_k)$.

The magnitude of the spectral shaping function, i.e., of the glottal flow transform, $|G(\omega_k)|$, is referred to as the spectral envelope of the harmonics.
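The relationship between pitch period and harmonic spacing can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical sampling rate so that harmonics can be quoted in Hz, and evaluates the weights $|G(\omega_k)|/P$ that scale each translated window transform in Example 3.1; the pulse shape and function names are illustrative assumptions.

```python
import cmath
import math


def harmonic_frequencies_hz(P, fs_hz, n_harmonics):
    """Harmonics omega_k = 2*pi*k/P converted to Hz for a pitch period of P samples."""
    return [k * fs_hz / P for k in range(1, n_harmonics + 1)]


def dtft(g, omega):
    """Fourier transform of the finite sequence g evaluated at frequency omega."""
    return sum(g_n * cmath.exp(-1j * omega * n) for n, g_n in enumerate(g))


def harmonic_weights(g, P, n_harmonics):
    """|G(omega_k)| / P: the weight on the window transform at each harmonic."""
    return [abs(dtft(g, 2 * math.pi * k / P)) / P for k in range(1, n_harmonics + 1)]


FS_HZ = 8000.0                                  # assumed sampling rate
print(harmonic_frequencies_hz(80, FS_HZ, 3))    # longer pitch period
print(harmonic_frequencies_hz(40, FS_HZ, 3))    # shorter pitch period, wider spacing

g = [0.0, 0.5, 1.0, 0.5]                        # hypothetical glottal pulse
print(harmonic_weights(g, 80, 5))
```

Halving P doubles the spacing between harmonics, while the weights follow the envelope of the single-cycle glottal pulse.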

Anatomy and Physiology of Speech Production

The Fourier transform of a periodic glottal waveform is characterized by harmonics.

Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time:
- The pitch period can vary "randomly" over successive periods (pitch "jitter").
- The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods (amplitude "shimmer").

These variations are (perhaps) due to
- time-varying characteristics of the vocal tract and vocal folds,
- nonlinear behavior in the speech anatomy, or
- an underlying deterministic (chaotic) system whose output appears random.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).

Anatomy and Physiology of Speech Production

States of the vocal folds:
- Breathing
- Voicing
- Unvoicing: turbulence at the vocal folds (aspiration). Example: "he" (whispered sounds).

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part is nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:

- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to, and overlapping, the primary glottal pulse, the result of coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract

The vocal tract comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to
- spectrally "color" the source, and
- generate new sources for sound production.

Spectral Shaping

Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
- The frequency of the formant is $\omega_0$.
- The bandwidth depends on the distance of the pole from the unit circle ($r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
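The link between a pole location and a formant can be made concrete with a small numerical sketch. The code below evaluates the product form of the all-pole transfer function on the unit circle and converts between a pole and a (frequency, bandwidth) pair. The conversion uses the commonly quoted approximation that a pole of radius r has a bandwidth of about $-\ln(r)\, f_s/\pi$ Hz; that formula, the sampling rate, and the function names are assumptions not taken from the slides.

```python
import cmath
import math


def pole_from_formant(f_hz, bw_hz, fs_hz):
    """Pole c = r * exp(j*omega_0) for a formant at f_hz with bandwidth bw_hz."""
    r = math.exp(-math.pi * bw_hz / fs_hz)     # distance from the origin (< 1)
    omega_0 = 2 * math.pi * f_hz / fs_hz       # pole angle = formant frequency
    return cmath.rect(r, omega_0)


def formant_from_pole(c, fs_hz):
    """Inverse of pole_from_formant: returns (frequency in Hz, bandwidth in Hz)."""
    return cmath.phase(c) * fs_hz / (2 * math.pi), -math.log(abs(c)) * fs_hz / math.pi


def all_pole_response(poles, omega, gain=1.0):
    """H(e^{j omega}) = gain / prod_k (1 - c_k z^-1)(1 - c_k* z^-1) at z = e^{j omega}."""
    z_inv = cmath.exp(-1j * omega)
    denominator = 1.0
    for c in poles:
        denominator *= (1 - c * z_inv) * (1 - c.conjugate() * z_inv)
    return gain / denominator


FS_HZ = 8000.0                                   # assumed sampling rate
poles = [pole_from_formant(500.0, 50.0, FS_HZ),  # hypothetical F1
         pole_from_formant(1500.0, 80.0, FS_HZ)] # hypothetical F2
for f in (250.0, 500.0, 1000.0, 1500.0):
    print(f, abs(all_pole_response(poles, 2 * math.pi * f / FS_HZ)))
```

Moving a pole closer to the unit circle narrows the resonance and raises its peak, which is the bandwidth dependence on $r_0$ noted above.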

Spectral Shaping

The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
- Male speakers tend to have lower formants than female speakers.
- Female speakers tend to have lower formants than children.

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n,\tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n,\tau] = w[n,\tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained.

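Example 3.2

$$X(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) \circledast \left[\sum_{k=-\infty}^{\infty} H(\omega_k)\,G(\omega_k)\,\delta(\omega-\omega_k)\right]$$

$$X(\omega,\tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\,G(\omega_k)\, W(\omega-\omega_k,\tau)$$

where $W(\omega,\tau)$ is the Fourier transform of $w[n,\tau]$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency, or pitch.

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, consisting of glottal and vocal tract contributions (unlike Example 3.1, which has only the glottal contribution).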

Example 3.2

Example 3.2

The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by
- the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
- the manner in which the formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega,\tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
- A formant corresponds to a vocal tract pole (resonant frequency).
- Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling at the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (the fundamental frequency, $\omega_1$) is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
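The effect described in Example 3.3 can be illustrated with a single two-pole resonator standing in for the first formant. The sketch below compares the gain at the first harmonic when F1 lies well below it and when F1 is moved to match it. The 700 Hz fundamental, the 400 Hz and 700 Hz formant values, the 80 Hz bandwidth, the 16 kHz sampling rate, and the bandwidth-to-radius approximation are all hypothetical choices, not values from the slides.

```python
import cmath
import math


def resonator_gain(f_hz, formant_hz, bw_hz, fs_hz):
    """Magnitude response at f_hz of a two-pole resonator with the given formant."""
    r = math.exp(-math.pi * bw_hz / fs_hz)                   # approximate pole radius
    c = cmath.rect(r, 2 * math.pi * formant_hz / fs_hz)      # pole at the formant
    z_inv = cmath.exp(-2j * math.pi * f_hz / fs_hz)
    return abs(1.0 / ((1 - c * z_inv) * (1 - c.conjugate() * z_inv)))


FS_HZ = 16000.0       # assumed sampling rate
F0_HZ = 700.0         # assumed soprano fundamental (first harmonic)

gain_untuned = resonator_gain(F0_HZ, 400.0, 80.0, FS_HZ)     # F1 well below the harmonic
gain_tuned = resonator_gain(F0_HZ, F0_HZ, 80.0, FS_HZ)       # F1 raised to the harmonic
print("gain ratio (tuned / untuned):", gain_tuned / gain_untuned)
```

With these illustrative numbers the harmonic is several times stronger once the formant is tuned to it, which is the mechanism the example attributes to the widened jaw.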

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:
- Broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage.
- Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of the vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive), which involves a build-up of pressure behind the palate followed by an abrupt release of that pressure.

Source Generation: Plosives, "Drop"

[Figure: the word "drop", annotated with the voice onset time (VOT) and the aspiration]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), which generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source). There is no harmonic structure with these types of input: the source spectrum is shaped at all frequencies by |H(ω)|.

Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:

Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these unvoiced categories do not relate exclusively to the sound source: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives).

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Consider, for example, the word "to": a single Fourier transform of the entire acoustic signal of the word cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window. This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: the Hamming window

$$w[n,\tau] = 0.54 - 0.46\cos\left[\frac{2\pi (n-\tau)}{N_w - 1}\right], \qquad 0 \le n - \tau \le N_w - 1$$

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n}$$

where x[n, τ] = w[n, τ] x[n] represents the windowed speech segments as a function of the window center at time τ.
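The windowing and framing steps above can be sketched in a few lines. This is a minimal sketch, not an optimized implementation: it uses a direct DFT instead of an FFT, indexes time relative to the start of each segment (which changes only the phase, not the magnitude), and the names `hamming`, `stft_frame`, and `stft`, the 8 kHz sampling rate, and the 1 kHz test tone are all assumptions made for illustration.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w (n_w >= 2), tapered at both ends."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1))
            for n in range(n_w)]


def stft_frame(x, start, window, n_freq):
    """DFT of one windowed segment, evaluated at omega_k = 2*pi*k/n_freq."""
    seg = [window[i] * x[start + i] for i in range(len(window))]
    return [sum(s * cmath.exp(-2j * math.pi * k * i / n_freq)
                for i, s in enumerate(seg))
            for k in range(n_freq)]


def stft(x, window, hop, n_freq):
    """Slide the window by `hop` samples (the frame interval)."""
    last_start = len(x) - len(window)
    return [stft_frame(x, s, window, n_freq)
            for s in range(0, last_start + 1, hop)]


fs = 8000                                              # assumed sampling rate [Hz]
x = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(400)]
w = hamming(80)                                        # 10 ms window
frames = stft(x, w, hop=40, n_freq=80)                 # 5 ms frame interval
peak_bin = max(range(41), key=lambda k: abs(frames[0][k]))
print(len(frames), "frames; peak at", peak_bin * fs / 80, "Hz")
```

The frame interval (`hop`) is chosen independently of the window length, which is the point made above about the window not moving one sample at a time.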

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega,\tau) = |X(\omega,\tau)|^2$$

S(ω, τ) is a two-dimensional (2-D) representation of the "energy density" of the signal.

For each window position τ, one could plot S(ω, τ). A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:

Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n − kP]:

$$x[n,\tau] = w[n,\tau]\,\{(p[n] * g[n]) * h[n]\} = w[n,\tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over one cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega-\omega_k,\tau)\right|^2$$

where Ĥ(ω) = H(ω)G(ω), ω_k = (2π/P)k, and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n, τ].

Narrowband Spectrogram

Uses a "long" window with a duration of typically at least two pitch periods. Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega-\omega_k,\tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive that is closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14). Shortening the window widens its Fourier transform (recall the uncertainty principle). This widening causes the window transforms centered at neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega,\tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

The wideband spectrogram shows the formants of the vocal tract in frequency. It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
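The contrast between the two window lengths can be checked on an idealized voiced source. The sketch below assumes an impulse train with a 5 ms pitch period (200 Hz) sampled at 8 kHz; these values, and the helper names, are hypothetical and chosen so that the 20 ms window spans four pitch periods while the 4 ms window spans less than one. It compares the spectrum magnitude at a harmonic (1000 Hz) with the magnitude halfway between two harmonics (1100 Hz).

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w (n_w >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1))
            for n in range(n_w)]


def dtft_mag(seg, f_hz, fs_hz):
    """Magnitude of the discrete-time Fourier transform of seg at f_hz."""
    omega = 2.0 * math.pi * f_hz / fs_hz
    return abs(sum(s * cmath.exp(-1j * omega * n) for n, s in enumerate(seg)))


def harmonic_contrast(x, start, n_w, fs_hz, f_harmonic, f_between):
    """Ratio of the windowed spectrum at a harmonic to that between harmonics."""
    w = hamming(n_w)
    seg = [w[i] * x[start + i] for i in range(n_w)]
    return dtft_mag(seg, f_harmonic, fs_hz) / dtft_mag(seg, f_between, fs_hz)


fs = 8000
P = 40                                   # pitch period: 40 samples = 5 ms
x = [1.0 if n % P == 0 else 0.0 for n in range(400)]

# Narrowband: 20 ms window (160 samples) covering four pitch pulses.
narrow = harmonic_contrast(x, 0, 160, fs, 1000.0, 1100.0)
# Wideband: 4 ms window (32 samples) centered on the single pulse at n = 40.
wide = harmonic_contrast(x, 24, 32, fs, 1000.0, 1100.0)
print(f"narrowband contrast: {narrow:.1f}, wideband contrast: {wide:.3f}")
```

The long window yields a large ratio (the harmonics are resolved), whereas the short window sees a single pulse and yields a ratio of one (no harmonic structure), consistent with the description above.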

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram; horizontal axis Time [s] (0 to 3), vertical axis Frequency [Hz] (0 to 8000)]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s]) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone-level and word-level labels]

MATLAB

[Figure: spectrogram; horizontal axis Time [s] (0 to 3), vertical axis Frequency [Hz] (0 to 8000)]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s]) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone-level and word-level labels]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.

If the first two descriptors (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors (the waveform and the spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian and 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception: uses articulatory features, obtained from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, there is much variability in vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram: dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction by the tongue (at the back, center, or front of the oral tract), as well as at the teeth or lips, determines which sound is produced.

Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model of the voiced fricative, the output at the lips due to the periodic source is

x_g[n] = h[n] * (g[n] * p[n])

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component, i.e., the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:

x_q[n] = h_f[n] * (q[n] u[n])

Example 3.4 (cont.)

We assume in this simplified model that the results of the two airflow sources add:

x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored: u[n] is modified by the oral cavity; x_q[n] can be influenced by the back cavity; and there are sources of nonlinear effects (distributed sources due to traveling vortices).
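The simplified voiced-fricative model of Example 3.4 can be simulated directly. The following is a minimal sketch: the glottal pulse `g`, the impulse responses `h` and `h_f`, the pitch period, and the Gaussian noise are all hypothetical placeholders chosen only so that the structure of the model is visible; they are not measured speech quantities.

```python
import random


def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, h, h_f, period, length, seed=0):
    """Return (x, x_g, x_q) with x = h*u + h_f*(q u) and u = g*p."""
    rng = random.Random(seed)
    p = [1.0 if n % period == 0 else 0.0 for n in range(length)]
    u = convolve(g, p)[:length]                       # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]  # white noise source
    x_g = convolve(h, u)[:length]                     # periodic component
    modulated = [qn * un for qn, un in zip(q, u)]     # noise gated by u[n]
    x_q = convolve(h_f, modulated)[:length]           # noise component
    x = [a + b for a, b in zip(x_g, x_q)]
    return x, x_g, x_q


g = [0.0, 0.5, 1.0, 0.5]      # hypothetical glottal pulse over one cycle
h = [1.0, 0.6, 0.36]          # hypothetical vocal tract impulse response
h_f = [1.0, -0.5]             # hypothetical front-cavity impulse response
x, x_g, x_q = voiced_fricative(g, h, h_f, period=20, length=100)
```

Because the noise is multiplied by u[n], the noise component x_q[n] appears only while the glottal flow is nonzero, so the noise bursts recur at the pitch rate.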

Elements of a Language: Fricatives

Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

The whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24: Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:

1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar". After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n, m].

h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n − m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\, u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\, \delta[n-m]$$

We have assumed that the two outputs can be linearly combined.
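The generalized (time-varying) convolution of Example 3.5 can be sketched as follows. This is an illustrative sketch only: the responses `drifting_decay` and `fixed_decay` are hypothetical exponential decays invented for this example, and the sum over m is truncated to causal lags up to `max_lag`.

```python
def time_varying_output(u, h, max_lag):
    """y[n] = sum_m h(n, m) * u[n - m] for causal lags 0 <= m <= max_lag.

    h(n, m) is the response at time n to a unit sample applied m samples
    earlier, at time n - m.
    """
    out = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n, max_lag) + 1):
            acc += h(n, m) * u[n - m]
        out.append(acc)
    return out


def drifting_decay(n, m):
    """Hypothetical response whose decay rate changes with time n."""
    a = 0.5 + 0.004 * n          # stays below 1 for n < 125
    return a ** m


def fixed_decay(n, m):
    """Time-invariant special case: the response does not depend on n."""
    return 0.5 ** m


burst = [1.0] + [0.0] * 49       # idealized burst source, delta[n]
y_burst = time_varying_output(burst, drifting_decay, max_lag=49)

# With a response that ignores n, the sum reduces to ordinary convolution.
y_fixed = time_varying_output([1.0, 2.0, 3.0], fixed_decay, max_lag=2)
```

For the impulse input the output is simply h[n, n], which shows how the response to the burst is read off along the diagonal of the time-varying impulse response.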

Elements of a Language: Transitional Speech Sounds

Diphthongs have a vowel-like nature, with vibrating vocal folds. They do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations, and they are characterized by movement from one vowel target to another.

Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels comprise two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").

Glides involve a greater constriction of the oral tract during the transition, and a greater speed of oral tract movement, compared to diphthongs.

Figure 3.28: Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. Compared to fricatives, affricates have a fricative portion that is preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).

Coarticulation

The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe"; "sweep" vs. "seep".
Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude or energy (loudness), and timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29: Prosody

Figure 3.30: Global Coarticulation

Page 18: Speech Processing

Example 3.1

$$U(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) \circledast \left[\sum_{k=-\infty}^{\infty} G(\omega)\,\delta(\omega-\omega_k)\right] = \frac{1}{P}\sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega-\omega_k,\tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], G(ω) is the Fourier transform of g[n], ω_k = (2π/P)k, and 2π/P is the fundamental frequency or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at ω = 0 with lower surrounding side lobes.

Effect of the harmonics of the glottal waveform on the spectrum:

Figure 3.7

Example 3.1

A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, ω_k = (2π/P)k. The first harmonic is also the fundamental frequency. At each harmonic frequency there is a translated window Fourier transform W(ω − ω_k) weighted by G(ω_k).

The magnitude of the spectral shaping function, i.e., of the glottal flow, |G(ω_k)|, is referred to as the spectral envelope of the harmonics.
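The inverse relation between pitch period and harmonic spacing can be made concrete with a few lines of arithmetic. This is a small sketch; the function name, the 8 kHz sampling rate, and the two pitch periods are assumptions chosen for illustration.

```python
def harmonic_frequencies_hz(pitch_period_samples, fs_hz, max_hz):
    """Harmonics k * fs / P (k = 1, 2, ...) that do not exceed max_hz."""
    f0 = fs_hz / pitch_period_samples      # fundamental = first harmonic
    harmonics = []
    k = 1
    while k * f0 <= max_hz:
        harmonics.append(k * f0)
        k += 1
    return harmonics


fs = 8000
long_period = harmonic_frequencies_hz(80, fs, 500)    # P = 10 ms
short_period = harmonic_frequencies_hz(40, fs, 500)   # P = 5 ms
print(long_period)    # harmonics spaced 100 Hz apart
print(short_period)   # harmonics spaced 200 Hz apart
```

Halving the pitch period doubles the spacing, so a higher-pitched voice samples the spectral envelope at fewer points in the same frequency range.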

Anatomy and Physiology of Speech Production

The Fourier transform of the periodic glottal waveform is characterized by harmonics. Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time. The pitch period can vary "randomly" over successive periods (pitch "jitter"), and the amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods (amplitude "shimmer").

These variations are perhaps due to time-varying characteristics of the vocal tract and vocal folds or to nonlinear behavior in the speech anatomy, or they may appear random while being the result of an underlying deterministic (chaotic) system.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
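Jitter and shimmer can be illustrated by perturbing an ideal pulse train. The sketch below is only an illustration of the two terms: it models the perturbations as independent uniform random deviations, which is an assumption made here for simplicity and not a claim about how real vocal folds behave. The function name and parameter values are hypothetical.

```python
import random


def glottal_pulse_train(n_pulses, period, jitter=0, shimmer=0.0, seed=0):
    """Return (positions, amplitudes) of successive glottal pulses.

    jitter:  maximum deviation of each pitch period, in samples (integer).
    shimmer: maximum relative deviation of each pulse amplitude.
    """
    rng = random.Random(seed)
    positions, amplitudes = [], []
    t = 0
    for _ in range(n_pulses):
        positions.append(t)
        amplitudes.append(1.0 + rng.uniform(-shimmer, shimmer))
        t += period + rng.randint(-jitter, jitter)
    return positions, amplitudes


# Ideal ("machine-like") source: fixed period and fixed amplitude.
ideal_pos, ideal_amp = glottal_pulse_train(5, 80)

# Perturbed source: period varies by up to 3 samples, amplitude by up to 10%.
pos, amp = glottal_pulse_train(50, 80, jitter=3, shimmer=0.1)
periods = [b - a for a, b in zip(pos, pos[1:])]
```

Setting both parameters to zero recovers the ideal model of Example 3.1, while nonzero values give period-to-period variation in timing and amplitude.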

Anatomy and Physiology of Speech Production

States of the vocal folds: breathing, voicing, and unvoicing.

Turbulence at the vocal folds is called aspiration. Example: "he"; whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates and part is nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:

Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.

Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, the result of coupling of the false vocal folds with the true vocal folds.

Diplophonic voice: secondary glottal pulses occur between the primary pulses within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract

The vocal tract comprises the oral cavity, from the larynx to the lips, and includes the nasal passage, which is coupled to the oral tract by way of the velum.

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

Spectral Shaping

Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Consider a time-invariant all-pole linear system model of the vocal tract with a pole at z_0 = r_0 e^{jω_0} that corresponds approximately to a vocal tract formant: the frequency of the formant is ω_0, and its bandwidth depends on the distance of the pole from the unit circle (r_0). Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
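The relation between a pole and its formant can be sketched numerically. The sketch below assumes the commonly used approximation that a pole of radius r_0 near the unit circle gives a 3 dB bandwidth of about −ln(r_0)·f_s/π Hz; this formula, the function name, and the 8 kHz sampling rate are assumptions added here for illustration and are not stated on the slide.

```python
import cmath
import math


def formant_from_pole(pole, fs_hz):
    """Return (frequency_hz, bandwidth_hz) for a pole r0 * exp(j * w0).

    Frequency comes from the pole angle; bandwidth uses the approximation
    B = -ln(r0) * fs / pi, reasonable for poles close to the unit circle.
    """
    r0 = abs(pole)
    if not 0.0 < r0 < 1.0:
        raise ValueError("a stable pole must lie strictly inside the unit circle")
    w0 = abs(cmath.phase(pole))                 # radians per sample
    frequency_hz = w0 * fs_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * fs_hz / math.pi
    return frequency_hz, bandwidth_hz


fs = 8000.0
# Hypothetical pole placed for a 500 Hz formant with a 100 Hz bandwidth.
pole = cmath.rect(math.exp(-math.pi * 100.0 / fs), 2.0 * math.pi * 500.0 / fs)
freq, bw = formant_from_pole(pole, fs)
print(f"formant at {freq:.1f} Hz, bandwidth {bw:.1f} Hz")
```

Moving the pole closer to the unit circle (r_0 approaching 1) narrows the bandwidth, while changing its angle moves the formant frequency.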

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases: male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
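The convolution statement above can be sketched with a toy source-filter synthesis. This is a minimal sketch under stated assumptions: the vocal tract is reduced to a single two-pole resonator with the closed-form impulse response h[n] = r^n sin((n+1)θ)/sin θ, and the glottal pulse, formant frequency, bandwidth, and pitch period are hypothetical values chosen for illustration.

```python
import math


def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def resonator_impulse_response(formant_hz, bandwidth_hz, fs_hz, length):
    """Truncated impulse response of a two-pole resonator (one formant)."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)
    theta = 2.0 * math.pi * formant_hz / fs_hz
    return [r ** n * math.sin((n + 1) * theta) / math.sin(theta)
            for n in range(length)]


fs = 8000
g = [0.0, 0.5, 1.0, 0.5]                                  # glottal pulse (hypothetical)
p = [1.0 if n % 80 == 0 else 0.0 for n in range(240)]     # 100 Hz impulse train
h = resonator_impulse_response(500.0, 100.0, fs, 60)      # single-formant tract

u = convolve(g, p)                     # glottal flow input u[n] = g[n] * p[n]
x = convolve(h, u)                     # vocal tract output x[n] = h[n] * u[n]
x_alt = convolve(convolve(h, g), p)    # same output, grouping h and g first
```

Grouping h and g first corresponds to treating the glottal pulse and the vocal tract as one combined filter driven by the impulse train, which gives the same output because convolution is associative.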

Vowels

April 22 2023 Veton Keumlpuska 31

April 22 2023 Veton Keumlpuska 32

Example 3.2
Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.

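Example 3.2

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency (pitch).
Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which involves only the glottal contribution).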

Example 3.2 (Figure 3.11)

Example 3.2
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
The nature of the glottal flow waveform over a cycle, e.g., a gradual or abrupt closing.
The manner in which formant tails add.
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
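The sparse-sampling effect can be made concrete with a small Python sketch. It lists the harmonic frequencies below a band edge for a low and a high pitch and measures how far the nearest harmonic lies from a formant. The pitch values (100 Hz and 400 Hz), the 730 Hz formant, the 4 kHz band edge, and the function names are illustrative assumptions, not values from the slides.

```python
import numpy as np

def harmonic_frequencies_hz(pitch_hz, max_hz):
    """Harmonics k * pitch_hz (k = 1, 2, ...) that do not exceed max_hz."""
    count = int(np.floor(max_hz / pitch_hz))
    return pitch_hz * np.arange(1, count + 1)

def nearest_harmonic_offset_hz(pitch_hz, formant_hz, max_hz=4000.0):
    """Distance from the formant to the closest harmonic sample of the envelope."""
    harmonics = harmonic_frequencies_hz(pitch_hz, max_hz)
    return float(np.min(np.abs(harmonics - formant_hz)))

low_pitch = harmonic_frequencies_hz(100.0, 4000.0)    # dense sampling of the envelope
high_pitch = harmonic_frequencies_hz(400.0, 4000.0)   # sparse sampling of the envelope

offset_low = nearest_harmonic_offset_hz(100.0, 730.0)
offset_high = nearest_harmonic_offset_hz(400.0, 730.0)
print(len(low_pitch), len(high_pitch), offset_low, offset_high)
```

With the higher pitch there are four times fewer samples of the envelope in the same band, and the closest harmonic sits farther from the formant peak, which is why formant estimation becomes harder.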

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).
In developing signal processing algorithms that require formants, the sparse sampling of the spectral envelope by the harmonics can be detrimental to formant estimation.
On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
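A minimal numerical sketch of this formant-tuning effect, under stated assumptions: the vocal tract is reduced to a single two-pole resonance, the sung fundamental is taken as 800 Hz, and F1 is moved from 500 Hz to 800 Hz. These numbers, the 80 Hz bandwidth, and the function name `resonator_gain_db` are hypothetical choices for illustration only.

```python
import numpy as np

def resonator_gain_db(freq_hz, formant_hz, bandwidth_hz, fs):
    """Gain in dB of a two-pole resonance (centered at formant_hz) evaluated at freq_hz."""
    r = np.exp(-np.pi * bandwidth_hz / fs)               # narrow-bandwidth approximation
    c = r * np.exp(1j * 2 * np.pi * formant_hz / fs)     # pole location
    z_inv = np.exp(-1j * 2 * np.pi * freq_hz / fs)
    h = 1.0 / ((1 - c * z_inv) * (1 - np.conj(c) * z_inv))
    return 20 * np.log10(np.abs(h))

fs = 16000.0
fundamental_hz = 800.0   # assumed first harmonic of the sung tone

gain_untuned = resonator_gain_db(fundamental_hz, 500.0, 80.0, fs)  # F1 below the harmonic
gain_tuned = resonator_gain_db(fundamental_hz, 800.0, 80.0, fs)    # F1 raised to the harmonic
boost_db = gain_tuned - gain_untuned
print(f"gain at the first harmonic rises by about {boost_db:.1f} dB")
```

The first harmonic is amplified far more when the resonance is centered on it, which corresponds to the louder sound described in the example.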

Figure 3.12

Nasal Sounds

Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large-volume nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "nose"

Spectral Shaping: "mouse"

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
Broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage.
Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
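The second effect, an anti-resonance appearing as a zero of the transfer function, can be sketched in Python by adding a conjugate zero pair to a single-resonance model and comparing the two magnitude responses. The 300 Hz resonance, the 1000 Hz zero, the radius of 0.97, and the helper names are assumptions chosen only to make the dip visible; they are not measured nasal-tract values.

```python
import numpy as np

def conjugate_pair_poly(radius, freq_hz, fs):
    """Coefficients of (1 - c z^-1)(1 - c* z^-1) for c = radius * exp(j*2*pi*freq_hz/fs)."""
    theta = 2 * np.pi * freq_hz / fs
    return np.array([1.0, -2.0 * radius * np.cos(theta), radius ** 2])

def magnitude_db(numerator, denominator, freqs_hz, fs):
    """Magnitude response in dB of numerator(z^-1) / denominator(z^-1)."""
    z_inv = np.exp(-1j * 2 * np.pi * freqs_hz / fs)
    num = sum(b * z_inv ** k for k, b in enumerate(numerator))
    den = sum(a * z_inv ** k for k, a in enumerate(denominator))
    return 20 * np.log10(np.abs(num / den))

fs = 8000.0
freqs = np.linspace(0.0, 4000.0, 801)            # 5 Hz frequency grid

poles = conjugate_pair_poly(0.97, 300.0, fs)     # low-frequency resonance (pole pair)
zeros = conjugate_pair_poly(0.97, 1000.0, fs)    # anti-resonance (zero pair)

all_pole = magnitude_db(np.array([1.0]), poles, freqs, fs)
pole_zero = magnitude_db(zeros, poles, freqs, fs)

peak_hz = freqs[np.argmax(all_pole)]
dip_hz = freqs[np.argmin(pole_zero - all_pole)]
print(f"resonance near {peak_hz:.0f} Hz, anti-resonance dip near {dip_hz:.0f} Hz")
```

The difference between the two curves is the contribution of the zeros alone, so its minimum marks the frequency at which energy is absorbed.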

Plosives

Source Generation
The previous section discussed the effect of vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
Build-up of pressure behind the palate.
Abrupt release of the pressure.

Source Generation: Plosives, "drop" (figure annotated with VOT, the voice onset time, and aspiration)

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
There is no harmonic structure with these types of inputs.
The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the spectrum of the noise was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source.
There is a variety of unvoiced sounds:
Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these subcategories do not relate exclusively to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories also exist for voiced sounds. Examples:
"zebra" vs. "sheba" (fricatives)
"bin" vs. "pin" (plosives)


Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window

$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n \le N_w - 1$$

The window typically does not move one sample at a time but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window center at time $\tau$.
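The windowing and transform steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the figures: the function names, the 8 kHz sampling rate, the 1 kHz test tone, and the frame parameters are assumptions, and each frame's phase is referenced to the start of the frame rather than to absolute time n.

```python
import numpy as np

def hamming_window(length):
    """w[n] = 0.54 - 0.46 cos(2 pi n / (N_w - 1)) for 0 <= n <= N_w - 1."""
    n = np.arange(length)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (length - 1))

def stft(x, window, frame_step, fft_size):
    """Short-time Fourier transform; returns an array of shape (num_frames, fft_size // 2 + 1)."""
    length = len(window)
    starts = range(0, len(x) - length + 1, frame_step)
    # Each row is one windowed segment x[n, tau] = w[n, tau] x[n].
    frames = np.array([x[s:s + length] * window for s in starts])
    return np.fft.rfft(frames, n=fft_size, axis=1)

fs = 8000                                  # assumed sampling rate in Hz
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000.0 * t)         # 1 kHz test tone standing in for speech

window = hamming_window(160)               # 20 ms window
X = stft(x, window, frame_step=80, fft_size=512)

peak_bin = int(np.argmax(np.abs(X[0])))
peak_hz = peak_bin * fs / 512
print(X.shape, peak_hz)
```

The `frame_step` argument plays the role of the frame interval: a step of 80 samples at 8 kHz gives one spectral slice every 10 ms.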

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \tilde{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\tilde{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
Narrowband spectrogram:
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation in the previous slide leads to the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
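The contrast between the two window lengths can be explored with a short Python sketch. It is a hedged illustration on a synthetic signal, not a reproduction of Figure 3.15: the 100 Hz pulse train, the single 500 Hz decaying resonance standing in for the vocal tract, and the function names are assumptions. Only the 20 ms and 4 ms Hamming window durations come from the text.

```python
import numpy as np

def spectrogram(x, window_len, frame_step, fft_size):
    """S(omega, tau) = |X(omega, tau)|^2 using a Hamming analysis window."""
    window = np.hamming(window_len)
    starts = range(0, len(x) - window_len + 1, frame_step)
    frames = np.array([x[s:s + window_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=1)) ** 2

def energy_fluctuation(spec, skip=50):
    """Ratio of largest to smallest frame energy, ignoring the start-up frames."""
    energy = spec[skip:].sum(axis=1)
    return float(energy.max() / energy.min())

fs = 8000
pitch_period = 80                          # 10 ms, i.e. a 100 Hz pitch (assumed)
p = np.zeros(fs)
p[::pitch_period] = 1.0                    # impulse train p[n]

n = np.arange(200)
h = (0.97 ** n) * np.cos(2 * np.pi * 500.0 * n / fs)   # crude stand-in for the vocal tract
x = np.convolve(p, h)[:fs]

narrow = spectrogram(x, window_len=160, frame_step=8, fft_size=1024)  # 20 ms window
wide = spectrogram(x, window_len=32, frame_step=8, fft_size=1024)     # 4 ms window

narrow_fluct = energy_fluctuation(narrow)
wide_fluct = energy_fluctuation(wide)
print(f"frame-energy fluctuation: narrowband {narrow_fluct:.1f}x, wideband {wide_fluct:.1f}x")
```

The short window's frame energy rises and falls within each pitch period, which is the source of the vertical striations, while the long window averages over roughly two periods and fluctuates much less.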

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram; horizontal axis Time [s] (0 to 3), vertical axis Frequency [Hz] (0 to 8000)]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

MATLAB
[Figure: spectrogram; horizontal axis Time [s] (0 to 3), vertical axis Frequency [Hz] (0 to 8000)]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of the constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.
If the first two factors (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (the waveform and the spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of:
Vowels
Consonants
Diphthongs
Affricates
Semi-vowels
This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.

Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features derived from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels
Source: quasi-periodic.
Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
Articulatory differences among speakers are one cause of allophonic variations.
The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ among speakers, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
The velum is lowered and the air flows mainly through the nasal cavity.
Because the oral tract is constricted, the sound is radiated at the nostrils.
Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
Is dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
Noise is generated by turbulent airflow at some point of constriction along the oral tract.
The constriction is narrower than with vowels.
System: the location of the constriction determines which sound is produced. The constriction can be made by the tongue at the back, center, or front of the oral tract, as well as at the teeth or lips.
Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
In a simplified model of the voiced fricative, the output at the lips due to the periodic signal u[n] is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract.
The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the result in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

Example 3.4
In this simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
u[n] is modified by the oral cavity.
$x_q[n]$ can be influenced by the back cavity.
Sources of nonlinear effects (distributed sources due to traveling vortices).
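The additive model of Example 3.4 can be sketched in Python as follows. Everything numeric here is an assumption made for illustration: a Hann-shaped pulse stands in for the glottal cycle g[n], decaying cosines stand in for h[n] and h_f[n], and the resonance frequencies, decay rates, and sampling rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000
num_samples = 1600
pitch_period = 80                              # assumed pitch period P in samples

# Periodic glottal flow u[n] = g[n] * p[n] (convolution of one cycle with an impulse train).
p = np.zeros(num_samples)
p[::pitch_period] = 1.0
g = np.hanning(40)                             # stand-in for the flow over one glottal cycle
u = np.convolve(p, g)[:num_samples]

def decaying_resonance(freq_hz, decay, length=200):
    """Impulse response of a single damped resonance (illustrative vocal tract model)."""
    n = np.arange(length)
    return (decay ** n) * np.cos(2 * np.pi * freq_hz * n / fs)

h = decaying_resonance(500.0, 0.97)            # whole vocal tract, h[n]
h_f = decaying_resonance(3000.0, 0.90)         # front cavity beyond the constriction, h_f[n]

q = rng.standard_normal(num_samples)           # white-noise turbulence source q[n]

x_g = np.convolve(h, u)[:num_samples]          # periodic component h[n] * u[n]
x_q = np.convolve(h_f, q * u)[:num_samples]    # noise modulated by u[n], then filtered by h_f[n]
x = x_g + x_q                                  # the two components are assumed to add
```

Because the noise is multiplied by u[n] before filtering, the frication noise in this sketch is strongest when the glottal flow is high, so it is pitch-synchronous rather than stationary.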

Elements of a Language: Fricatives
Spectrogram:
Unvoiced fricatives are characterized by a "noisy" spectrum.
Voiced fricatives often show both noise and harmonics.
Waveform:
An unvoiced fricative contains only noise.
A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete but brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (waveform)

Elements of a Language: Plosives (spectrogram)

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator and must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.
h[n, m] and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
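The generalized convolution sum of Example 3.5 can be evaluated directly, as in the Python sketch below. The function name `time_varying_filter`, the gliding 300 Hz to 700 Hz resonance used for h[n, m], the fixed 3000 Hz front-cavity response, and the truncation of each response to 100 samples are all assumptions made for illustration.

```python
import numpy as np

def time_varying_filter(h, u):
    """Compute x[n] = sum_m h[n, m] * u[n - m], where row n of h is the response at time n."""
    num_samples, memory = h.shape
    x = np.zeros(num_samples)
    for n in range(num_samples):
        m_max = min(memory, n + 1)             # use only input samples with n - m >= 0
        lags = np.arange(m_max)
        x[n] = np.dot(h[n, :m_max], u[n - lags])
    return x

fs = 8000
num_samples = 400
memory = 100
m = np.arange(memory)

# Periodic source u[n] = g[n] * p[n] and idealized burst delta[n] at n = 0.
p = np.zeros(num_samples)
p[::80] = 1.0
g = np.hanning(40)                             # stand-in for one glottal cycle
u = np.convolve(p, g)[:num_samples]
burst = np.zeros(num_samples)
burst[0] = 1.0

# h[n, m]: a resonance whose center frequency glides from 300 Hz to 700 Hz over time n.
formant_hz = np.linspace(300.0, 700.0, num_samples)
h = (0.97 ** m)[None, :] * np.cos(2 * np.pi * formant_hz[:, None] * m[None, :] / fs)

# h_f[n, m]: front-cavity response, kept constant in n here for brevity.
h_f = np.tile((0.90 ** m) * np.cos(2 * np.pi * 3000.0 * m / fs), (num_samples, 1))

x = time_varying_filter(h, u) + time_varying_filter(h_f, burst)
```

When every row of the response array is identical, the sum reduces to ordinary convolution, which gives a convenient sanity check on the implementation.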

Elements of a Language: Transitional Speech Sounds
Diphthongs:
Have a vowel-like nature with vibrating vocal folds.
Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
Are characterized by movement from one vowel target to another.
Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds.
Glides (w as in "we" and y as in "you").
Liquids (r as in "read" and l as in "let").
Glides, compared to diphthongs, involve a greater constriction of the oral tract during the transition and a greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced) and J as in "just" (voiced).

Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch).
Amplitude/energy (loudness).
Timing (articulation rate or rhythm).
These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 19: Speech Processing

Figure 3.7

Example 3.1
A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = \frac{2\pi}{P} k$.
The first harmonic is also the fundamental frequency.
At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
The magnitude of the spectral shaping function, i.e., of the glottal flow, $|G(\omega_k)|$, is referred to as the spectral envelope of the harmonics.

Anatomy and Physiology of Speech Production
The Fourier transform of the periodic glottal waveform is characterized by harmonics.
Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.
The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
The pitch period can vary "randomly" over successive periods: pitch "jitter".
The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
These variations are (perhaps) due to:
Time-varying characteristics of the vocal tract and vocal folds.
Nonlinear behavior in the speech anatomy.
Behavior that appears random while being the result of an underlying deterministic (chaotic) system.
Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
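To make jitter and shimmer concrete, the Python sketch below generates an impulse train whose period and pulse amplitude are perturbed from cycle to cycle. The uniform perturbation model, the 2% jitter and 10% shimmer levels, and the function name `jittered_pulse_train` are assumptions for illustration; the slides do not specify a statistical model.

```python
import numpy as np

def jittered_pulse_train(num_samples, mean_period, jitter_frac, shimmer_frac, rng):
    """Impulse train with per-cycle period perturbation (jitter) and amplitude perturbation (shimmer)."""
    p = np.zeros(num_samples)
    position = 0.0
    while int(round(position)) < num_samples:
        amplitude = 1.0 + shimmer_frac * rng.uniform(-1.0, 1.0)
        p[int(round(position))] = amplitude
        # The next pulse arrives one (perturbed) pitch period later.
        position += mean_period * (1.0 + jitter_frac * rng.uniform(-1.0, 1.0))
    return p

rng = np.random.default_rng(0)
ideal = jittered_pulse_train(8000, 80.0, 0.0, 0.0, rng)      # fixed pitch, fixed amplitude
natural = jittered_pulse_train(8000, 80.0, 0.02, 0.10, rng)  # assumed 2% jitter, 10% shimmer

ideal_periods = np.diff(np.flatnonzero(ideal))
natural_periods = np.diff(np.flatnonzero(natural))
print(ideal_periods[:5], natural_periods[:5])
```

Convolving either train with a glottal pulse and a vocal tract response gives a synthetic vowel; the unperturbed train corresponds to the machine-like case mentioned above.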

Anatomy and Physiology of Speech Production
States of the vocal folds: breathing, voicing, and unvoicing.
Unvoicing: turbulence at the vocal folds (aspiration). Example: "he"; whispered sounds.
Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates and part is nearly fixed.

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of the coupling of the false vocal folds with the true vocal folds.
Diplophonic voice: secondary glottal pulses occur between the primary pulses within the closed phase (see Figure 3.9b and Figure 3.16).

April 22 2023 Veton Keumlpuska 24

Anatomy and Physiology of Speech Production

April 22 2023 Veton Keumlpuska 25

Examples of atypical voice types

Vocal Tract

• Comprises the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
• The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
• The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
• The purpose of the vocal tract is to
  • spectrally "color" the source and
  • generate new sources for sound production.

Spectral Shaping

• Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
• The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
• Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

• The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
• Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  • The frequency of the formant is $\omega_0$.
  • The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
• Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
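To make the link between pole location and formant concrete, the sketch below maps a single pole $z_0 = r_0 e^{j\omega_0}$ to a formant frequency in Hz and evaluates the magnitude response of one conjugate pole pair. It is an illustration, not the author's code. The bandwidth formula $B \approx -(f_s/\pi)\ln r_0$ is a commonly used approximation for poles near the unit circle and is an assumption here (the slides only say that bandwidth depends on $r_0$); the sampling rate and pole radii are also assumed.

```python
import cmath
import math

def pole_to_formant(r0, omega0, sample_rate_hz):
    """Map a pole z0 = r0 * exp(j * omega0) to (frequency, approximate bandwidth) in Hz."""
    frequency_hz = omega0 * sample_rate_hz / (2.0 * math.pi)
    # Assumed approximation for poles near the unit circle: B = -(fs / pi) * ln(r0).
    bandwidth_hz = -math.log(r0) * sample_rate_hz / math.pi
    return frequency_hz, bandwidth_hz

def resonator_magnitude(r0, omega0, omega):
    """|H(omega)| for one conjugate pole pair with gain A = 1."""
    c = r0 * cmath.exp(1j * omega0)
    z_inv = cmath.exp(-1j * omega)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

fs = 8000.0                                  # assumed sampling rate in Hz
omega0 = 2.0 * math.pi * 500.0 / fs          # pole angle for a 500 Hz formant
narrow = pole_to_formant(0.98, omega0, fs)   # pole close to the unit circle
wide = pole_to_formant(0.90, omega0, fs)     # pole farther from the unit circle
peak = resonator_magnitude(0.98, omega0, omega0)
off_peak = resonator_magnitude(0.98, omega0, 2.0 * omega0)
```

Moving the pole away from the unit circle leaves the formant frequency unchanged but widens the estimated bandwidth, and the magnitude response is larger at the pole angle than away from it.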

Spectral Shaping

• Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
• In general, the formant frequencies decrease as the vocal tract length increases:
  • Male speakers tend to have lower formants than female speakers.
  • Female speakers tend to have lower formants than children.
• Under the assumptions that the vocal tract is linear and time-invariant and that the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$u[n] = g[n] * p[n]$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$x[n] = h[n] * (g[n] * p[n])$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.

Example 3.2

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big]$$

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).

Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).
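The time-domain model of Example 3.2, $x[n] = h[n] * (g[n] * p[n])$, can be sketched in a few lines. This is an illustrative sketch under assumed toy values: the glottal pulse `g`, the impulse response `h`, and the period of 8 samples are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out

def impulse_train(length, period):
    """Unit sample train p[n] with one impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

g = [0.0, 0.5, 1.0, 0.5]        # toy glottal flow over one cycle (assumed)
p = impulse_train(24, 8)        # pitch period P = 8 samples (assumed)
u = convolve(g, p)[:24]         # periodic glottal source u[n] = g[n] * p[n]
h = [1.0, 0.6, 0.36]            # toy vocal tract impulse response (assumed)
x = convolve(h, u)[:24]         # vocal tract output x[n] = h[n] * u[n]
```

Because convolution is associative, the same output is obtained by first combining `h` and `g` and then convolving with the impulse train.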

Example 3.2

• The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by
  • the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and
  • the manner in which the formant tails add.
• Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
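The sparse-sampling effect can be illustrated with a toy envelope. In the sketch below, `toy_envelope` is a hypothetical single-resonance shape peaking at 500 Hz, not a real vocal tract response, and the two fundamental frequencies are assumed values. The harmonics of the higher pitch miss the peak, so the largest envelope value they observe is well below the true maximum.

```python
def toy_envelope(freq_hz):
    """Hypothetical spectral envelope with one resonance at 500 Hz (peak value 1.0)."""
    return 1.0 / (1.0 + ((freq_hz - 500.0) / 50.0) ** 2)

def sample_envelope(f0_hz, limit_hz):
    """Envelope values observed at the harmonics k * f0 up to limit_hz."""
    count = int(limit_hz // f0_hz)
    return [(k * f0_hz, toy_envelope(k * f0_hz)) for k in range(1, count + 1)]

low = sample_envelope(100.0, 4000.0)    # low pitch: a harmonic lands on 500 Hz
high = sample_envelope(300.0, 4000.0)   # high pitch: nearest harmonics are 300 and 600 Hz
best_low = max(value for _, value in low)
best_high = max(value for _, value in high)
```

With the lower pitch, one harmonic coincides with the resonance; with the higher pitch there are far fewer samples of the envelope and none of them sits on the peak.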

Spectral Shaping

• The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).
• In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
• On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

• A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
• To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

• The nasal and oral components of the vocal tract are coupled by the velum.
• When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
• The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

• Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
• The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
• Two dominant effects characterize nasalization:
  • broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage, and
  • introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

• The previous section discussed the effect of vocal tract shape on sound production.
• Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  • build-up of pressure behind the palate, followed by
  • abrupt release of the pressure.

Source Generation: Plosives, "Drop" (figure annotated with VOT and aspiration)

Fricatives

Source Generation

• Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).
• As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.
• Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

• There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
• There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.

Categorization of Sound By Source

• Voiced: speech sounds generated with a periodic glottal source.
• Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  • Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
• However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  • "zebra" vs. "sheba": fricatives
  • "bin" vs. "pin": plosives

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

• A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
• In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

• In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
• This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
• Example: the Hamming window,

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$

• The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
• The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
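A direct, unoptimized evaluation of the short-time Fourier transform at one window position can be sketched as follows. The function names, the test sinusoid, and the window length are illustrative assumptions; the window is taken to start at sample $\tau$, matching the Hamming formula as written, and a practical implementation would use an FFT rather than this explicit sum.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2 pi m / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (n_w - 1)) for m in range(n_w)]

def stft_at(x, tau, window, omega):
    """X(omega, tau) = sum_n w[n, tau] x[n] exp(-j omega n), window starting at sample tau."""
    total = 0j
    for m, w in enumerate(window):
        n = tau + m
        if 0 <= n < len(x):
            total += w * x[n] * cmath.exp(-1j * omega * n)
    return total

omega0 = 2.0 * math.pi * 0.1                       # assumed test frequency (rad/sample)
x = [math.cos(omega0 * n) for n in range(200)]     # stationary test sinusoid
window = hamming(50)
on_peak = abs(stft_at(x, 60, window, omega0))      # evaluated at the sinusoid frequency
off_peak = abs(stft_at(x, 60, window, 3.0 * omega0))
```

The magnitude is large at the frequency of the sinusoid and small away from it, which is the behavior the tapered window is meant to provide.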

Spectrographic Analysis of Speech

• The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

• $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
• For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
• This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  • Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

• The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.
• Narrowband spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide gives the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

• Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
• The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

• The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
• From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

• For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
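The role of the energy term $E[\tau]$ can be seen with a toy periodic signal. The sketch below is illustrative only: it uses a rectangular window for simplicity (an assumption; the slides use a Hamming window), a hypothetical decaying pulse repeated every 40 samples, and arbitrary window lengths and hop size.

```python
def short_time_energy(x, window, hop):
    """E[tau] = sum_n (w[n, tau] x[n])^2 for window positions tau = 0, hop, 2*hop, ..."""
    n_w = len(window)
    return [
        sum((w * x[tau + m]) ** 2 for m, w in enumerate(window))
        for tau in range(0, len(x) - n_w + 1, hop)
    ]

period = 40                                           # assumed pitch period in samples
x = [0.8 ** (n % period) for n in range(400)]         # toy decaying pulse, repeated each period
short = short_time_energy(x, [1.0] * 8, hop=8)        # window shorter than one pitch period
long = short_time_energy(x, [1.0] * 80, hop=8)        # window spanning two pitch periods
```

With the short window the energy rises and falls within each pitch period, whereas with the window spanning two full periods it stays essentially constant from frame to frame.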

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

• Shows the formants of the vocal tract in frequency.
• Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  • The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
• Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
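The two window durations quoted above can be related to the pitch period with a small calculation. The sketch below is a rough illustration: the 8 kHz sampling rate and 100 Hz pitch are assumed values, the classification thresholds simply restate the rules of thumb from the slides (at least two pitch periods for narrowband, less than one for wideband), and the main-lobe width uses the approximate null-to-null figure of $8\pi/N_w$ radians for a Hamming window.

```python
def describe_window(window_ms, f0_hz, sample_rate_hz):
    """Summarize an analysis window relative to the pitch period (rule-of-thumb sketch)."""
    n_w = round(window_ms * sample_rate_hz / 1000.0)
    pitch_period_ms = 1000.0 / f0_hz
    # Approximate Hamming main-lobe width (null to null): 8*pi/n_w rad = 4*fs/n_w Hz.
    main_lobe_hz = 4.0 * sample_rate_hz / n_w
    if window_ms >= 2.0 * pitch_period_ms:
        kind = "narrowband"
    elif window_ms < pitch_period_ms:
        kind = "wideband"
    else:
        kind = "intermediate"
    return {"samples": n_w, "main_lobe_hz": main_lobe_hz, "kind": kind}

# Assumed speaker pitch of 100 Hz (10 ms pitch period) and 8 kHz sampling rate.
narrowband = describe_window(20.0, 100.0, 8000.0)
wideband = describe_window(4.0, 100.0, 8000.0)
```

For the assumed 100 Hz pitch, the 4-ms window has a main lobe several harmonics wide, which is why the harmonic lines merge into the envelope.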

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram; axes: Time [s], 0 to 3, and Frequency [Hz], 0 to 8000]

MATLAB
[Figure: waveform (Normalized Magnitude) and spectrogram of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels; axes: Time [s], 0 to 3, and Frequency [Hz], 0 to 8000]

Categorization of Speech Sounds

• A sound source can be created with either the vocal folds or a constriction in the vocal tract.
• Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

• Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
• Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.
• If the first two factors are used to study speech sounds, this is referred to as articulatory phonetics.
• If the last two descriptors are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

• One broad classification for the English language is in terms of
  • vowels,
  • consonants,
  • diphthongs,
  • affricates, and
  • semi-vowels.
• This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

• Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
• Articulatory features (corresponding to the first two category descriptors) include
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language

• In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the "click" language of Khoisan
• The rules of a language define which phones can be strung together, and how, to form words.
  • In Italian, consonants are not allowed at the end of words.
• A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
• The articulatory properties are influenced by
  • adjacent phonemes,
  • speaking rate,
  • emphasis in speaking, and
  • the time-varying nature of the articulators.
• The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  • Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
• Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

• Source: quasi-periodic.
  • Pitch is not important for categorizing a sound in English; however, in Mandarin Chinese some sounds are interpreted based on their pitch (tonal languages).
• System: each vowel phoneme corresponds to a different vocal tract configuration.
• Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
• Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).
• In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  • Articulatory differences among speakers are one cause of allophonic variations.
  • The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

• Source: quasi-periodic airflow puffs from the vibrating vocal folds.
• System
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
• Spectrogram
  • It is dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

• There are two broad classes of fricatives: voiced and unvoiced.
• Source
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  • The constriction is narrower than with vowels.
• System
  • The location of the constriction by the tongue or lips determines which sound is produced: at the back, center, or front of the oral tract, as well as at the teeth or lips.
• Spectrogram
  • Noise-like.
  • Energy is concentrated in the higher frequencies.

Example 3.4

• A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

• Voiced fricative: a simplified model of the output at the lips due to the periodic source is

$x_g[n] = h[n] * (g[n] * p[n])$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

• The noise source component of the turbulent airflow velocity at the constriction is denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\, u[n])$

Example 3.4

• In the simplified model we assume that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$

• See Exercise 3.10 for special characteristics of $x[n]$.
• Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of nonlinear effects (distributed sources due to traveling vortices).
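The additive model of Example 3.4 can be sketched directly from its equation. The code below is an illustration under assumed toy values: the glottal pulse, both impulse responses, the noise level, and the random seed are hypothetical, and Gaussian noise from Python's `random` module stands in for the white noise $q[n]$.

```python
import random

def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out

def voiced_fricative(u, h, h_f, q):
    """x[n] = h[n] * u[n] + h_f[n] * (q[n] u[n]), truncated to len(u) samples."""
    x_g = convolve(h, u)[:len(u)]                          # periodic (voiced) component
    modulated_noise = [q_n * u_n for q_n, u_n in zip(q, u)]
    x_q = convolve(h_f, modulated_noise)[:len(u)]          # noise gated by the glottal flow
    return [a + b for a, b in zip(x_g, x_q)]

random.seed(0)                       # fixed seed so the sketch is repeatable
g = [0.0, 0.5, 1.0, 0.5]             # toy glottal flow over one cycle (assumed)
u = (g + [0.0] * 4) * 4              # glottal flow: one pulse every 8 samples
h = [1.0, 0.6, 0.36]                 # toy vocal tract impulse response (assumed)
h_f = [1.0, -0.5]                    # toy front-cavity impulse response (assumed)
q = [random.gauss(0.0, 0.1) for _ in u]
x = voiced_fricative(u, h, h_f, q)
x_no_noise = voiced_fricative(u, h, h_f, [0.0] * len(u))
```

Because the noise is multiplied by $u[n]$, the noise component is present only while the glottis is open in this model; setting the noise to zero leaves the purely periodic output.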

Elements of a Language: Fricatives

• Spectrogram
  • Unvoiced fricatives are characterized by a "noisy" spectrum, while
  • voiced fricatives often show both noise and harmonics.
• Waveform
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

• Whisper forms a class of its own under the general category of consonants.
• Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

• Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
• As with fricatives, plosives can be voiced or unvoiced.
• System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
• Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of the air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
• With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.
• Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

• A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

• Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
• In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.
• $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
• The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

• We have assumed that the two outputs can be linearly combined.
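The generalized convolution above can be sketched as a direct sum. The code below is illustrative only: the responses `fixed_response` and `gliding_response` are hypothetical stand-ins for $h[n, m]$ and $h_f[n, m]$, the system is assumed causal with a finite memory of `max_lag` samples, and the input values are arbitrary.

```python
def time_varying_filter(u, h, max_lag):
    """x[n] = sum over m of h(n, m) * u[n - m], for m = 0 .. max_lag (causal, finite memory).

    h(n, m) is the response at time n to a unit sample applied m samples earlier.
    """
    out = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n, max_lag) + 1):
            acc += h(n, m) * u[n - m]
        out.append(acc)
    return out

def fixed_response(n, m):
    """Time-invariant special case: the response depends on the lag m only."""
    return 0.5 ** m

def gliding_response(n, m):
    """Hypothetical time-varying response whose decay rate changes with time n."""
    return (0.5 + 0.1 * n) ** m

steady = time_varying_filter([1.0, 2.0, 3.0, 4.0], fixed_response, 3)
burst = time_varying_filter([1.0, 0.0, 0.0, 0.0], gliding_response, 3)
```

When the response does not depend on $n$, the sum reduces to ordinary convolution; for an impulse applied at $n = 0$, the output at time $n$ is simply the response $h(n, n)$, which is the burst term of the model.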

Elements of a Language: Transitional Speech Sounds

• Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • They do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  • They are characterized by movement from one vowel target to another.
  • Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

Elements of a Language: Transitional Speech Sounds

• Semi-vowels: two categories of vowel-like sounds,
  • glides (w as in "we" and y as in "you") and
  • liquids (r as in "read" and l as in "let").
• Glides, compared to diphthongs, have
  • greater constriction of the oral tract during the transition and
  • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

• Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
• The difference compared to fricatives is that affricates have
  • a fricative portion preceded by a complete constriction of the oral cavity,
  • formed at the same place as for the plosive.
• Examples:
  • tS as in "chew": unvoiced
  • J as in "just": voiced

Coarticulation

• The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
• Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
• Coarticulation can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

• The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • intonation (change in pitch),
  • amplitude/energy (loudness), and
  • timing (articulation rate or rhythm).
• These rules are followed to convey different
  • meaning,
  • stress, and
  • emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 20: Speech Processing

Example 3.1
  • A decrease in the pitch period P causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$.
  • The first harmonic is also the fundamental frequency.
  • At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
  • The magnitude of the spectral shaping function, i.e. the glottal flow transform $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
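As a quick numerical check of the relation omega_k = (2*pi/P)k, the sketch below lists the first few harmonic frequencies for two pitch periods. The function name, the 10 kHz sampling rate, and the two period values are assumptions chosen only for illustration; they do not come from the slides.

```python
def harmonic_frequencies_hz(pitch_period_samples, sample_rate_hz, count):
    """Return the first `count` harmonic frequencies in Hz.

    Discrete-time harmonics sit at omega_k = (2*pi/P)*k rad/sample,
    which corresponds to k * fs / P in Hz.
    """
    f0 = sample_rate_hz / pitch_period_samples
    return [k * f0 for k in range(1, count + 1)]

# Assumed values: 10 kHz sampling, pitch periods of 100 and 50 samples.
long_period = harmonic_frequencies_hz(100, 10000, 3)
short_period = harmonic_frequencies_hz(50, 10000, 3)
```

Halving the pitch period doubles both the fundamental frequency and the spacing between neighboring harmonics.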

Anatomy and Physiology of Speech Production
  • The Fourier transform of a periodic glottal waveform is characterized by harmonics.
  • Typically the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff.
      • The rolloff depends on the nature of the airflow and on speaker characteristics.
      • See Exercise 3.18 for further details.
  • The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time.
      • The pitch period can vary "randomly" over successive periods: pitch "jitter".
      • The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods: amplitude "shimmer".
  • These variations are (perhaps) due to
      • Time-varying characteristics of the vocal tract and vocal folds,
      • Nonlinear behavior in the speech anatomy, or
      • An underlying deterministic (chaotic) system whose output appears random.
  • Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound.
  • Voice character is determined by the extent of jitter and shimmer in the voice (e.g. a hoarse voice).
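One simple way to put numbers on jitter and shimmer, offered here as an illustration rather than as the measure used in the course, is the mean absolute cycle-to-cycle difference divided by the mean value. The period and amplitude lists below are made-up values for five glottal cycles, and `relative_perturbation` is a hypothetical helper name.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by their mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical measurements for five consecutive glottal cycles.
periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]        # pitch period per cycle
amplitudes = [1.00, 0.95, 1.05, 0.98, 1.02]   # peak airflow amplitude per cycle

jitter = relative_perturbation(periods_ms)
shimmer = relative_perturbation(amplitudes)
ideal = relative_perturbation([8.0] * 5)      # perfectly periodic source
```

The perfectly periodic source of Example 3.1 gives zero for both measures, while natural voicing gives small nonzero values.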

Anatomy and Physiology of Speech Production
  • States of the vocal folds
      • Breathing
      • Voicing
      • Unvoicing
          • Turbulence at the vocal folds: aspiration
          • Example: "he"; whispered sounds
  • Aspiration also occurs with voiced sounds (breathy voice).
      • Part of the vocal folds vibrates while the rest remains nearly fixed.

Anatomy and Physiology of Speech Production
  • Other forms of atypical vocal fold movement
      • Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has
          • High pitch and
          • Irregular pitch
      • Vocal fry: the vocal folds are massive and relaxed, resulting in a voice with
          • Abnormally low pitch
          • Irregular pitch
          • Secondary glottal pulses close to and overlapping the primary glottal pulse
          • These are the result of coupling of the false vocal folds with the true vocal folds
      • Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16)

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract
  • Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
  • The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators:
      • Tongue
      • Teeth
      • Lips
      • Jaw
  • The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
  • The purpose of the vocal tract is to
      • Spectrally "color" the source and
      • Generate new sources for sound production

Spectral Shaping
  • Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
  • The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
  • Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping
  • The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
  • For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
      • The frequency of the formant is $\omega_0$.
      • The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
  • Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
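To see how a single complex-conjugate pole pair produces a formant-like peak, the sketch below evaluates the magnitude response of one second-order section of the product form on the unit circle. The pole radii 0.98 and 0.80, the pole angle pi/4, and all function names are illustrative assumptions.

```python
import cmath
import math

def resonator_magnitude(omega, r0, omega0):
    """|1 / ((1 - c z^-1)(1 - c* z^-1))| at z = e^{j omega}, with c = r0 e^{j omega0}."""
    z_inv = cmath.exp(-1j * omega)
    c = r0 * cmath.exp(1j * omega0)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

def peak_frequency(r0, omega0, grid=2000):
    """Frequency in [0, pi] at which the magnitude response is largest (grid search)."""
    omegas = [math.pi * i / grid for i in range(grid + 1)]
    return max(omegas, key=lambda w: resonator_magnitude(w, r0, omega0))

omega0 = math.pi / 4                                     # assumed pole angle
sharp_peak = peak_frequency(0.98, omega0)                # pole close to the unit circle
sharp_gain = resonator_magnitude(omega0, 0.98, omega0)
broad_gain = resonator_magnitude(omega0, 0.80, omega0)   # pole farther inside
```

The peak lands close to the pole angle, and moving the pole toward the unit circle raises and narrows the resonance, which is the sense in which the bandwidth depends on r0.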

Spectral Shaping
  • Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
  • In general, the formant frequencies decrease as the vocal tract length increases.
      • Male speakers tend to have lower formants than female speakers.
      • Female speakers tend to have lower formants than children.
  • Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
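The dependence of formants on vocal tract length can be illustrated with a textbook idealization that the slides do not spell out: a uniform lossless tube, closed at the glottis and open at the lips, resonates at odd multiples of c/(4L). The sketch below assumes a sound speed of 340 m/s and uses the 17 cm adult male length quoted earlier; the 14 cm comparison length and the function name are assumptions.

```python
def uniform_tube_formants_hz(length_m, count, speed_of_sound=340.0):
    """Quarter-wavelength resonances F_k = (2k - 1) * c / (4 * L) of an idealized uniform tube."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m) for k in range(1, count + 1)]

male = uniform_tube_formants_hz(0.17, 3)      # 17 cm tract
shorter = uniform_tube_formants_hz(0.14, 3)   # assumed shorter tract for comparison
```

For the 17 cm tube the resonances fall near 500, 1500, and 2500 Hz, and every resonance moves upward as the tube gets shorter. A real vocal tract is not uniform, so actual formants deviate from these values.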

Vowels

Example 3.2
  • Consider a periodic glottal flow source of the form

    $u[n] = g[n] * p[n]$

    where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P.
  • When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

    $x[n] = h[n] * (g[n] * p[n])$

  • A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

    $x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$

  • Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.
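The source-filter chain of this example can be traced with a few lines of direct convolution. The pulse shape g, the impulse response h, and the period P = 4 below are made-up toy values, and `convolve` is a small helper written for this sketch.

```python
def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

P = 4
g = [1.0, 0.5, 0.25]                                   # hypothetical glottal pulse, shorter than P
p = [1.0 if n % P == 0 else 0.0 for n in range(12)]    # unit samples at n = 0, 4, 8
h = [1.0, -0.5]                                        # hypothetical vocal tract impulse response

u = convolve(g, p)                     # periodic glottal flow: g repeated every P samples
x = convolve(h, u)                     # vocal tract output h * (g * p)
x_alt = convolve(convolve(h, g), p)    # same output, grouping h and g first
```

Because convolution is associative, filtering the periodic source with h gives the same result as repeating the combined response h * g every P samples.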

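Example 3.2

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

  • Here $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
  • Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of both glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).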

Example 3.2

Example 3.2
  • The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by
      • The nature of the glottal flow waveform over a cycle, e.g. a gradual or abrupt closing, and
      • The manner in which formant tails add.
  • Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ sparsely. This is especially the case for high-pitched speech.
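The sparse-sampling effect can be made concrete by counting how many harmonics fall inside a frequency band around a formant. The 400 to 600 Hz band and the two fundamental frequencies below are assumed values for illustration, and `harmonics_within` is a hypothetical helper.

```python
def harmonics_within(f0_hz, low_hz, high_hz):
    """Harmonic frequencies k * f0 that fall inside [low_hz, high_hz]."""
    found = []
    k = 1
    while k * f0_hz <= high_hz:
        if k * f0_hz >= low_hz:
            found.append(k * f0_hz)
        k += 1
    return found

# Assumed formant region centered on 500 Hz, inspected over 400-600 Hz.
low_pitch = harmonics_within(100.0, 400.0, 600.0)    # 100 Hz fundamental
high_pitch = harmonics_within(350.0, 400.0, 600.0)   # 350 Hz fundamental
```

With a 100 Hz fundamental three harmonics sample the band, while with a 350 Hz fundamental none do, so the formant peak leaves no direct trace in the harmonic amplitudes.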

Spectral Shaping
  • The previous example is important because it illustrates the difference between a
      • Formant (resonance frequency of the vocal tract) and a
      • Harmonic frequency
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).
  • In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
  • On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3
  • A soprano singer often sings a tone whose first harmonic (the fundamental frequency $\omega_1$) is much higher than the first formant frequency (F1) of the vowel being sung.
  • As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
  • To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping
  • The nasal and oral components of the vocal tract are coupled by the velum.
  • When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
  • The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds.
      • Examples: "nose" and "mouse"

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping
  • Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.
  • The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
  • Two dominant effects characterize nasalization:
      • Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage
      • Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation
  • The previous section discussed the effect of the vocal tract shape on sound production.
  • Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate).
  • This closure is required when making an impulsive sound (plosive):
      • Build-up of pressure behind the closure at the palate, and
      • Abrupt release of the pressure

Source Generation: Plosives, "Drop"
[Figure labels: VOT (voice onset time), Aspiration]

Fricatives

Source Generation
  • Another sound source is created when the tongue is very close to the palate (but does not completely impede the airflow), which generates turbulence and thus a noise source (e.g. fricatives).
  • As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e. impulse or noise source).
      • There is no harmonic structure with these types of inputs.
      • The source spectrum is shaped at all frequencies by $|H(\omega)|$.
  • Note that the noise source was idealized as having a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation
  • There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
      • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
      • A vortex can be thought of as a tiny rotational airflow in the oral tract.
      • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source
  • Voiced: speech sounds generated with a periodic glottal source
  • Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
      • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin"
      • Plosives: created with an impulsive source within the oral tract. Example: "top"
      • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"
  • However, these subcategories are not exclusively unvoiced: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the same subcategories may exist for voiced sounds. Examples:
      • "zebra" vs. "sheba": fricatives
      • "bin" vs. "pin": plosives

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
  • The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
      • Example: the word "to"
  • A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
  • Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
  • The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
      • Example: Hamming window

        $w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$

  • The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
  • The short-time Fourier transform is

    $$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

    where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
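A minimal sketch of the windowing and frame-by-frame transform described here, written with the standard library only. The 1 kHz test tone, the 8 kHz sampling rate, the 64-sample window, and the 32-sample frame interval are assumptions for illustration; each frame is transformed with its time origin at the start of the frame, which changes only the phase.

```python
import cmath
import math

def hamming(length):
    """Hamming window: 0.54 - 0.46 cos(2 pi n / (Nw - 1)) for n = 0 .. Nw - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def stft_frame(x, start, window, num_bins):
    """DFT (num_bins frequencies) of the segment window[n] * x[start + n]."""
    segment = [w * x[start + n] for n, w in enumerate(window)]
    return [sum(s * cmath.exp(-2j * math.pi * k * n / num_bins) for n, s in enumerate(segment))
            for k in range(num_bins)]

fs = 8000                                                          # assumed sampling rate
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)]    # hypothetical 1 kHz tone
window = hamming(64)
frame_interval = 32                                                # window advance in samples

frames = [stft_frame(x, start, window, 64)
          for start in range(0, len(x) - len(window) + 1, frame_interval)]
peak_bin = max(range(33), key=lambda k: abs(frames[0][k]))         # search 0 .. fs/2
```

With 64 frequency bins at 8 kHz the bins are 125 Hz apart, so the 1 kHz tone is expected to peak at bin 8 in every frame.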

Spectrographic Analysis of Speech
  • The spectrogram is displayed graphically as

    $S(\omega, \tau) = |X(\omega, \tau)|^2$

  • $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position $\tau$ one could plot $S(\omega, \tau)$.
  • A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (in caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
      • Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
      • Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
  • Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

    $x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$
    $x[n, \tau] = w[n, \tau]\,(p[n] * \tilde{h}[n])$

  • Here the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$.
  • From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

    $$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

    where $\tilde{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech
  • The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.
  • Narrowband spectrogram
      • Uses a "long" window with a duration of typically at least two pitch periods.
      • Under the conditions that
          • The main lobes of the shifted window Fourier transforms are non-overlapping, and
          • The corresponding transform side lobes are negligible,
        the equation in the previous slide leads to the following approximation (Exercise 3.8), with $\tilde{H}(\omega) = H(\omega)\, G(\omega)$:

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. a plosive that is closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
      • Shortening the window widens its Fourier transform (recall the uncertainty principle).
      • The widened window transforms of neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  • For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

    $$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$

    where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

    $$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
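The energy under a short sliding window can be computed directly. In the sketch below the waveform is a made-up decaying pulse that restarts every pitch period, and the window is rectangular for simplicity (an assumption; the slides use a Hamming window elsewhere). The names and sizes are illustrative.

```python
def short_time_energy(x, window, hop):
    """Energy of window[n] * x[start + n] for each window position start = 0, hop, 2*hop, ..."""
    energies = []
    for start in range(0, len(x) - len(window) + 1, hop):
        energies.append(sum((w * x[start + n]) ** 2 for n, w in enumerate(window)))
    return energies

P = 80                                        # assumed pitch period in samples
x = [0.9 ** (n % P) for n in range(4 * P)]    # hypothetical waveform: pulse restarting every P samples
window = [1.0] * 20                           # rectangular window, shorter than one pitch period
energy = short_time_energy(x, window, hop=20)
```

The energy is largest when the window sits at the start of a pitch period, falls as the window slides through the decaying part, and repeats every P samples. This periodic fluctuation is what appears as vertical striations in a wideband spectrogram.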

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  • Shows the formants of the vocal tract in frequency, and also
  • Gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
      • The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
  • Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
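The contrast between long and short analysis windows can be reproduced on an idealized periodic excitation. The sketch below compares the spectrum magnitude on a harmonic with the magnitude halfway between two harmonics. It assumes an 8 kHz sampling rate, a 10 ms pitch period, a 40 ms long window (longer than the 20 ms of Figure 3.15, chosen so that four pulses fall under the window), and a 4 ms short window; all names are hypothetical.

```python
import cmath
import math

def hamming(length):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def segment_magnitude(x, start, window, omega):
    """Magnitude of the Fourier transform of window[n] * x[start + n] at frequency omega."""
    return abs(sum(w * x[start + n] * cmath.exp(-1j * omega * n)
                   for n, w in enumerate(window)))

P = 80                                                   # assumed pitch period: 10 ms at 8 kHz
x = [1.0 if n % P == 40 else 0.0 for n in range(800)]    # idealized periodic excitation

on_harmonic = 2 * math.pi * 5 / P        # omega_5
off_harmonic = 2 * math.pi * 5.5 / P     # halfway between omega_5 and omega_6

def on_off(window_length, start):
    """Return (magnitude on the harmonic, magnitude between harmonics)."""
    window = hamming(window_length)
    return (segment_magnitude(x, start, window, on_harmonic),
            segment_magnitude(x, start, window, off_harmonic))

narrowband = on_off(4 * P, 0)    # 40 ms window spanning four pitch periods
wideband = on_off(32, 24)        # 4 ms window containing a single pulse
```

Under the long window the magnitude on the harmonic is far larger than between harmonics, so harmonic lines are resolved. Under the short window only one pulse is visible, so the two magnitudes are equal and no harmonic structure appears.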

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram; Frequency [Hz] (0 to 8000) vs. Time [s] (0 to 3)]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) and spectrogram (Frequency [Hz] vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

MATLAB
[Figure: spectrogram; Frequency [Hz] (0 to 8000) vs. Time [s] (0 to 3)]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) and spectrogram (Frequency [Hz] vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

Categorization of Speech Sounds
  • A sound source can be created with either the vocal folds or a constriction in the vocal tract.
  • Speech sounds can also be classified from the following perspectives:
      1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
      2. The shape of the vocal tract (place and manner of articulation)
          • The place of the tongue hump along the oral tract, and
          • The degree of constriction of the hump
          • The shape is also determined by a possible connection to the nasal passage by way of the velum
      3. The time-domain waveform, which gives the pressure change with time at the lips output
      4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language
  • Phoneme: a fundamental distinctive unit of a language
      • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
      • Syllables contain one or more phonemes.
      • Words are formed from one or more syllables.
      • Phrases are concatenations of words.
  • Studying speech sounds in terms of the first two factors (source and vocal tract shape) is referred to as articulatory phonetics.
  • Studying speech sounds in terms of the last two descriptors (waveform and spectrogram) is referred to as acoustic phonetics.

Elements of a Language
  • One broad classification for the English language is in terms of
      • Vowels
      • Consonants
      • Diphthongs
      • Affricates and
      • Semi-vowels
  • This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language
  • Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
  • Articulatory features (corresponding to the first two category descriptors) include
      • Vocal fold state
          • Vibrating or
          • Open
      • Tongue position and height
          • Front, central, or back along the palate
      • Constriction
          • Partial
          • Complete
      • Velum state
          • Nasal sound
          • Not a nasal sound

Elements of a Language
  • In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
      • 11 in Polynesian
      • 141 in the "click" language of Khoisan
  • The rules of a language define which phones can be strung together, and how, to form words.
      • In Italian, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by a precise adjustment of the articulators (consider dialects and accents).
  • The articulatory properties are influenced by
      • Adjacent phonemes
      • Speaking rate
      • Emphasis in speaking, and
      • The time-varying nature of the articulators
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
      • Example: "butter", "but", and "to", where the /t/ in each word is somewhat different
  • Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
  • Vowels
      • Source: quasi-periodic
          • Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.
      • System: each vowel phoneme corresponds to a different vocal tract configuration.
      • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
      • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
  • In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
      • Articulatory differences between speakers are one cause of allophonic variation:
          • The place and degree of constriction of the tongue hump, and
          • The cross-section and length of the vocal tract
      • Therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
  • Nasals
      • Source: quasi-periodic airflow puffs from the vibrating vocal folds
      • System
          • The velum is lowered and the air flows mainly through the nasal cavity.
          • Because the oral tract is constricted, the sound is radiated at the nostrils.
          • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
      • Spectrogram
          • Is dominated by the low resonance of the large volume of the nasal cavity.
          • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
          • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
  • There are two broad classes of fricatives:
      • Voiced and
      • Unvoiced
  • Source
      • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
      • The constriction is narrower than with vowels.
  • System: the location of the constriction determines which sound is produced. The constriction can be made by the tongue at the
      • Back, center, or front of the oral tract, as well as at
      • The teeth or lips
  • Spectrogram
      • Noise-like
      • Energy is concentrated at higher frequencies

Example 3.4
  • A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

    $u[n] = g[n] * p[n]$

      • g[n] is the glottal flow over one cycle
      • p[n] is an impulse train with pitch period P
  • In a simplified model of a voiced fricative, the periodic component of the output at the lips is

    $x_g[n] = h[n] * (g[n] * p[n])$

      • h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n]
  • The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:

    $x_q[n] = h_f[n] * (q[n]\, u[n])$

Example 3.4
  • In the simplified model we assume that the results of the two airflow sources add:

    $x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$

  • See Exercise 3.10 for special characteristics of x[n].
  • Issues that have been ignored:
      • u[n] is modified by the oral cavity
      • $x_q[n]$ can be influenced by the back cavity
      • Sources of nonlinear effects (distributed sources due to traveling vortices)
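The simplified two-source model of a voiced fricative can be simulated in a few lines. Every numeric value below (pulse shape, impulse responses, period, noise seed) is a made-up toy value, and `convolve` is a helper written for this sketch.

```python
import random

def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

P, N = 8, 32
g = [1.0, 0.6, 0.2]                                   # hypothetical glottal pulse
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]    # impulse train with period P
u = convolve(g, p)[:N]                                # periodic glottal flow u[n] = g[n] * p[n]

rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(N)]           # white-noise turbulence source q[n]
h = [1.0, 0.5, 0.25]                                  # hypothetical vocal tract response h[n]
h_f = [1.0, -0.7]                                     # hypothetical front-cavity response h_f[n]

x_g = convolve(h, u)                                  # periodic component h * u
modulated_noise = [qn * un for qn, un in zip(q, u)]   # noise modulated by the glottal flow
x_q = convolve(h_f, modulated_noise)                  # noise component h_f * (q u)

length = min(len(x_g), len(x_q))
x = [x_g[n] + x_q[n] for n in range(length)]          # the two components add
```

Because the noise is multiplied by u[n], it is present only while the glottal flow is nonzero, so the noise component is itself gated at the pitch rate.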

Elements of a Language: Fricatives
  • Spectrogram
      • Unvoiced fricatives are characterized by a "noisy" spectrum, while
      • Voiced fricatives often show both noise and harmonics.
  • Waveform
      • An unvoiced fricative contains only noise.
      • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  • Whisper
      • Forms a class of its own under the general category of consonants.
      • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
  • Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
  • As with fricatives, plosives can be
      • Voiced and
      • Unvoiced
  • System: the constriction can occur at the
      • Front, center, or back of the oral tract (Figure 3.24)
  • Sequence of events:
      1. Complete closure of the oral tract and buildup of air pressure
      2. Release of air pressure and generation of turbulence over a very short time duration
      3. Generation of aspiration due to turbulence at the open vocal folds
      4. Onset of the following vowel about 40-50 ms after the burst
  • With voiced plosives, the vocal folds vibrate for the duration of all four steps.
      • While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
      • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
      • There is a much shorter delay between the burst and the voicing of the vowel onset.
  • Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
  • Example 3.5: a time-varying system model for the voiced plosive
  • A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
  • Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

    $u[n] = g[n] * p[n]$

      • g[n] is the glottal flow over one cycle
      • p[n] is an impulse train with pitch period P
  • Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during the transition from the burst to a following steady vowel.
      • This implies that the vocal tract output cannot be obtained by the convolution operator.
      • The vocal tract output must therefore be computed using the time-varying impulse response concept introduced in Chapter 2.
  • In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond the constriction, denoted by $h_f[n, m]$.
      • h[n, m] and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
  • The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

  • We have assumed that the two outputs can be linearly combined.
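The generalized convolution with a time-varying impulse response can be written as a double loop. In this sketch the response h[n, m] is a made-up one-pole decay whose rate changes with n, the front-cavity response is a fixed short decay, and the periodic source is simplified to a bare impulse train (that is, g[n] is taken to be a unit sample). All names and numbers are assumptions for illustration.

```python
def time_varying_filter(source, impulse_response, max_lag):
    """y[n] = sum over m of h[n, m] * source[n - m], with h[n, m] = impulse_response(n, m)."""
    output = []
    for n in range(len(source)):
        total = 0.0
        for m in range(min(n, max_lag) + 1):
            total += impulse_response(n, m) * source[n - m]
        output.append(total)
    return output

def h(n, m):
    """Hypothetical vocal tract response: decay rate a(n) changes as the tract opens."""
    a = min(0.5 + 0.01 * n, 0.9)
    return a ** m

def h_f(n, m):
    """Hypothetical front-cavity response (kept fixed in time for simplicity)."""
    return 0.3 ** m

N, P = 40, 10
u = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # idealized periodic source
burst = [1.0] + [0.0] * (N - 1)                      # burst idealized as a unit sample at n = 0

periodic_part = time_varying_filter(u, h, 20)
burst_part = time_varying_filter(burst, h_f, 20)
x = [a + b for a, b in zip(periodic_part, burst_part)]   # the two outputs are added
```

If h[n, m] did not depend on n, the inner sum would reduce to an ordinary convolution; the dependence on n is what lets the response follow the changing vocal tract shape.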

Elements of a Language: Transitional Speech Sounds
  • Diphthongs
      • Vowel-like in nature, with vibrating vocal folds
      • Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations
      • Characterized by movement from one vowel target to another
      • Examples: "hide" /Y/, "out" /W/, "boy" /O/, "new" /JU/

Elements of a Language: Transitional Speech Sounds
  • Semi-vowels: two categories of vowel-like sounds
      • Glides (/w/ as in "we" and /y/ as in "you") and
      • Liquids (/r/ as in "read" and /l/ as in "let")
  • Glides, compared to diphthongs, have
      • Greater constriction of the oral tract during the transition and
      • Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
  • Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have
      • A fricative portion preceded by a complete constriction of the oral cavity,
      • Formed at the same place as for the plosive
  • Examples:
      • /tS/ as in "chew": unvoiced
      • /J/ as in "just": voiced

Coarticulation

Vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: “horse” vs. “horseshoe”, “sweep” vs. “seep”.

Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude or energy (loudness), and timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 – Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of “Shop”
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives “Drop”
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives “NASA”
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 – Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 – Global Coarticulation
Page 21: Speech Processing

Anatomy and Physiology of Speech Production

The Fourier transform of the periodic glottal waveform is characterized by harmonics. Typically, the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time. The pitch period can vary “randomly” over successive periods (pitch “jitter”). The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods (amplitude “shimmer”).

These variations are (perhaps) due to:
Time-varying characteristics of the vocal tract and vocal folds,
Nonlinear behavior in the speech anatomy, or
An underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., hoarse voice).
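As a rough illustration of jitter and shimmer, the sketch below measures cycle-to-cycle variation from a list of pitch periods and a list of per-cycle peak amplitudes. The measure used (mean absolute difference between consecutive cycles, relative to the mean) and the function names are assumptions chosen for illustration; the slides do not define a specific formula.

```python
def mean_abs_cycle_variation(values):
    """Mean absolute difference between consecutive cycles, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def jitter(periods):
    """Relative cycle-to-cycle variation of the pitch period (assumed measure)."""
    return mean_abs_cycle_variation(periods)


def shimmer(amplitudes):
    """Relative cycle-to-cycle variation of the per-cycle amplitude (assumed measure)."""
    return mean_abs_cycle_variation(amplitudes)


# Hypothetical measurements: pitch periods in ms and peak amplitudes per cycle.
periods_ms = [10.0, 10.2, 9.9, 10.1, 10.0]
amplitudes = [1.00, 0.95, 1.02, 0.98, 1.01]
print("jitter:", jitter(periods_ms))
print("shimmer:", shimmer(amplitudes))
```

A perfectly monotone, fixed-amplitude source gives zero for both measures, which corresponds to the machine-like sound described above.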

Anatomy and Physiology of Speech Production

States of the vocal folds: breathing, voicing, and unvoicing.

Unvoicing: turbulence at the vocal folds produces aspiration. Example: “he” (whispered sounds).

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part remains nearly fixed.

Anatomy and Physiology of Speech Production

Other forms of atypical vocal fold movement:

Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.

Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.

Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

April 22 2023 Veton Keumlpuska 24

Anatomy and Physiology of Speech Production

April 22 2023 Veton Keumlpuska 25

Examples of atypical voice types

Vocal Tract

The vocal tract is comprised of the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.

The purpose of the vocal tract is to spectrally “color” the source and to generate new sources for sound production.

Spectral Shaping

Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants (resonance frequencies) change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants. For a time-invariant all-pole linear system model of the vocal tract, a pole at z_0 = r_0 e^{jω_0} corresponds approximately to a vocal tract formant: the frequency of the formant is ω_0, and its bandwidth depends on the distance of the pole from the unit circle (r_0).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form, where c_k and c_k^* are complex-conjugate poles:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
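To make the pole-to-formant relation concrete, the sketch below converts a single pole z_0 = r_0 e^{jω_0} into a formant frequency and an approximate bandwidth in Hz. The sampling rate `fs`, the function name, and the bandwidth formula B ≈ -ln(r_0)·fs/π (a widely used approximation for a pole close to the unit circle) are assumptions that do not appear in the slides.

```python
import cmath
import math


def pole_to_formant(pole, fs):
    """Return (frequency_hz, bandwidth_hz) for a pole z0 = r0 * exp(j*w0), 0 < r0 < 1."""
    r0 = abs(pole)
    w0 = abs(cmath.phase(pole))  # pole angle in radians per sample
    if not 0.0 < r0 < 1.0:
        raise ValueError("pole must lie strictly inside the unit circle")
    frequency_hz = w0 * fs / (2.0 * math.pi)
    # Assumed approximation: the closer r0 is to 1, the narrower the resonance.
    bandwidth_hz = -math.log(r0) * fs / math.pi
    return frequency_hz, bandwidth_hz


fs = 8000.0  # hypothetical sampling rate in Hz
z0 = cmath.rect(0.95, 2.0 * math.pi * 500.0 / fs)
frequency_hz, bandwidth_hz = pole_to_formant(z0, fs)
print(frequency_hz, bandwidth_hz)
```

Moving the pole toward the unit circle (r_0 closer to 1) leaves the formant frequency unchanged and narrows the bandwidth, which matches the statement that bandwidth depends on the distance from the unit circle.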

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases. Male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
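The dependence of formant frequencies on vocal tract length can be illustrated with a uniform lossless tube that is closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). This quarter-wavelength tube model and the speed of sound used (340 m/s) are assumptions brought in for illustration; they are not stated on these slides. The function name is hypothetical.

```python
def uniform_tube_formants(length_cm, count=3, speed_of_sound_cm_s=34000.0):
    """Resonances F_k = (2k - 1) * c / (4 * L) of an idealized uniform tube, in Hz."""
    if length_cm <= 0:
        raise ValueError("length must be positive")
    return [
        (2 * k - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
        for k in range(1, count + 1)
    ]


# 17 cm is the average adult male vocal tract length quoted earlier;
# 14 cm is a hypothetical shorter tract for comparison.
long_tract = uniform_tube_formants(17.0)
short_tract = uniform_tube_formants(14.0)
print(long_tract)
print(short_tract)
```

Every resonance of the shorter tube is higher than the corresponding resonance of the longer tube, consistent with the trend from male to female to child speakers.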

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

u[n] = g[n] * p[n]

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

x[n] = h[n] * (g[n] * p[n])

A window w[n, τ], centered at time τ, is applied to the vocal tract output to obtain the speech segment

x[n, τ] = w[n, τ]{h[n] * (g[n] * p[n])}

Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.

Example 3.2

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency (pitch).

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, …, ω_N is determined by the spectral envelope |H(ω)G(ω)|, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consisted only of the glottal contribution).
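The time-domain side of Example 3.2 can be sketched in a few lines: build the periodic source u[n] = g[n] * p[n] and pass it through h[n]. The pulse shape `g`, the truncated impulse response `h`, and the period `P` below are made-up values for illustration only, and `convolve` is a plain direct-form helper, not a library routine.

```python
def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def impulse_train(length, period):
    """Unit sample train p[n] with spacing `period`."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


P = 8                            # hypothetical pitch period in samples
g = [0.0, 0.5, 1.0, 0.5]         # hypothetical glottal flow over one cycle
h = [1.0, 0.6, 0.36, 0.216]      # hypothetical (truncated) vocal tract impulse response

p = impulse_train(64, P)
u = convolve(g, p)               # u[n] = g[n] * p[n]
x = convolve(h, u)               # x[n] = h[n] * (g[n] * p[n])

# Equivalent view: one combined pulse h[n] * g[n] repeated every P samples.
x_alt = convolve(convolve(h, g), p)
```

Because the source is periodic, the output repeats every P samples away from the ends of the train, which is what produces harmonics at multiples of 2π/P in the transform above.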

Example 3.2 (Figure 3.11)

Example 3.2

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because the source harmonics sample the spectral envelope |H(ω)G(ω)| sparsely. This is especially the case for high-pitched speech.
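The sparse-sampling effect is easy to quantify: harmonics fall at integer multiples of the fundamental, so a higher pitch leaves fewer samples of the envelope in any given band. The sketch below counts them; the 4 kHz band and the two pitch values are arbitrary choices for illustration, and the function name is hypothetical.

```python
def harmonic_frequencies(f0_hz, max_hz):
    """Harmonic frequencies k * f0 (k = 1, 2, ...) that do not exceed max_hz."""
    if f0_hz <= 0:
        raise ValueError("fundamental must be positive")
    count = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


low_pitch = harmonic_frequencies(100.0, 4000.0)    # hypothetical low-pitched voice
high_pitch = harmonic_frequencies(400.0, 4000.0)   # hypothetical high-pitched voice
print(len(low_pitch), len(high_pitch))
```

With four times the pitch, the same band contains a quarter as many harmonics, so a formant peak lying between two harmonics is more likely to be missed.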

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency.

A formant corresponds to a vocal tract pole (resonant frequency).

Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the sparsity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: “nose” and “mouse”.

April 22 2023 Veton Keumlpuska 41

Spectral Shaping Nose

April 22 2023 Veton Keumlpuska 42

Spectral Shaping Mouse

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

There are two dominant effects that characterize nasalization:
Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive): pressure builds up behind the closure at the palate and is then abruptly released.

Source Generation: Plosives (“Drop”)

[Figure: the word “drop”, annotated with VOT (voice onset time) and aspiration]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (without completely impeding the airflow), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source). There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by |H(ω)|.

Note that the noise source was idealized by assuming a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives (“NASA”)

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”.
Plosives: created with an impulsive source within the oral tract. Example: “top”.
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”.

However, these sound classes do not relate exclusively to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples: “zebra” vs. “sheba” (fricatives), “bin” vs. “pin” (plosives).

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Consider the word “to”: a single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window. This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: Hamming window

w[n, τ] = 0.54 - 0.46 cos[2π(n - τ)/(N_w - 1)] for 0 ≤ n ≤ N_w - 1

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where x[n, τ] = w[n, τ]x[n] represents the windowed speech segments as a function of the window center at time τ.
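A minimal sketch of the windowing and transform steps follows, using only the Python standard library. Several choices here are assumptions made for illustration: each window is indexed from its first sample rather than its center (this changes only the phase of X, not its magnitude), frequencies are evaluated on a uniform grid from 0 to π, and the function names and parameters (`hop`, `n_freqs`) are hypothetical.

```python
import cmath
import math


def hamming(nw):
    """Hamming window of length nw: 0.54 - 0.46 cos(2*pi*n / (nw - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (nw - 1)) for n in range(nw)]


def stft_frame(x, start, nw, n_freqs):
    """Fourier transform of one windowed segment, at n_freqs frequencies in [0, pi)."""
    w = hamming(nw)
    segment = [w[n] * x[start + n] for n in range(nw)]
    frame = []
    for k in range(n_freqs):
        omega = math.pi * k / n_freqs
        frame.append(sum(segment[n] * cmath.exp(-1j * omega * n) for n in range(nw)))
    return frame


def spectrogram(x, nw, hop, n_freqs):
    """Squared STFT magnitude for windows advanced by `hop` samples (the frame interval)."""
    frames = []
    for start in range(0, len(x) - nw + 1, hop):
        frames.append([abs(v) ** 2 for v in stft_frame(x, start, nw, n_freqs)])
    return frames


# Hypothetical test signal: a sinusoid at omega = pi/4 radians per sample.
signal = [math.cos(math.pi / 4.0 * n) for n in range(128)]
frames = spectrogram(signal, nw=64, hop=32, n_freqs=64)
peak_bins = [frame.index(max(frame)) for frame in frames]
print(peak_bins)
```

The `hop` parameter plays the role of the frame interval: a smaller hop gives a higher frame rate at the cost of more transforms.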

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

S(ω, τ) = |X(ω, τ)|^2

S(ω, τ) is a 2-D (two-dimensional) representation of the “energy density” of the signal.

For each window position τ one could plot S(ω, τ). A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:

Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.

Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n - kP]:

x[n, τ] = w[n, τ]{(p[n] * g[n]) * h[n]}
x[n, τ] = w[n, τ](p[n] * ĥ[n])

where the glottal waveform over a cycle and the vocal tract impulse response have been combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where Ĥ(ω) = H(ω)G(ω), ω_k = (2π/P)k, and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n, τ].

Narrowband Spectrogram

Uses a “long” window with a duration of typically at least two pitch periods.

Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \left| \hat{H}(\omega_k) \right|^2 \left| W(\omega - \omega_k, \tau) \right|^2$$

Spectrographic Analysis of Speech

Narrowband Spectrogram (cont.)

Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech

Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14). Shortening the window widens its Fourier transform (recall the uncertainty principle).

Widening the window transform causes the transforms centered at neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta \left| \hat{H}(\omega) \right|^2 E[\tau]$$

where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} \left| x[n, \tau] \right|^2$$

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency. It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 (next slide) compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
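The trade-off between the 20-ms and 4-ms windows can be made concrete with a small calculation. The sketch below converts window durations to sample counts and estimates the null-to-null main-lobe width of a Hamming window as roughly 4/T Hz, a standard approximation. The 16 kHz sampling rate, the 8-ms pitch period (125 Hz), and the function names are assumptions chosen for illustration; the classification thresholds follow the "at least two pitch periods" and "less than one pitch period" guidelines given above.

```python
def window_length_samples(duration_ms, fs_hz):
    """Number of samples in a window of the given duration."""
    return int(round(duration_ms * 1e-3 * fs_hz))


def hamming_mainlobe_width_hz(duration_ms):
    """Approximate null-to-null main-lobe width of a Hamming window, about 4 / T."""
    return 4.0 / (duration_ms * 1e-3)


def classify_window(duration_ms, pitch_period_ms):
    """Label a window using the narrowband/wideband guidelines from the slides."""
    if duration_ms >= 2.0 * pitch_period_ms:
        return "narrowband"
    if duration_ms < pitch_period_ms:
        return "wideband"
    return "intermediate"


fs_hz = 16000.0          # assumed sampling rate
pitch_period_ms = 8.0    # assumed pitch period (125 Hz fundamental)

for duration_ms in (20.0, 4.0):
    print(
        duration_ms,
        window_length_samples(duration_ms, fs_hz),
        hamming_mainlobe_width_hz(duration_ms),
        classify_window(duration_ms, pitch_period_ms),
    )
```

Under these assumptions the 20-ms window has a main lobe of about 200 Hz, narrow enough to separate harmonics 125 Hz apart only approximately, while the 4-ms window has a main lobe of about 1000 Hz, which spans several harmonics and therefore smears them into the envelope.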

Figure 3.15

Figure 3.16

MATLAB

[Figure: SPECTROGRAM; axes: Time [s] (0 to 3), Frequency [Hz] (0 to 8000)]

MATLAB

[Figure: waveform (Normalized Magnitude) and SPECTROGRAM (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) for the utterance “She had your dark suit in greasy wash water all year”, annotated with phone and word labels]

MATLAB

[Figure: SPECTROGRAM; axes: Time [s] (0 to 3), Frequency [Hz] (0 to 8000)]

MATLAB

[Figure: waveform (Normalized Magnitude) and SPECTROGRAM (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) for the utterance “She had your dark suit in greasy wash water all year”, annotated with phone and word labels]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of the constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.

If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the “click” language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: “butter”, “but”, and “to”, where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

Vowels

Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore so do the vocal tract formants.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.

The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction (at the back, center, or front of the oral tract, as well as at the teeth or lips) determines which sound is produced.

Spectrogram: noise-like, with energy concentrated at higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Simplified model of the voiced-fricative output at the lips due to the periodic source:

xg[n] = h[n] * (g[n] * p[n])

where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity, which has impulse response hf[n]:

xq[n] = hf[n] * (q[n]u[n])

Example 3.4

In the simplified model we assume that the results of the two airflow sources add:

x[n] = xg[n] + xq[n]
     = h[n] * u[n] + hf[n] * (q[n]u[n])

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
u[n] is modified by the oral cavity.
xq[n] can be influenced by the back cavity.
Sources of non-linear effects (distributed sources due to traveling vortices).
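The two-source model of Example 3.4 can be sketched directly from its equations. The pulse shape, the impulse responses, the period, and the use of seeded Gaussian samples for the white-noise source q[n] are all illustrative assumptions, and the function names are hypothetical.

```python
import random


def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def voiced_fricative(g, h, hf, period, length, seed=0):
    """Return (u, xg, xq, x) for x[n] = h*u + hf*(q u), truncated to `length` samples."""
    rng = random.Random(seed)
    p = [1.0 if n % period == 0 else 0.0 for n in range(length)]
    u = convolve(g, p)[:length]                          # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]     # assumed white-noise source q[n]
    modulated = [qn * un for qn, un in zip(q, u)]        # glottal flow modulates the noise
    xg = convolve(h, u)[:length]                         # periodic component
    xq = convolve(hf, modulated)[:length]                # noise component from front cavity
    x = [a + b for a, b in zip(xg, xq)]                  # the two outputs are assumed to add
    return u, xg, xq, x


# Hypothetical parameters chosen only to exercise the model.
u, xg, xq, x = voiced_fricative(
    g=[0.0, 1.0, 0.0],
    h=[1.0, 0.5, 0.25],
    hf=[1.0],
    period=8,
    length=64,
)
```

Because the noise is multiplied by u[n], the noise component is present only while the glottal flow is non-zero, so the frication noise pulses in step with the pitch period.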

Elements of a Language: Fricatives

Spectrogram: unvoiced fricatives are characterized by a “noisy” spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise. A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”. After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

April 22 2023 Veton Keumlpuska 92

Elements of a Language Plosives Waveform

Elements of a Language Plosives Spectrogram

Elements of a Language Plosives
Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$. Here $h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
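The generalized convolution in Example 3.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the systems are taken to be causal with finite memory (`max_lag`), and the function names and the toy impulse responses in the usage note are hypothetical, not part of the original example.

```python
from typing import Callable, List


def time_varying_filter(u: List[float],
                        h: Callable[[int, int], float],
                        max_lag: int) -> List[float]:
    """Compute y[n] = sum_m h(n, m) * u[n - m] for m = 0..max_lag.

    h(n, m) is the response at time n to a unit sample applied
    m samples earlier (assumed causal and finite-memory here).
    """
    y = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n, max_lag) + 1):
            acc += h(n, m) * u[n - m]
        y.append(acc)
    return y


def plosive_output(u: List[float],
                   h: Callable[[int, int], float],
                   h_f: Callable[[int, int], float],
                   max_lag: int) -> List[float]:
    """x[n] = sum_m h[n,m] u[n-m] + sum_m h_f[n,m] delta[n-m]."""
    burst = [1.0] + [0.0] * (len(u) - 1)  # idealized burst: delta[n]
    voiced_part = time_varying_filter(u, h, max_lag)
    burst_part = time_varying_filter(burst, h_f, max_lag)
    return [a + b for a, b in zip(voiced_part, burst_part)]
```

When `h(n, m)` does not depend on n, the first function reduces to ordinary convolution. For instance, `time_varying_filter([1, 2, 3], lambda n, m: [1.0, 0.5][m], 1)` applies the fixed two-tap response `[1.0, 0.5]`.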

Elements of a Language Transitional Speech Sounds
Diphthongs:
• Vowel-like in nature, with vibrating vocal folds.
• Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
• Characterized by movement from one vowel target to another.
• Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language Transitional Speech Sounds
Semi-vowels comprise two categories of vowel-like sounds:
• Glides (w as in "we" and y as in "you")
• Liquids (r as in "read" and l as in "let")

Compared to diphthongs, glides have:
• Greater constriction of the oral tract during the transition, and
• Greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. Compared to fricatives, affricates have a fricative portion that is preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
• tS as in "chew" (unvoiced)
• J as in "just" (voiced)

Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
• Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
• Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
• Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g., "horse" vs. "horseshoe", "sweep" vs. "seep".
• Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
• Intonation (change in pitch)
• Amplitude/energy (loudness)
• Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation

Anatomy and Physiology of Speech Production
States of the vocal folds:
• Breathing
• Voicing
• Unvoicing: turbulence at the vocal folds (aspiration). Example: "he"; whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates while part remains nearly fixed.

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
• Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
• Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
• Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract
• Comprises the oral cavity, from the larynx to the lips, together with the nasal passage, which is coupled to the oral tract by way of the velum.
• The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
• The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
• The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

Spectral Shaping
• Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
• The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
• Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping
• The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
• Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant. The frequency of the formant is $\omega_0$, and its bandwidth depends on the distance of the pole from the unit circle (that is, on $r_0$).
• Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
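To make the pole-to-formant relation concrete, here is a minimal Python sketch. It assumes a sampling rate `fs` (not given in the slides) and uses the common narrow-bandwidth approximation B ≈ -(fs/π) ln r, which is a standard rule of thumb rather than something stated in this material. The function name is hypothetical.

```python
import cmath
import math


def pole_to_formant(pole: complex, fs: float):
    """Map a pole z0 = r0 * exp(j*w0) to (formant frequency, bandwidth) in Hz.

    Frequency comes from the pole angle w0; bandwidth uses the
    approximation B = -(fs / pi) * ln(r0), reasonable for r0 close to 1.
    """
    r0 = abs(pole)
    w0 = abs(cmath.phase(pole))          # radians per sample
    freq_hz = w0 * fs / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * fs / math.pi
    return freq_hz, bandwidth_hz


# Two poles at the same angle: the one nearer the unit circle is narrower.
f_wide, b_wide = pole_to_formant(0.95 * cmath.exp(1j * math.pi / 8), fs=8000.0)
f_narrow, b_narrow = pole_to_formant(0.99 * cmath.exp(1j * math.pi / 8), fs=8000.0)
```

Both poles sit at the same angle and therefore give the same formant frequency; moving the pole toward the unit circle only narrows the resonance.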

Spectral Shaping
• Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
• In general, the formant frequencies decrease as the vocal tract length increases: male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.
• Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
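The dependence of formants on vocal tract length can be illustrated with the textbook uniform-tube idealization. This model is an assumption brought in for illustration, not something derived on these slides: a lossless tube of length L, closed at the glottis and open at the lips, resonates at F_k = (2k - 1)c / (4L). The speed of sound (340 m/s) and the function name are likewise assumed.

```python
def uniform_tube_formants(length_m: float,
                          num_formants: int = 3,
                          speed_of_sound: float = 340.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end.

    F_k = (2k - 1) * c / (4 * L), k = 1, 2, ...
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, num_formants + 1)]


adult_male = uniform_tube_formants(0.17)   # 17 cm, the length quoted above
shorter_tract = uniform_tube_formants(0.14)  # hypothetical shorter tract
```

With the 17 cm length quoted for an adult male, this idealization places the first three resonances near 500, 1500, and 2500 Hz, and every resonance rises when the tube is shortened.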

Vowels

Example 3.2
Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained:

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big]$$

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).
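The claim that the harmonics sample the envelope H(ω)G(ω) can be checked numerically. The sketch below is a toy illustration, not the example's own data: the pulse shape `g`, the short FIR response `h`, and the pitch period `P` are all hypothetical, and a rectangular window spanning a whole number of periods stands in for the tapered window.

```python
import cmath
import math


def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def dtft(seq, omega):
    """Fourier transform of a finite sequence at frequency omega (rad/sample)."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(seq))


P = 8                                     # pitch period in samples (assumed)
p = [1.0 if n % P == 0 else 0.0 for n in range(16 * P)]
g = [1.0, 0.6, 0.2]                       # hypothetical glottal pulse
h = [1.0, -0.5, 0.25]                     # hypothetical vocal tract response

u = convolve(g, p)                        # u[n] = g[n] * p[n]
x = convolve(h, u)                        # x[n] = h[n] * u[n]

num_periods = 8
segment = x[P:P + num_periods * P]        # rectangular window, whole periods
N = len(segment)
spectrum = [dtft(segment, 2.0 * math.pi * k / N) for k in range(N)]
envelope = convolve(h, g)                 # impulse response of H(w)G(w)
```

In this setup the spectrum is non-zero only at bins that are multiples of `num_periods` (the harmonics), and at each harmonic its magnitude equals `num_periods` times the envelope magnitude at that frequency.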

Example 3.2 [Figure 3.11]

Example 3.2
• The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by the manner in which formant tails add.
• Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
• A formant corresponds to a vocal tract pole (resonant frequency).
• Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information between harmonics can be a detriment to formant estimation. On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping
• The nasal and oral components of the vocal tract are coupled by the velum.
• When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
• The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping Nose

Spectral Shaping Mouse

Spectral Shaping
• Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
• The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
• Two dominant effects characterize nasalization:
  1. Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage.
  2. Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation
In the previous section, the effect of vocal tract shape on sound production was discussed.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive), which involves:
• Build-up of pressure behind the palate, and
• Abrupt release of the pressure.

Source Generation Plosives "Drop"
[Figure labels: VOT, Aspiration]

Fricatives

Source Generation
• Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).
• As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source). There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by $|H(\omega)|$.
• Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation Fricatives "NASA"

Source Generation
• There is another class of source generated within the vocal tract, although it is less well understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
• A vortex can be thought of as a tiny rotational airflow in the oral tract.
• There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source
• Voiced: speech sounds generated with a periodic glottal source.
• Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
  ◦ Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  ◦ Plosives: created with an impulsive source within the oral tract. Example: "top".
  ◦ Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
• However, these subcategories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
• A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".
• A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
• In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
• In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
• This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
• Example (Hamming window of length $N_w$):

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$

• The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform of the windowed segment is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
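The windowing and transform steps above can be sketched in plain Python. This is a minimal illustration under assumptions: the frequency axis is sampled at `n_freq` uniformly spaced points, the window is positioned so that it starts at sample `tau`, and the function names, hop size, and test signal are hypothetical.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window: 0.54 - 0.46*cos(2*pi*m/(n_w - 1)), m = 0..n_w-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (n_w - 1))
            for m in range(n_w)]


def stft_frame(x, tau, n_w, n_freq):
    """X(w_k, tau) for w_k = 2*pi*k/n_freq, with the window starting at tau."""
    w = hamming(n_w)
    seg = [w[m] * x[tau + m] for m in range(n_w)]     # x[n, tau] = w[n, tau] x[n]
    return [sum(seg[m] * cmath.exp(-2j * math.pi * k * (tau + m) / n_freq)
                for m in range(n_w))
            for k in range(n_freq)]


def spectrogram(x, n_w, hop, n_freq):
    """S(w_k, tau) = |X(w_k, tau)|^2 for window positions tau = 0, hop, 2*hop, ..."""
    return [[abs(v) ** 2 for v in stft_frame(x, tau, n_w, n_freq)]
            for tau in range(0, len(x) - n_w + 1, hop)]


# Toy input: a cosine sitting exactly on frequency bin 8 of 64.
signal = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
S = spectrogram(signal, n_w=64, hop=32, n_freq=64)
```

Each row of `S` is one window position; for this stationary toy input every row peaks at the same frequency bin, whereas speech would show the peaks moving from frame to frame.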

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

• $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
• For each window position $\tau$, one could plot $S(\omega, \tau)$ as a function of $\omega$. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
• This display is illustrated (in caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
  ◦ Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  ◦ Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \tilde{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\tilde{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.

Narrowband spectrogram:
• Uses a "long" window with a duration of typically at least two pitch periods.
• Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide gives the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech
Narrowband spectrogram (cont.):
• Harmonic lines are "resolved" and appear as horizontal striations in the time-frequency plane of the spectrogram.
• The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech
Wideband spectrogram:
• The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
• Shortening the window widens its Fourier transform (recall the uncertainty principle).
• Widening the window's Fourier transform causes the window transforms at neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
• From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.

Spectrographic Analysis of Speech
Wideband spectrogram (cont.):
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
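The role of the energy term can be seen with a small numerical sketch. It rests on simplifying assumptions: a rectangular window replaces the tapered one, and the toy waveform (a decaying pulse repeated every `P` samples) is hypothetical rather than real speech.

```python
def short_time_energy(x, n_w, hop=1):
    """E[tau] = sum of x[n]^2 over the n_w samples starting at tau."""
    return [sum(v * v for v in x[tau:tau + n_w])
            for tau in range(0, len(x) - n_w + 1, hop)]


P = 10                                             # pitch period (assumed)
waveform = [0.5 ** (n % P) for n in range(100)]    # decaying pulse each period

short_window = short_time_energy(waveform, n_w=4)   # shorter than one period
long_window = short_time_energy(waveform, n_w=20)   # exactly two periods
```

With the short window the energy rises and falls once per pitch period, which is the fluctuation that scales the wideband spectrogram over time. With a window spanning whole pitch periods the energy stays constant as the window slides.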

Spectrographic Analysis of Speech
Wideband spectrogram (cont.):
• Shows the formants of the vocal tract in frequency.
• Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
• Figure 3.15 (next slide) compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
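As a rough rule-of-thumb check, the two window lengths can be compared against a pitch period using the criteria stated on these slides (at least two pitch periods for narrowband, less than one for wideband). The 10-ms pitch period (a 100-Hz voice) is an assumed value, and the "intermediate" label is an addition for window lengths that fall between the two criteria.

```python
def classify_window(window_ms: float, pitch_period_ms: float) -> str:
    """Label an analysis window by how many pitch periods it spans."""
    periods = window_ms / pitch_period_ms
    if periods >= 2:
        return "narrowband"     # long window: resolves harmonics
    if periods < 1:
        return "wideband"       # short window: resolves pitch pulses in time
    return "intermediate"       # between the two criteria


pitch_period_ms = 10.0          # assumed: 100-Hz fundamental
labels = {ms: classify_window(ms, pitch_period_ms) for ms in (20.0, 4.0, 15.0)}
```

For a higher-pitched voice the pitch period is shorter, so the same 20-ms window spans more periods, while a 4-ms window may no longer be shorter than one period.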

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram. Axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000).]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels.]

MATLAB
[Figure: spectrogram. Axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000).]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels.]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Classification of speech sounds can also be done from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
• Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
• Different languages contain different phoneme sets.
• Syllables contain one or more phonemes.
• Words are formed from one or more syllables.
• Phrases are concatenations of words.
• If the first two classification perspectives (the source and the vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
• If the last two (the waveform and the spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
• Vocal fold state: vibrating or open.
• Tongue position and height: front, central, or back along the palate.
• Constriction: partial or complete.
• Velum state: nasal sound or not a nasal sound.

Elements of a Language
• In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
• The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
• A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
• The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
• The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
• Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language Vowels
Vowels:
• Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
• System: each vowel phoneme corresponds to a different vocal tract configuration.
• Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
• Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations:
• The place and degree of constriction of the tongue hump, and
• The cross-section and length of the vocal tract.
Therefore, the vocal tract formants will vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language Nasals
Nasals:
• Source: quasi-periodic airflow puffs from the vibrating vocal folds.
• System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
• Spectrogram:
  ◦ Is dominated by the low resonance of the large volume of the nasal cavity.
  ◦ The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  ◦ Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
• Source:
  ◦ For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  ◦ For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  ◦ Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
• System: the location of the constriction by the tongue (at the back, center, or front of the oral tract), as well as at the teeth or lips, determines which sound is produced.
• Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model of the voiced fricative, the output at the lips due to the periodic source is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity, with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4
In the simplified model, we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
• u[n] is modified by the oral cavity.
• x_q[n] can be influenced by the back cavity.
• Sources of non-linear effects (distributed sources due to traveling vortices).
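The simplified voiced-fricative model of Example 3.4 can be sketched as follows. All numeric choices are hypothetical (the pulse shape `g`, the short FIR responses `h` and `h_f`, the pitch period, and the Gaussian noise generator standing in for white noise); the point is only to show the structure x[n] = h * u + h_f * (q u).

```python
import random


def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, P, num_periods, h, h_f, seed=0):
    """Return (x, x_g, x_q, u) for the simplified additive model."""
    rng = random.Random(seed)
    N = P * num_periods
    p = [1.0 if n % P == 0 else 0.0 for n in range(N)]
    u = convolve(g, p)[:N]                       # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(N)]  # white-noise stand-in
    modulated = [qn * un for qn, un in zip(q, u)]
    x_g = convolve(h, u)[:N]                     # periodic component
    x_q = convolve(h_f, modulated)[:N]           # noise component
    x = [a + b for a, b in zip(x_g, x_q)]
    return x, x_g, x_q, u


x, x_g, x_q, u = voiced_fricative(g=[1.0, 0.5], P=5, num_periods=4,
                                  h=[1.0, -0.4], h_f=[1.0, 0.5])
```

Because the noise is multiplied by u[n], the noise component vanishes wherever the glottal flow is zero (apart from the short tail added by `h_f`), so the noise arrives in bursts at the pitch rate.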

Elements of a Language Fricatives
Spectrogram:
  • Unvoiced fricatives are characterized by a "noisy" spectrum.
  • Voiced fricatives often show both noise and harmonics.

Waveform:
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language Whisper
  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 3.24 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 3.23

April 22 2023 Veton Keumlpuska 91

Elements of a Language Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System:
  • The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps:
  • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

April 22 2023 Veton Keumlpuska 92

Elements of a Language Plosives Waveform

April 22 2023 Veton Keumlpuska 93

Elements of a Language Plosives Spectrogram

April 22 2023 Veton Keumlpuska 94

Elements of a Language Plosives
Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

\[ u[n] = g[n] * p[n] \]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m]. Here h[n, m] and h_f[n, m] are the responses at time n to a unit sample applied m samples earlier, at time n - m.

Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:

\[ x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m] \]
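The generalized convolution of Example 3.5, y[n] = sum over m of h[n, m] u[n - m], can be sketched directly as a double loop. This is an illustrative sketch only: the function name, the array layout (row n holds the response in effect at time n) and the random test responses are assumptions, not part of the slides. With identical rows it reduces to ordinary convolution, and for an impulse burst at n = 0 the output is h_f[n, n].

```python
import numpy as np

def time_varying_filter(u, h):
    """y[n] = sum_m h[n, m] * u[n - m], where h has shape (N, M)."""
    N, M = h.shape
    y = np.zeros(N)
    for n in range(N):
        for m in range(min(M, n + 1)):      # only past and present inputs contribute
            y[n] += h[n, m] * u[n - m]
    return y

N, M = 200, 30
rng = np.random.default_rng(1)
u = rng.standard_normal(N)

# Time-invariant special case: every row equals the same response h0.
h0 = 0.8 ** np.arange(M)
y_fixed = time_varying_filter(u, np.tile(h0, (N, 1)))

# Burst idealized as a unit impulse at n = 0, passed through a time-varying front cavity.
hf = rng.standard_normal((N, M))
burst = np.zeros(N)
burst[0] = 1.0
y_burst = time_varying_filter(burst, hf)
```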

April 22 2023 Veton Keumlpuska 95

Elements of a Language Transitional Speech Sounds
Diphthongs:
  • Have a vowel-like nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Are characterized by movement from one vowel target to another.
  • Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

April 22 2023 Veton Keumlpuska 96

Elements of a Language Transitional Speech Sounds
Semi-vowels comprise two categories of vowel-like sounds:
  • Glides (w as in "we" and y as in "you")
  • Liquids (r as in "read" and l as in "let")

Compared to diphthongs, glides have:
  • Greater constriction of the oral tract during the transition, and
  • Greater speed of the oral tract movement.

April 22 2023 Veton Keumlpuska 97

Figure 3.28 - Liquids & Glides

April 22 2023 Veton Keumlpuska 98

Elements of a Language Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: tS as in "chew" (unvoiced); J as in "just" (voiced).

April 22 2023 Veton Keumlpuska 99

Coarticulation
  • The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

April 22 2023 Veton Keumlpuska 100

Prosody The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

April 22 2023 Veton Keumlpuska 101

Figure 3.29 - Prosody

April 22 2023 Veton Keumlpuska 102

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation
Page 23: Speech Processing

April 22 2023 Veton Keumlpuska 23

Anatomy and Physiology of Speech Production
Other forms of atypical vocal fold movement:
  • Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice with a high and irregular pitch.
  • Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
  • Diplophonic voice: secondary glottal pulses occur between the primary pulses, within the closed phase (see Figure 3.9b and Figure 3.16).

April 22 2023 Veton Keumlpuska 24

Anatomy and Physiology of Speech Production

April 22 2023 Veton Keumlpuska 25

Examples of atypical voice types

April 22 2023 Veton Keumlpuska 26

Vocal Tract
  • Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.
  • The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.
  • The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².
  • The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

April 22 2023 Veton Keumlpuska 27

Spectral Shaping
  • Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
  • The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
  • Formants change with different vocal tract configurations, as depicted in Figure 3.10.

April 22 2023 Veton Keumlpuska 28

Figure 3.10

April 22 2023 Veton Keumlpuska 29

Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
  • The frequency of the formant is $\omega_0$.
  • The bandwidth depends on the distance of the pole from the unit circle (that is, on $r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

\[ H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} \]

\[ H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} \]
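To make the pole-to-formant relation concrete, the sketch below converts one pole $r_0 e^{j\omega_0}$ into a formant frequency and an approximate bandwidth in Hz, then locates the peak of the magnitude response of the corresponding conjugate pole pair. The sampling rate, the pole values and the function names are assumptions for illustration. The bandwidth formula $-\ln(r_0) f_s / \pi$ is a commonly used approximation for poles close to the unit circle; it is not stated on the slides.

```python
import numpy as np

def pole_to_formant(r0, w0, fs):
    """Map a pole r0*exp(j*w0) to (formant frequency, approximate bandwidth) in Hz."""
    freq_hz = w0 * fs / (2 * np.pi)
    bw_hz = -np.log(r0) * fs / np.pi        # approximation, reasonable for r0 close to 1
    return freq_hz, bw_hz

def resonance_magnitude(r0, w0, w):
    """|H| of one complex-conjugate pole pair, with the gain A set to 1."""
    c = r0 * np.exp(1j * w0)
    zinv = np.exp(-1j * w)
    return np.abs(1.0 / ((1 - c * zinv) * (1 - np.conj(c) * zinv)))

fs = 8000.0                                 # assumed sampling rate in Hz
r0, w0 = 0.97, 2 * np.pi * 500 / fs         # hypothetical pole near 500 Hz
freq_hz, bw_hz = pole_to_formant(r0, w0, fs)

w = np.linspace(0, np.pi, 4001)             # frequency grid with 1 Hz spacing
peak_hz = w[np.argmax(resonance_magnitude(r0, w0, w))] * fs / (2 * np.pi)
```

Moving r0 closer to 1 narrows the bandwidth and sharpens the spectral peak; the peak sits close to, but not exactly at, the pole angle because of the conjugate pole.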

April 22 2023 Veton Keumlpuska 30

Spectral Shaping
  • Formants of the vocal tract are numbered from low to high according to their location: F1, F2, and so on.
  • In general, the formant frequencies decrease as the vocal tract length increases. Male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.
  • Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
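The dependence of formant frequencies on vocal tract length can be illustrated with the textbook uniform-tube idealization, which is an assumption brought in here and not a model given on these slides: a lossless tube of length L, closed at the glottis and open at the lips, resonates at odd multiples of c / (4L). The speed of sound (340 m/s) and the shorter 14 cm tract are illustrative values.

```python
def uniform_tube_formants(length_m, n_formants=3, c=340.0):
    """Resonances (Hz) of an idealized uniform tube: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n_formants + 1)]

adult_male = uniform_tube_formants(0.17)   # the 17 cm average length quoted earlier
shorter = uniform_tube_formants(0.14)      # hypothetical shorter vocal tract
```

For the 17 cm tube this gives resonances near 500, 1500 and 2500 Hz, and every resonance of the shorter tube is higher, in line with the trend described above.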

Vowels

April 22 2023 Veton Keumlpuska 31

April 22 2023 Veton Keumlpuska 32

Example 3.2
Consider a periodic glottal flow source of the form

\[ u[n] = g[n] * p[n] \]

where g[n] is the airflow over one glottal cycle, p[n] is the unit sample train with spacing P, and * denotes convolution. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is

\[ x[n] = h[n] * (g[n] * p[n]) \]

A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment

\[ x[n, \tau] = w[n, \tau] \{ h[n] * (g[n] * p[n]) \} \]

Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.

April 22 2023 Veton Keumlpuska 33

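Example 3.2

\[ X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k} \delta(\omega - \omega_k) \Big] = \frac{1}{P} \sum_{k} H(\omega_k) G(\omega_k)\, W(\omega - \omega_k, \tau) \]

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\circledast$ denotes convolution in frequency, $\omega_k = (2\pi/P) k$, and $2\pi/P$ is the fundamental frequency (pitch).

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega) G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consisted only of the glottal contribution).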

April 22 2023 Veton Keumlpuska 34

Example 3.2

April 22 2023 Veton Keumlpuska 35

Example 3.2
The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
  • The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  • The manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega) G(\omega)|$ only sparsely. This is especially the case for high-pitched speech.
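The harmonic sampling of the spectral envelope described in Example 3.2 can be checked numerically. In the sketch below, the glottal pulse g, the single-resonance response h, the 100 Hz pitch and the four-period Hamming window are all toy assumptions chosen for illustration. The magnitude spectrum of the windowed segment is compared at the harmonics and midway between them.

```python
import numpy as np

fs = 8000
P = 80                                      # pitch period in samples (100 Hz pitch)
n_g = np.arange(60)
g = n_g * 0.9 ** n_g                        # toy smooth glottal pulse (assumption)
n_h = np.arange(100)
h = 0.95 ** n_h * np.cos(2 * np.pi * 700 / fs * n_h)   # toy resonance near 700 Hz

p = np.zeros(4000)
p[::P] = 1.0                                # unit sample train with spacing P
x = np.convolve(np.convolve(p, g), h)[:p.size]          # x[n] = h * (g * p)

Nw = 4 * P                                  # window spanning four pitch periods
segment = x[2000:2000 + Nw] * np.hamming(Nw)
X = np.abs(np.fft.rfft(segment, fs))        # zero-padded to fs points: 1 Hz per bin

f0 = fs // P                                # fundamental frequency in Hz
harm = X[f0 * np.arange(1, 11)]             # magnitude at the first ten harmonics
mid = X[f0 * np.arange(1, 11) + f0 // 2]    # magnitude midway between harmonics
```

The spectrum is large only near multiples of the fundamental, so the envelope is observed at just those frequencies. Reducing P (raising the pitch) spreads the samples further apart, which is why formant locations become harder to see for high-pitched speech.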

April 22 2023 Veton Keumlpuska 36

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

April 22 2023 Veton Keumlpuska 37

Example 3.3
A soprano singer often sings a tone whose first harmonic (the fundamental frequency, $\omega_1$) is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
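A rough numerical illustration of the formant tuning in Example 3.3: the sketch below evaluates the gain of a single two-pole resonance at the sung fundamental, once with the first formant well below the fundamental and once with the formant raised to match it. The sampling rate, pole radius, 700 Hz fundamental and 350 Hz formant are hypothetical values, and a real vocal tract has several formants rather than one.

```python
import numpy as np

def resonance_gain(f_hz, formant_hz, fs=16000.0, r=0.98):
    """Magnitude at f_hz of a conjugate pole pair with radius r placed at formant_hz."""
    w = 2 * np.pi * f_hz / fs
    w0 = 2 * np.pi * formant_hz / fs
    c = r * np.exp(1j * w0)
    zinv = np.exp(-1j * w)
    return abs(1.0 / ((1 - c * zinv) * (1 - np.conj(c) * zinv)))

f0 = 700.0                                   # hypothetical soprano fundamental in Hz
gain_low_f1 = resonance_gain(f0, 350.0)      # F1 well below the first harmonic
gain_tuned_f1 = resonance_gain(f0, 700.0)    # F1 raised to coincide with the first harmonic
```

With these values the tuned configuration passes the first harmonic with several times more gain, which is the effect the widened jaw is meant to achieve.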

April 22 2023 Veton Keumlpuska 38

Figure 3.12

Nasal Sounds

April 22 2023 Veton Keumlpuska 40

Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
  • When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
  • The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

April 22 2023 Veton Keumlpuska 41

Spectral Shaping Nose

April 22 2023 Veton Keumlpuska 42

Spectral Shaping Mouse

April 22 2023 Veton Keumlpuska 43

Spectral Shaping
  • Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
  • The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
  • Two dominant effects characterize nasalization:
      • Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
      • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

April 22 2023 Veton Keumlpuska 45

Source Generation
The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  • Build-up of pressure behind the palate, and
  • Abrupt release of pressure.

April 22 2023 Veton Keumlpuska 46

Source Generation Plosives "Drop"

[Figure: the word "drop", with the voice onset time (VOT) and aspiration intervals marked]

Fricatives

April 22 2023 Veton Keumlpuska 48

Source Generation
  • Another sound source is created when the tongue is very close to the palate but does not completely impede the airflow; the constriction generates turbulence and thus a noise source (e.g., fricatives).
  • As with the periodic glottal sound source, spectral shaping also occurs for either type of input (impulse or noise source). There is no harmonic structure with these types of input; the source spectrum is shaped at all frequencies by $|H(\omega)|$.
  • Note that the noise spectrum was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.

April 22 2023 Veton Keumlpuska 49

Source Generation Fricatives "NASA"

April 22 2023 Veton Keumlpuska 50

Source Generation
  • There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source
  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
      • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
      • Plosives: created with an impulsive source within the oral tract. Example: "top".
      • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these source types are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories also exist for voiced sounds. Examples:
  • Fricatives: "zebra" vs. "sheba"
  • Plosives: "bin" vs. "pin"

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech
  • A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example, the word "to": a single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  • The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example, the Hamming window:

\[ w[n, \tau] = 0.54 - 0.46 \cos\!\left[ \frac{2\pi (n - \tau)}{N_w - 1} \right], \quad 0 \le n - \tau \le N_w - 1 \]

  • The window does not necessarily move one sample at a time; typically it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

\[ X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n} \]

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segment as a function of the window center at time $\tau$.
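The windowing and transform steps above can be sketched with NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: the function names, the hop parameter (the frame interval) and the test tone are illustrative, and frames are indexed by their start sample, so the phase reference differs from a window-centered convention while magnitudes are unaffected. `np.hamming` implements the 0.54 - 0.46 cos(2πn/(N_w - 1)) taper.

```python
import numpy as np

def stft(x, Nw, hop, nfft=None):
    """Short-time Fourier transform: one row per window position, one column per frequency bin."""
    nfft = nfft or Nw
    w = np.hamming(Nw)                              # tapered analysis window
    starts = range(0, len(x) - Nw + 1, hop)         # window advances by one frame interval
    frames = np.stack([x[s:s + Nw] * w for s in starts])
    return np.fft.rfft(frames, nfft, axis=1)

def spectrogram(x, Nw, hop, nfft=None):
    """Squared magnitude of the STFT."""
    return np.abs(stft(x, Nw, hop, nfft)) ** 2

# Illustrative input: a 1000 Hz tone sampled at 8000 Hz.
fs = 8000
t = np.arange(2048) / fs
tone = np.sin(2 * np.pi * 1000 * t)
S = spectrogram(tone, Nw=256, hop=64)
peak_bins = S.argmax(axis=1)                        # strongest bin in each frame
```

With a 256-point window the bin spacing is fs/256 = 31.25 Hz, so the tone lands on bin 32 in every frame.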

April 22 2023 Veton Keumlpuska 56

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as

\[ S(\omega, \tau) = |X(\omega, \tau)|^2 \]

  • $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also shows the two kinds of spectrograms:
      • Narrowband: gives good spectral resolution, that is, a good view of the frequency content of sine waves with closely spaced frequencies.
      • Wideband: gives good temporal resolution, that is, a good view of the temporal content of impulses closely spaced in time.

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

\[ x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \} = w[n, \tau] ( p[n] * \hat{h}[n] ) \]

where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

\[ S(\omega, \tau) = \frac{1}{P^2} \Big| \sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \Big|^2 \]

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P) k$, and $2\pi/P$ is the fundamental frequency.

April 22 2023 Veton Keumlpuska 61

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.

Narrowband spectrogram:
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation in the previous slide yields the following approximation (Exercise 3.8):

\[ S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2 \]

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech
Narrowband spectrogram (cont.):
  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of Speech
Wideband spectrogram:
  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms centered at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of Speech
Wideband spectrogram (cont.):
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

\[ S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau] \]

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

\[ E[\tau] = \sum_{n} |x[n, \tau]|^2 \]

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of Speech
Wideband spectrogram (cont.):
  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
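The trade-off between the two window lengths can be reproduced on a synthetic voiced sound. The sketch below is illustrative only: the 200 Hz pitch, the toy single-resonance response, the hop size and the variable names are assumptions. It applies a 20-ms and a 4-ms Hamming window to a filtered pulse train, compares harmonic and between-harmonic bins of the narrowband result, and measures how much the per-frame summed spectrogram (a proxy for the energy under the window) fluctuates in each case.

```python
import numpy as np

def spectrogram(x, Nw, hop):
    """Squared-magnitude STFT with a Hamming window; rows are window positions."""
    w = np.hamming(Nw)
    starts = range(0, len(x) - Nw + 1, hop)
    frames = np.stack([x[s:s + Nw] * w for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

fs = 8000
P = 40                                       # 5 ms pitch period (200 Hz), illustrative
n = np.arange(100)
h = 0.85 ** n * np.cos(2 * np.pi * 1000 / fs * n)   # toy single-resonance response
p = np.zeros(4000)
p[::P] = 1.0
x = np.convolve(p, h)[:p.size]               # steady-state "voiced" signal

S_nb = spectrogram(x, Nw=160, hop=4)         # 20 ms window: four pitch periods
S_wb = spectrogram(x, Nw=32, hop=4)          # 4 ms window: shorter than one pitch period

# Narrowband: bins are 50 Hz apart, so harmonics of 200 Hz fall on bins 4, 8, 12, ...
nb_mean = S_nb.mean(axis=0)
harm, mid = nb_mean[4:24:4], nb_mean[6:26:4]

# Summed spectrogram per frame: nearly flat for the long window,
# rising and falling once per pitch period for the short one.
E_nb, E_wb = S_nb.sum(axis=1), S_wb.sum(axis=1)
ripple_nb = E_nb.max() / E_nb.min()
ripple_wb = E_wb.max() / E_wb.min()
```

In this sketch the long window resolves the harmonics (large harm-to-mid ratio) while its frame energy stays almost constant; the short window loses the harmonics but its frame energy swings strongly within each pitch period, which is what appears as vertical striations.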

April 22 2023 Veton Keumlpuska 66

Figure 3.15

April 22 2023 Veton Keumlpuska 67

Figure 3.16

MATLAB

April 22 2023 Veton Keumlpuska 68

[Figure: spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB

April 22 2023 Veton Keumlpuska 69

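[Figure: speech waveform (vertical axis: normalized magnitude) above its spectrogram (vertical axis: Frequency [Hz], 0 to 8000), both plotted against Time [s], 0 to 3, for the utterance "She had your dark suit in greasy wash water all year". The plot is annotated with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr), with h marking the start and end of the utterance.]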

MATLAB

April 22 2023 Veton Keumlpuska 70

[Figure: spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB

April 22 2023 Veton Keumlpuska 71

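[Figure: speech waveform (vertical axis: normalized magnitude) above its spectrogram (vertical axis: Frequency [Hz], 0 to 8000), both plotted against Time [s], 0 to 3, for the utterance "She had your dark suit in greasy wash water all year". The plot is annotated with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr), with h marking the start and end of the utterance.]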

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

April 22 2023 Veton Keumlpuska 73

Elements of a Language
  • Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

Studying speech sounds in terms of the first two descriptors (source and vocal tract shape) is referred to as articulatory phonetics.

Studying speech sounds in terms of the last two descriptors (waveform and spectrogram) is referred to as acoustic phonetics.

April 22 2023 Veton Keumlpuska 74

Elements of a Language
One broad classification for the English language is in terms of:
  • Vowels
  • Consonants
  • Diphthongs
  • Affricates
  • Semi-vowels

This classification is illustrated in Figure 3.17 in the next slide.

April 22 2023 Veton Keumlpuska 75

Figure 3.17

April 22 2023 Veton Keumlpuska 76

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open.
  • Tongue position and height: front, central, or back along the palate.
  • Constriction: partial or complete.
  • Velum state: nasal sound or not a nasal sound.

April 22 2023 Veton Keumlpuska 77

Elements of a Language
  • In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
  • The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
  • The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
  • Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels
Vowels:
  • Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations:
  • The place and degree of constriction of the tongue hump, and
  • The cross-section and length of the vocal tract.
Therefore the vocal tract formants will vary with the speaker.

April 22 2023 Veton Keumlpuska 79

Figure 3.18

April 22 2023 Veton Keumlpuska 80

Figure 3.19

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals
Nasals:
  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      • Is dominated by the low resonance of the large volume of the nasal cavity.
      • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced

Source: For unvoiced fricatives the vocal folds are relaxed and not vibrating. For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels

System: The location of the constriction made by the tongue determines which sound is produced: at the back, center or front of the oral tract, as well as at the teeth or lips

Spectrogram: Noise-like, with energy concentrated in the higher frequencies

Example 3.4: A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$, where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution

In a simplified model, the periodic component of the output at the lips is

$x_g[n] = h[n] * (g[n] * p[n])$, where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$

The noise source component is the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$

$x_q[n] = h_f[n] * (q[n]u[n])$

We assume in the simplified model that the results of the two airflow sources add

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]u[n])$

See Exercise 3.10 for special characteristics of $x[n]$

Issues that have been ignored: $u[n]$ is modified by the oral cavity; $x_q[n]$ can be influenced by the back cavity; sources of non-linear effects (distributed sources due to traveling vortices)
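The two-source model of Example 3.4 can be sketched numerically. The following is only an illustration under stated assumptions: the glottal pulse, the two impulse responses, the pitch period, the noise gain and all function names are made up for the sketch and are not taken from the slides.

```python
import math
import random

def convolve(a, b):
    """Linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def voiced_fricative(n_samples, P, g, h, h_f, noise_gain=0.3, seed=0):
    """Return (x, x_g, x_q) with x[n] = h[n]*u[n] + h_f[n]*(q[n]u[n]), truncated to n_samples."""
    rng = random.Random(seed)
    p = [1.0 if n % P == 0 else 0.0 for n in range(n_samples)]        # impulse train p[n]
    u = convolve(g, p)[:n_samples]                                     # u[n] = g[n] * p[n]
    q = [noise_gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]   # white noise q[n]
    modulated = [qn * un for qn, un in zip(q, u)]                      # glottal flow modulates the noise
    x_g = convolve(h, u)[:n_samples]                                   # periodic component
    x_q = convolve(h_f, modulated)[:n_samples]                         # noise component
    return [a + b for a, b in zip(x_g, x_q)], x_g, x_q

# Toy (assumed) glottal pulse and impulse responses.
g = [0.0, 0.4, 0.9, 1.0, 0.6, 0.1]
h = [0.95 ** n * math.cos(0.3 * n) for n in range(60)]     # vocal tract h[n]
h_f = [0.6 ** n * math.cos(2.2 * n) for n in range(20)]    # front cavity h_f[n]
x, x_g, x_q = voiced_fricative(400, P=50, g=g, h=h, h_f=h_f)
print(len(x), round(max(abs(v) for v in x_q), 3))
```

Because the noise is multiplied by $u[n]$, the noise component is nonzero only where the glottal flow is nonzero, which is why the noise in this model appears in bursts at the pitch rate.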

Elements of a Language: Fricatives

Spectrogram: Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics

Waveform: An unvoiced fricative contains only noise. A voiced fricative contains noise superimposed on a quasi-periodic signal

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced

System: The constriction can occur at the front, center or back of the oral tract (Figure 3.24)

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all 4 steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset

Figure 3.26 compares a voiced/unvoiced plosive pair

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$, where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n,m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n,m]$

$h[n,m]$ and $h_f[n,m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n-m$

The output can then be written using a generalization of the convolution operator

$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$

We have assumed that the two outputs can be linearly combined
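The generalized convolution in Example 3.5 can be sketched as follows. This is a minimal illustration, assuming a causal response ($m \ge 0$), an input that is zero before $n = 0$, and a made-up drifting resonance standing in for the moving vocal tract; the function names are hypothetical.

```python
import math

def time_varying_output(u, h_of):
    """x[n] = sum_m h[n, m] u[n - m], where h_of(n) returns [h[n, 0], h[n, 1], ...]."""
    x = []
    for n in range(len(u)):
        h_n = h_of(n)
        # Only m <= n contributes because u is taken to be zero before n = 0.
        x.append(sum(h_n[m] * u[n - m] for m in range(min(len(h_n), n + 1))))
    return x

def drifting_resonance(n, length=30):
    """Toy h[n, m]: a damped cosine whose frequency drifts with the output time n."""
    omega = 0.3 + 0.002 * n   # assumed slow drift standing in for articulator motion
    return [0.9 ** m * math.cos(omega * m) for m in range(length)]

P = 40
u = [1.0 if n % P == 0 else 0.0 for n in range(200)]   # impulse-train excitation
x = time_varying_output(u, drifting_resonance)
print([round(v, 3) for v in x[:5]])
```

When `h_of` returns the same list for every `n`, the function reduces to ordinary convolution. The burst term needs no separate loop: since $\delta[n-m]$ is nonzero only at $m = n$, the second sum collapses to $h_f[n,n]$.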

Elements of a Language: Transitional Speech Sounds

Diphthongs

Vowel-like in nature, with vibrating vocal folds

Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations

Characterized by movement from one vowel target to another

Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU)

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
Glides (w as in "we" and y as in "you")
Liquids (r as in "read" and l as in "let")

Glides have greater constriction of the oral tract during the transition, and greater speed of oral tract movement, compared to diphthongs

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations

The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive

Examples: tS as in "chew" (unvoiced), J as in "just" (voiced)

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached

Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present

Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep"

Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/energy (loudness)
Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress and emotion

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Anatomy and Physiology of Speech Production

Examples of atypical voice types

Vocal Tract

Comprised of the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips and jaw

The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm²

The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production

Spectral Shaping

Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances

Resonance frequencies of the vocal tract are called formant frequencies, or simply formants

Formants change with different vocal tract configurations, as depicted in Figure 3.10

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants

For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant. The frequency of the formant is $\omega_0$, and its bandwidth depends on the distance of the pole from the unit circle (determined by $r_0$)

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form

$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$

$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$
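As a rough numerical illustration of the pole-formant relation above, the sketch below converts a pole radius and angle into a formant frequency and an approximate 3-dB bandwidth. The sampling rate `fs`, the example pole values and the function name are assumptions for illustration, and the bandwidth formula $B \approx -\ln(r_0) f_s/\pi$ is the usual narrow-bandwidth approximation for a pole close to the unit circle, not something stated on the slides.

```python
import math

def pole_to_formant(r0, omega0, fs):
    """Map a pole z0 = r0*exp(j*omega0) to (frequency in Hz, approximate bandwidth in Hz).

    fs is the sampling rate in Hz (an assumed parameter, not from the slides).
    """
    if not 0.0 < r0 < 1.0:
        raise ValueError("a stable pole must lie inside the unit circle")
    freq_hz = omega0 * fs / (2.0 * math.pi)
    # Narrow-bandwidth approximation: poles nearer the unit circle give sharper resonances.
    bandwidth_hz = -math.log(r0) * fs / math.pi
    return freq_hz, bandwidth_hz

# Hypothetical example: a pole at radius 0.98 and angle pi/8, sampled at 8 kHz.
f, b = pole_to_formant(0.98, math.pi / 8, 8000.0)
print(f"formant ~ {f:.0f} Hz, bandwidth ~ {b:.0f} Hz")
```

Moving the same pole closer to the unit circle (for example `r0 = 0.99`) leaves the formant frequency unchanged and narrows the bandwidth.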

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases. Male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response
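To make the dependence of formants on vocal tract length concrete, the sketch below uses a textbook idealization that is not part of these slides: a uniform lossless tube, closed at the glottis and open at the lips, whose resonances are $F_k = (2k-1)c/(4L)$. The speed of sound (350 m/s) and the shorter comparison length are assumed values; real vocal tracts are not uniform, so the numbers are only indicative.

```python
def uniform_tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of an idealized uniform tube closed at one end and open at the other."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, n_formants + 1)]

# 17 cm is the average adult male length quoted on the slides; 14 cm is an assumed
# shorter tract used only for comparison.
for length in (0.17, 0.14):
    formants = uniform_tube_formants(length)
    print(length, [round(f) for f in formants])
```

Under this idealization every resonance scales as $1/L$, which is consistent with the ordering of male, female and child formants described above.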

Vowels

Example 3.2: Consider a periodic glottal flow source of the form

$u[n] = g[n] * p[n]$

where $g[n]$ is the airflow over one glottal cycle, $p[n]$ is the unit sample train with spacing $P$, and $*$ denotes convolution. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$x[n] = h[n] * (g[n] * p[n])$

A window centered at time $\tau$, $w[n,\tau]$, is applied to the vocal tract output to obtain the speech segment

$x[n,\tau] = w[n,\tau]\{h[n] * (g[n] * p[n])\}$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained

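Example 3.2

$X(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) \circledast \left[ H(\omega)G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$

$X(\omega,\tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$

where $W(\omega,\tau)$ is the Fourier transform of $w[n,\tau]$, $\circledast$ denotes convolution in frequency, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch). Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of a glottal contribution)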
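The harmonic sampling of the spectral envelope in Example 3.2 can be checked numerically. The sketch below is a simplified instance under stated assumptions: a rectangular window spanning exactly `M` pitch periods stands in for $w[n,\tau]$, and the glottal pulse `g`, the response `h`, and the values of `P` and `M` are made up. With that window the transform is zero between harmonics and equals `M` times $H(\omega_k)G(\omega_k)$ at each harmonic.

```python
import cmath
import math

def convolve(a, b):
    """Linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def dtft(seq, omega):
    """DTFT of a finite sequence evaluated at one frequency (rad/sample)."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(seq))

P = 8                      # pitch period in samples (assumed)
M = 6                      # number of pitch periods under the window (assumed)
g = [0.2, 0.7, 1.0, 0.4]   # toy glottal pulse over one cycle (assumed)
h = [0.9 ** n * math.cos(0.6 * n) for n in range(40)]   # toy vocal tract response

e = convolve(h, g)         # h[n] * g[n]
N = M * P
# Steady-state periodic output: copies of e shifted by every multiple of P.
x = [sum(e[n - r * P] for r in range(-len(e) // P - 1, M + 1)
         if 0 <= n - r * P < len(e)) for n in range(N)]

for k in range(N // 2 + 1):
    X_k = dtft(x, 2 * math.pi * k / N)
    if k % M == 0:
        omega_k = 2 * math.pi * (k // M) / P
        envelope = dtft(h, omega_k) * dtft(g, omega_k)
        print(k, round(abs(X_k), 3), round(M * abs(envelope), 3))
```

Each printed row shows a harmonic bin, the transform magnitude there, and the scaled envelope value; the last two columns agree, while all bins between harmonics are numerically zero.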

Example 3.2

Example 3.2: The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g. a gradual or abrupt closing) and by the manner in which formant tails add

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega,\tau)|$, because of sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency

A formant corresponds to a vocal tract pole (resonant frequency)

Harmonics arise from the periodicity of the glottal source (pitch)

In developing signal processing algorithms that require formants, the sparsity of spectral information can be detrimental to formant estimation

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice)

Example 3.3: A soprano singer often sings a tone whose first harmonic (fundamental frequency $\omega_1$) is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound
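The effect described in Example 3.3 can be illustrated with a single resonance. The sketch below evaluates the gain of a two-pole resonator at the sung fundamental before and after the first formant is raised to meet it. The sampling rate, pitch, formant values and bandwidth are all assumed numbers, and a one-formant model is a deliberate simplification of the vocal tract.

```python
import cmath
import math

def resonator_gain(freq_hz, formant_hz, bandwidth_hz, fs):
    """Magnitude response at freq_hz of a two-pole resonator (one conjugate pole pair)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)       # pole radius from the bandwidth
    theta = 2 * math.pi * formant_hz / fs            # pole angle from the formant frequency
    z_inv = cmath.exp(-1j * 2 * math.pi * freq_hz / fs)
    c = r * cmath.exp(1j * theta)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))

fs = 16000.0   # sampling rate (assumed)
f0 = 900.0     # sung fundamental, well above a typical F1 (assumed)
for f1 in (500.0, 900.0):   # F1 before and after widening the jaw (assumed values)
    gain = resonator_gain(f0, f1, 80.0, fs)
    print(f"F1 = {f1:.0f} Hz -> gain at the fundamental = {gain:.1f}")
```

The gain at the fundamental is markedly larger when F1 coincides with it, which is the mechanism the example attributes to the widened jaw.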

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse"

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel)

Two dominant effects characterize nasalization:
Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage
Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive): a build-up of pressure behind the palate, followed by an abrupt release of pressure

Source Generation: Plosives, "Drop" (figure labels: VOT, Aspiration)

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); this generates turbulence and thus a noise source (e.g. fricatives)

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e. impulse or noise source). There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by $|H(\omega)|$

Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality these sources will themselves have a non-flat spectral shape

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract

A vortex can be thought of as a tiny rotational airflow in the oral tract

There is evidence that sources due to vortices influence the temporal, spectral and perhaps perceptual characteristics of speech sounds

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source
Unvoiced: speech sounds not generated with a periodic glottal source

There is a variety of unvoiced sounds:
Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin"
Plosives: created with an impulsive source within the oral tract. Example: "top"
Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"

However, these source types are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives)

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window. This window $w[n,\tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum

Example: Hamming window $w[n,\tau] = 0.54 - 0.46\cos[2\pi(n-\tau)/(N_w-1)]$ for $0 \le n \le N_w-1$

The window typically does not move one sample at a time; it advances by some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal

The short-time Fourier transform is

$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n}$

where $x[n,\tau] = w[n,\tau]\,x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$
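A sliding-window transform of this kind can be sketched in a few lines. The version below is a minimal illustration, not the slides' implementation: it indexes each Hamming window from its own start rather than from a center $\tau$ (so the phase reference differs from the definition above, while the magnitudes are unaffected), it evaluates a plain DFT per frame, and the test signal, window length and hop size are assumed values.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w, indexed from the window start."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft(x, n_w, hop, n_freq):
    """Return frames[t][k] = DFT bin k of the windowed segment starting at sample t*hop."""
    w = hamming(n_w)
    frames = []
    for start in range(0, len(x) - n_w + 1, hop):
        seg = [w[n] * x[start + n] for n in range(n_w)]
        frames.append([sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_freq)
                           for n in range(n_w)) for k in range(n_freq)])
    return frames

# Toy input: a sinusoid at 1/8 of the sampling rate (assumed test signal).
x = [math.cos(2 * math.pi * n / 8) for n in range(128)]
frames = stft(x, n_w=32, hop=16, n_freq=32)
peak_bin = max(range(17), key=lambda k: abs(frames[0][k]))
print(len(frames), peak_bin)
```

The hop size plays the role of the frame interval: a smaller hop gives a higher frame rate at the cost of more transforms.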

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$S(\omega,\tau) = |X(\omega,\tau)|^2$

$S(\omega,\tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal

For each window position $\tau$ one could plot $S(\omega,\tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:

Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies

Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$

$x[n,\tau] = w[n,\tau]\{(p[n] * g[n]) * h[n]\} = w[n,\tau]\,(p[n] * \hat{h}[n])$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$S(\omega,\tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$

where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n,\tau]$

Narrowband Spectrogram

Uses a "long" window with a duration of typically at least two pitch periods

Under the conditions that the main lobes of the shifted window Fourier transforms do not overlap and that the corresponding transform side-lobes are negligible, the equation in the previous slide yields the following approximation (Exercise 3.8)

$S(\omega,\tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. plosives that are closely spaced to a succeeding voiced sound are poorly represented)

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14)

Shortening the window widens its Fourier transform (recall the uncertainty principle)

Widening of the window transform causes neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$S(\omega,\tau) \approx \beta\,|\hat{H}(\omega)|^2\, E[\tau]$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window

$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2$
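The energy under the sliding window can be computed directly from its definition. The sketch below is an illustration under assumptions: the "voiced" signal is a made-up decaying burst repeated every `P` samples, the window is a Hamming window indexed from its start, and the window length and hop are chosen so that the window is shorter than one pitch period.

```python
import math

def short_time_energy(x, n_w, hop):
    """E at each window position: sum over n of |w[n] x[start + n]|^2 (Hamming window)."""
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]
    return [sum((w[n] * x[start + n]) ** 2 for n in range(n_w))
            for start in range(0, len(x) - n_w + 1, hop)]

# Toy "voiced" signal: a decaying burst repeated every P samples (assumed values).
P = 80
x = [math.exp(-0.05 * (n % P)) * math.sin(0.9 * (n % P)) for n in range(400)]
energy = short_time_energy(x, n_w=32, hop=8)   # window shorter than one pitch period
print([round(e, 2) for e in energy[:12]])
```

Because the window is shorter than the pitch period, the energy rises and falls once per period; in a wideband spectrogram this fluctuation is what appears as vertical striations.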

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency

Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms
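The rule of thumb from the preceding slides (at least two pitch periods for narrowband, less than one for wideband) can be turned into a small helper. This is a sketch with hypothetical function names; the 100 Hz pitch and 16 kHz sampling rate are assumed values, while the 20 ms and 4 ms window lengths are the ones quoted for Figure 3.15.

```python
def window_regime(window_ms, f0_hz):
    """Classify an analysis window by its length measured in pitch periods."""
    pitch_period_ms = 1000.0 / f0_hz
    periods = window_ms / pitch_period_ms
    if periods >= 2.0:
        return "narrowband"
    if periods < 1.0:
        return "wideband"
    return "intermediate"

def window_samples(window_ms, fs_hz):
    """Window length in samples at sampling rate fs_hz."""
    return int(round(window_ms * fs_hz / 1000.0))

# Assumed pitch (100 Hz) and sampling rate (16 kHz) for the two quoted window lengths.
for ms in (20.0, 4.0):
    print(ms, window_samples(ms, 16000), window_regime(ms, 100.0))
```

The classification depends on the speaker's pitch: for a 250 Hz voice the pitch period is 4 ms, so the same 4 ms window no longer falls strictly below one period.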

Figure 3.15

Figure 3.16

MATLAB

[Figure: SPECTROGRAM; horizontal axis Time [s], 0 to 3; vertical axis Frequency [Hz], 0 to 8000]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a SPECTROGRAM (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: h; she (sh iy); had (hv ae dcl jh); your (ax-h); dark (dcl d aa r kcl k); suit (s ux tcl); in (ax-h n); greasy (gcl g r iy s ix); wash (w aa sh); water (w ao dx axr); all (ao l); year (y ih axr); h]

MATLAB

[Figure: SPECTROGRAM; horizontal axis Time [s], 0 to 3; vertical axis Frequency [Hz], 0 to 8000]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a SPECTROGRAM (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: h; she (sh iy); had (hv ae dcl jh); your (ax-h); dark (dcl d aa r kcl k); suit (s ux tcl); in (ax-h n); greasy (gcl g r iy s ix); wash (w aa sh); water (w ao dx axr); all (ao l); year (y ih axr); h]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract

Classification of speech sounds can also be done from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum
3. The time-domain waveform, which gives the pressure change with time at the lips output
4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language

Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words

If the first two perspectives (nature of the source and shape of the vocal tract) are used to study speech sounds, this is referred to as articulatory phonetics

If the last two perspectives (time-domain waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates and semi-vowels

This classification is illustrated in Figure 3.17 (next slide)

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open
Tongue position and height: front, central or back along the palate
Constriction: partial or complete
Velum state: nasal sound or not a nasal sound

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan

Rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but" and "to", where the t in each word is somewhat different

Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language

Elements of a Language: Vowels

Source: Quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch

System: Each vowel phoneme corresponds to a different vocal tract configuration

Spectrogram: The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram)

Waveform: Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next)

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with speaker

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: Quasi-periodic airflow puffs from the vibrating vocal folds

System: The velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20)

Spectrogram: Dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 3.20

April 22 2023 Veton Keumlpuska 83

Figure 3.21

April 22 2023 Veton Keumlpuska 84

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source: for unvoiced fricatives the vocal folds are relaxed and not vibrating. For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract; the constriction is narrower than with vowels.

System: the location of the constriction formed by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram: noise-like; energy is concentrated at higher frequencies.

April 22 2023 Veton Keumlpuska 85

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$x_g[n] = h[n] * (g[n] * p[n])$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise component arises from the turbulent airflow velocity source at the constriction, denoted $q[n]$ and assumed to be white noise. The glottal flow $u[n]$ modulates $q[n]$, and the modulated noise in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\, u[n])$

April 22 2023 Veton Keumlpuska 86

Example 3.4 (cont.)

In the simplified model we assume that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored: $u[n]$ is modified by the oral cavity; $x_q[n]$ can be influenced by the back cavity; and there are sources of nonlinear effects (distributed sources due to traveling vortices).
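The simplified voiced-fricative model of Example 3.4 can be sketched numerically. The following is a minimal illustration, not the author's implementation: the function names, the Gaussian white-noise generator standing in for the turbulence source q[n], and the truncation of every sequence to a fixed analysis length are all assumptions made for the example.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, h, hf, pitch_period, n_periods, seed=0):
    """Return (x, xg, xq) for the simplified voiced-fricative model.

    g  : glottal flow over one cycle
    h  : vocal tract impulse response (periodic branch)
    hf : front-cavity impulse response (noise branch)
    """
    rng = random.Random(seed)
    n_total = pitch_period * n_periods
    # p[n]: unit impulse train with spacing P
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_total)]
    # u[n] = g[n] * p[n], truncated to the analysis length
    u = convolve(g, p)[:n_total]
    # q[n]: white noise (assumed Gaussian here) for the turbulence source
    q = [rng.gauss(0.0, 1.0) for _ in range(n_total)]
    # xg[n] = h[n] * u[n]
    xg = convolve(h, u)[:n_total]
    # xq[n] = hf[n] * (q[n] u[n]): noise modulated by the glottal flow
    modulated = [qn * un for qn, un in zip(q, u)]
    xq = convolve(hf, modulated)[:n_total]
    # x[n] = xg[n] + xq[n]
    x = [a + b for a, b in zip(xg, xq)]
    return x, xg, xq
```

Because q[n] is multiplied by u[n] before filtering, the noise component is nonzero only where the glottal flow is nonzero, so in this sketch the noise appears in bursts synchronized with the pitch period.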

April 22 2023 Veton Keumlpuska 87

Elements of a Language: Fricatives

Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 3.24 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 3.23

April 22 2023 Veton Keumlpuska 91

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps. While the oral tract is closed we hear a low-frequency vibration due to propagation of vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and the delay between the burst and the voicing of the vowel onset is much shorter.

Figure 3.26 compares a voiced/unvoiced plosive pair.

April 22 2023 Veton Keumlpuska 92

Elements of a Language Plosives Waveform

April 22 2023 Veton Keumlpuska 93

Elements of a Language Plosives Spectrogram

April 22 2023 Veton Keumlpuska 94

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response $h_f[n, m]$. Here $h[n, m]$ and $h_f[n, m]$ represent the responses at time $n$ to a unit sample applied $m$ samples earlier, at time $n - m$.

Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:

$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$
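The generalized convolution of Example 3.5 can be illustrated with a short sketch. It assumes a causal system (only m >= 0 contributes) and a glottal flow that starts at n = 0; representing the time-varying responses as Python callables h(n, m) and hf(n, m) is a choice made for this example, not something specified in the slides.

```python
def time_varying_output(u, h, hf, n_out):
    """Evaluate x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) delta[n-m].

    u      : glottal flow samples, taken as zero outside 0..len(u)-1
    h, hf  : callables giving the response at time n to a unit sample
             applied m samples earlier (assumed zero for m < 0)
    n_out  : number of output samples to compute
    """
    x = []
    for n in range(n_out):
        acc = 0.0
        # periodic-source branch: generalized convolution over m
        for m in range(n + 1):
            if n - m < len(u):
                acc += h(n, m) * u[n - m]
        # burst branch: delta[n - m] is nonzero only for m = n
        acc += hf(n, n)
        x.append(acc)
    return x
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which gives a quick sanity check of the sketch.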

April 22 2023 Veton Keumlpuska 95

Elements of a Language: Transitional Speech Sounds

Diphthongs have a vowel-like nature, with vibrating vocal folds, but do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations, and are characterized by movement from one vowel target to another.

Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

April 22 2023 Veton Keumlpuska 96

Elements of a Language: Transitional Speech Sounds

Semi-vowels comprise two categories of vowel-like sounds: glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").

Compared to diphthongs, glides have a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.

April 22 2023 Veton Keumlpuska 97

Figure 3.28 - Liquids and Glides

April 22 2023 Veton Keumlpuska 98

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).

April 22 2023 Veton Keumlpuska 99

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:

Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".

Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

April 22 2023 Veton Keumlpuska 100

Prosody: The Melody of Speech

The prosody of a language is defined by the rules governing changes in speech that extend over more than one phoneme: intonation (change in pitch), amplitude or energy (loudness), and timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

April 22 2023 Veton Keumlpuska 101

Figure 3.29 - Prosody

April 22 2023 Veton Keumlpuska 102

Figure 3.30 - Global Coarticulation


April 22 2023 Veton Keumlpuska 25

Examples of atypical voice types

April 22 2023 Veton Keumlpuska 26

Vocal Tract

The vocal tract comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum.

The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: tongue, teeth, lips, and jaw.

For an adult male, the average length is 17 cm, with a cross-sectional area of up to 20 cm^2.

The purpose of the vocal tract is to spectrally "color" the source and to generate new sources for sound production.

April 22 2023 Veton Keumlpuska 27

Spectral Shaping

Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

April 22 2023 Veton Keumlpuska 28

Figure 3.10

April 22 2023 Veton Keumlpuska 29

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant. The frequency of the formant is $\omega_0$, and its bandwidth depends on the distance of the pole from the unit circle (determined by $r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})}$

$H(z) = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^* z^{-1})}$
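The relation between a pole $z_0 = r_0 e^{j\omega_0}$ and its formant can be made concrete with a small sketch. The formant frequency follows directly from the pole angle; for the bandwidth, the sketch uses the common approximation $B \approx -\ln(r_0)\, f_s / \pi$ (in Hz), which is an assumption added here and holds reasonably well only for poles close to the unit circle. The function names and the sampling rate `fs` are illustrative.

```python
import cmath
import math


def pole_to_formant(pole, fs):
    """Map a z-plane pole to (formant frequency, approximate bandwidth) in Hz."""
    r0 = abs(pole)
    omega0 = cmath.phase(pole)                # pole angle in rad/sample
    freq_hz = omega0 * fs / (2.0 * math.pi)   # formant frequency
    # Approximate 3-dB bandwidth; assumed valid for r0 close to 1.
    bandwidth_hz = -math.log(r0) * fs / math.pi
    return freq_hz, bandwidth_hz


def formant_to_pole(freq_hz, bandwidth_hz, fs):
    """Inverse mapping under the same bandwidth approximation."""
    r0 = math.exp(-math.pi * bandwidth_hz / fs)
    omega0 = 2.0 * math.pi * freq_hz / fs
    return cmath.rect(r0, omega0)
```

For example, `formant_to_pole(500.0, 80.0, 8000.0)` gives a pole just inside the unit circle; moving the pole closer to the circle narrows the bandwidth, and moving it away widens it.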

April 22 2023 Veton Keumlpuska 30

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases. Male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
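The inverse relation between vocal tract length and formant frequency noted above can be illustrated with the classical uniform-tube idealization, which is not part of these slides and is introduced here as an assumption: a lossless tube of length L, closed at the glottis and open at the lips, resonates at odd multiples of c/(4L). The speed of sound used below (340 m/s) and the function name are also illustrative choices.

```python
def uniform_tube_formants(length_cm, n_formants=3, c_cm_per_s=34000.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end.

    F_k = (2k - 1) * c / (4 L), k = 1, 2, ...
    This is a textbook idealization, not a model of a real vocal tract.
    """
    return [(2 * k - 1) * c_cm_per_s / (4.0 * length_cm)
            for k in range(1, n_formants + 1)]
```

With the 17 cm adult-male length, this idealization places the first three resonances near 500, 1500, and 2500 Hz; a shorter tube gives proportionally higher values, in line with the ordering of male, female, and child formants.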

Vowels

April 22 2023 Veton Keumlpuska 31

April 22 2023 Veton Keumlpuska 32

Example 3.2

Consider a periodic glottal flow source of the form

$u[n] = g[n] * p[n]$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is

$x[n] = h[n] * (g[n] * p[n])$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment, its frequency-domain representation, is obtained.

April 22 2023 Veton Keumlpuska 33

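Example 3.2 (cont.)

$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$

$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$

where $\circledast$ denotes convolution in frequency, $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency, or pitch.

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).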
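The way the harmonics $\omega_k$ sample the spectral envelope can be sketched as follows. For simplicity the envelope is reduced to a single conjugate pole pair and the glottal contribution is set to $G(\omega) = 1$; both simplifications, along with the function names and default values, are assumptions made for illustration.

```python
import cmath
import math


def harmonic_frequencies(pitch_period, fs):
    """Harmonics f_k = k * fs / P in Hz, for k = 1 .. floor(P / 2)."""
    return [k * fs / pitch_period for k in range(1, pitch_period // 2 + 1)]


def resonance_magnitude(freq_hz, formant_hz, r0, fs):
    """|H| of one conjugate pole pair with radius r0 at the formant angle."""
    omega = 2.0 * math.pi * freq_hz / fs
    c = cmath.rect(r0, 2.0 * math.pi * formant_hz / fs)
    z_inv = cmath.exp(-1j * omega)
    return 1.0 / abs((1 - c * z_inv) * (1 - c.conjugate() * z_inv))


def sampled_envelope(pitch_period, fs, formant_hz=500.0, r0=0.97):
    """Envelope values at the harmonics, as (frequency, magnitude) pairs."""
    return [(f, resonance_magnitude(f, formant_hz, r0, fs))
            for f in harmonic_frequencies(pitch_period, fs)]
```

At fs = 8000 Hz, a pitch period of 80 samples (100 Hz pitch) places a harmonic exactly on the 500 Hz resonance, whereas a period of 20 samples (400 Hz pitch) samples the envelope only at 400 Hz, 800 Hz, and so on, so the resonance peak falls between harmonics.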

April 22 2023 Veton Keumlpuska 34

Example 3.2 (cont.)

April 22 2023 Veton Keumlpuska 35

Example 3.2 (cont.)

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the spectral envelope $|H(\omega)G(\omega)|$ is sampled only sparsely by the source harmonics. This is especially the case for high-pitched speech.

April 22 2023 Veton Keumlpuska 36

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency.

A formant corresponds to a vocal tract pole (resonant frequency).

Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the sparsity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

April 22 2023 Veton Keumlpuska 37

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

April 22 2023 Veton Keumlpuska 38

Figure 3.12

Nasal Sounds

April 22 2023 Veton Keumlpuska 40

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: the initial consonants of "nose" and "mouse".

April 22 2023 Veton Keumlpuska 41

Spectral Shaping Nose

April 22 2023 Veton Keumlpuska 42

Spectral Shaping Mouse

April 22 2023 Veton Keumlpuska 43

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:

1. Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
2. Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

April 22 2023 Veton Keumlpuska 45

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive): pressure builds up behind the closure and is then abruptly released.

April 22 2023 Veton Keumlpuska 46

Source Generation: Plosives, "Drop" (figure annotations: VOT (voice onset time) and aspiration)

Fricatives

April 22 2023 Veton Keumlpuska 48

Source Generation

Another sound source is created when the tongue is very close to the palate (but does not completely impede the airflow), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source). There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

April 22 2023 Veton Keumlpuska 49

Source Generation: Fricatives, "NASA"

April 22 2023 Veton Keumlpuska 50

Source Generation

There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:

Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".

Plosives: sounds created with an impulsive source within the oral tract. Example: "top".

Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these subcategories are not exclusively unvoiced: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives).

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Consider the word "to": a single Fourier transform of the entire acoustic signal of the word cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window. This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example (Hamming window): $w[n, \tau] = 0.54 - 0.46 \cos[2\pi(n - \tau)/(N_w - 1)]$ for $0 \le n \le N_w - 1$

The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
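A direct, unoptimized evaluation of the short-time Fourier transform is sketched below. It assumes that the window for frame $\tau$ covers samples $\tau$ through $\tau + N_w - 1$ and that the transform is sampled at n_freq uniformly spaced frequencies; the function names and the hop parameter are illustrative and not taken from the slides. A practical implementation would use an FFT instead of the explicit sum.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w (n_w >= 2), local index 0 .. n_w - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1))
            for n in range(n_w)]


def stft_frame(x, tau, n_w, n_freq):
    """Samples of X(omega, tau) at omega_i = 2*pi*i/n_freq, i = 0 .. n_freq-1.

    Assumes the window covers samples tau .. tau + n_w - 1 of x.
    """
    w = hamming(n_w)
    frame = []
    for i in range(n_freq):
        omega = 2.0 * math.pi * i / n_freq
        acc = 0j
        for k in range(n_w):
            n = tau + k
            if 0 <= n < len(x):
                # x[n, tau] = w[n, tau] x[n], weighted by exp(-j omega n)
                acc += w[k] * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame


def stft(x, n_w, hop, n_freq):
    """STFT frames for window positions tau = 0, hop, 2*hop, ..."""
    return [stft_frame(x, tau, n_w, n_freq)
            for tau in range(0, len(x) - n_w + 1, hop)]
```

The hop parameter plays the role of the frame interval: a hop of one sample gives the densest time sampling, while larger hops trade time detail for less computation.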

April 22 2023 Veton Keumlpuska 56

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$S(\omega, \tau) = |X(\omega, \tau)|^2$

$S(\omega, \tau)$ is a two-dimensional representation of the "energy density" of the signal.

For each window position $\tau$ one could plot $S(\omega, \tau)$ as a function of $\omega$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:

Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.

Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

April 22 2023 Veton Keumlpuska 61

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram

Uses a "long" window with a duration of typically at least two pitch periods. Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):

$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives closely followed by a voiced sound are poorly represented).

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14). Shortening the window widens its Fourier transform (recall the uncertainty principle). The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,

$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

The wideband spectrogram shows the formants of the vocal tract in frequency. It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
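The trade-off between the 20-ms and 4-ms windows can be quantified with a rough sketch. It relies on the standard approximation that the main lobe of a Hamming window of N_w samples is about 4*fs/N_w Hz wide (null to null); the sampling rate, the pitch value, and the function name are assumptions chosen for illustration.

```python
def window_summary(duration_ms, fs, f0_hz):
    """Rough resolution figures for a Hamming analysis window.

    Returns a dict with the window length in samples, the approximate
    main-lobe width in Hz, the number of pitch periods spanned, and
    whether adjacent harmonic main lobes are non-overlapping.
    """
    n_w = int(round(duration_ms * fs / 1000.0))
    main_lobe_hz = 4.0 * fs / n_w          # approximate null-to-null width
    pitch_periods = duration_ms * f0_hz / 1000.0
    return {
        "n_w": n_w,
        "main_lobe_hz": main_lobe_hz,
        "pitch_periods": pitch_periods,
        # harmonics are spaced f0 apart; lobes separate if no wider than f0
        "harmonics_resolved": main_lobe_hz <= f0_hz,
    }
```

With fs = 10000 Hz and a 200 Hz pitch, `window_summary(20, 10000, 200)` spans four pitch periods with a main lobe of about 200 Hz, so the harmonics are just separated, while `window_summary(4, 10000, 200)` spans less than one pitch period with a main lobe of about 1000 Hz, so neighboring harmonics merge into the envelope.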

April 22 2023 Veton Keumlpuska 66

Figure 3.15

April 22 2023 Veton Keumlpuska 67

Figure 3.16

MATLAB

April 22 2023 Veton Keumlpuska 68

[Figure: MATLAB spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

April 22 2023 Veton Keumlpuska 69

[Figure: MATLAB plot of the waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

Phone labels shown: she: sh iy; had: hv ae dcl jh; your: ax-h; dark: dcl d aa r kcl k; suit: s ux tcl; in: ax-h n; greasy: gcl g r iy s ix; wash: w aa sh; water: w ao dx axr; all: ao l; year: y ih axr.

MATLAB

April 22 2023 Veton Keumlpuska 70

[Figure: MATLAB spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

April 22 2023 Veton Keumlpuska 71

[Figure: MATLAB plot of the waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

Phone labels shown: she: sh iy; had: hv ae dcl jh; your: ax-h; dark: dcl d aa r kcl k; suit: s ux tcl; in: ax-h n; greasy: gcl g r iy s ix; wash: w aa sh; water: w ao dx axr; all: ao l; year: y ih axr.

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract, i.e. the place and manner of articulation: the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by the possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

April 22 2023 Veton Keumlpuska 73

Elements of a Language

Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.

If the first two factors are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors are used to study speech sounds, the study is referred to as acoustic phonetics.

April 22 2023 Veton Keumlpuska 74

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

April 22 2023 Veton Keumlpuska 75

Figure 3.17

April 22 2023 Veton Keumlpuska 76

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open.

Tongue position and height: front, central, or back along the palate.

Constriction: partial or complete.

Velum state: nasal sound or not a nasal sound.

April 22 2023 Veton Keumlpuska 77

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language: Nasals
Nasals
  Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  System
    The velum is lowered and the air flows mainly through the nasal cavity.
    Because the oral tract is constricted, the sound is radiated at the nostrils.
    Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  Spectrogram
    Dominated by the low resonance of the large volume of the nasal cavity.
    The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
    Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.

Source
  Unvoiced fricatives: the vocal folds are relaxed and not vibrating.
  Voiced fricatives: the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.

System
  The location of the constriction, made by the tongue or lips, determines which sound is produced:
    Back, center, or front of the oral tract
    The teeth or lips

Spectrogram
  Noise-like
  Energy is concentrated in the higher frequencies

Example 3.4
A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

In a simplified model of a voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity, with impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

The simplified model assumes that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  u[n] is modified by the oral cavity.
  x_q[n] can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
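The additive model of Example 3.4 can be tried out numerically. The sketch below is only an illustration under stated assumptions: the glottal pulse g, the responses h and h_f, the pitch period, and the seeded Gaussian noise q are all hypothetical values chosen to keep the output easy to inspect, and the helper names are not from the slides.

```python
import random


def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def voiced_fricative(g, pitch_period, n_periods, h, h_f, q):
    """Return (u, x_g, x_q, x), each truncated to pitch_period * n_periods samples."""
    n_total = pitch_period * n_periods
    # Impulse train p[n] with spacing P.
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_total)]
    u = convolve(g, p)[:n_total]                        # u[n] = g[n] * p[n]
    x_g = convolve(h, u)[:n_total]                      # periodic component h * u
    modulated = [q[n] * u[n] for n in range(n_total)]   # noise gated by the glottal flow
    x_q = convolve(h_f, modulated)[:n_total]            # noise component h_f * (q u)
    x = [a + b for a, b in zip(x_g, x_q)]               # the two components add
    return u, x_g, x_q, x


rng = random.Random(0)
g = [1.0, 0.5]          # hypothetical glottal pulse over one cycle
h = [1.0, 0.6, 0.36]    # hypothetical vocal tract impulse response
h_f = [1.0]             # hypothetical front-cavity response (pass-through)
q = [rng.gauss(0.0, 1.0) for _ in range(8)]  # white-noise assumption from the example
u, x_g, x_q, x = voiced_fricative(g, pitch_period=4, n_periods=2, h=h, h_f=h_f, q=q)
```

With the pass-through h_f used here, x_q is nonzero only where the glottal flow u is nonzero, which shows the modulation of the noise by the glottal flow directly.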

Elements of a Language: Fricatives
Spectrogram
  Unvoiced fricatives are characterized by a "noisy" spectrum.
  Voiced fricatives often show both noise and harmonics.
Waveform
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.

System
  The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short time
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst

Voiced plosives
  The vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration caused by vocal fold vibrations propagating through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike in the unvoiced plosive, there is little or no aspiration.
  The delay between the burst and the onset of voicing in the vowel is much shorter.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction, with impulse response h_f[n, m]. Both h[n, m] and h_f[n, m] represent the response at time n to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
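The generalized convolution sum of Example 3.5 can be sketched in a few lines. This is a minimal illustration, not the slides' implementation: it assumes a causal input (u[n] = 0 for n < 0), represents h[n, m] and h_f[n, m] as hypothetical Python callables, and truncates both responses beyond an assumed maximum lag max_lag.

```python
def time_varying_output(u, h, h_f, max_lag):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) delta[n - m], n = 0..len(u)-1.

    h and h_f are callables giving the response at time n to a unit sample
    applied m samples earlier; both are treated as zero for m > max_lag.
    """
    x = []
    for n in range(len(u)):
        lags = range(min(n, max_lag) + 1)            # causal input: u[n] = 0 for n < 0
        periodic = sum(h(n, m) * u[n - m] for m in lags)
        burst = h_f(n, n) if n <= max_lag else 0.0   # delta[n - m] is nonzero only at m = n
        x.append(periodic + burst)
    return x


# A response that does not depend on n reduces to ordinary convolution.
fixed = time_varying_output([1.0, 0.0, 0.0, 0.0],
                            lambda n, m: 0.5 ** m, lambda n, m: 0.0, max_lag=3)

# A gain that grows with n is time-varying and cannot be written as a convolution.
varying = time_varying_output([1.0] * 4,
                              lambda n, m: float(n + 1) if m == 0 else 0.0,
                              lambda n, m: 0.0, max_lag=3)

# With no periodic source, only the burst response h_f(n, n) remains.
burst_only = time_varying_output([0.0] * 4,
                                 lambda n, m: 0.0, lambda n, m: 0.5 ** m, max_lag=3)
```

The first call confirms that the time-invariant case is recovered when h[n, m] depends only on m.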

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU)

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds
  Glides (w as in "we" and y as in "you")
  Liquids (r as in "read" and l as in "let")

Glides, compared to diphthongs, have
  Greater constriction of the oral tract during the transition
  Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference from fricatives is that affricates have
  A fricative portion preceded by a complete constriction of the oral cavity
  The constriction formed at the same place as for the plosive
Examples
  tS as in "chew" - unvoiced
  J as in "just" - voiced

Coarticulation
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local - the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global - the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different
  Meaning
  Stress
  Emotion

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 26: Speech Processing

Vocal Tract
Comprises the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators:
  Tongue
  Teeth
  Lips
  Jaw
The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm^2.
The purpose of the vocal tract is to
  Spectrally "color" the source
  Generate new sources for sound production

Spectral Shaping
Under certain conditions, the relation between a glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.
Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping
The peaks of the spectrum of the vocal tract response correspond approximately to its formants.
For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
  The frequency of the formant is $\omega_0$.
  The bandwidth depends on the distance of the pole from the unit circle ($r_0$).
Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$
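To see how conjugate pole pairs produce formant peaks, the product form can be evaluated on the unit circle. The sketch below is illustrative only: the sampling rate, formant frequencies, and bandwidths are hypothetical, and the mapping from bandwidth B to pole radius, r = exp(-pi B / fs), is a commonly used approximation rather than something stated in the slides.

```python
import cmath
import math


def pole_from_formant(formant_hz, bandwidth_hz, fs_hz):
    """Pole c = r * exp(j * w0); r = exp(-pi * B / fs) is an assumed approximation."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)
    w0 = 2.0 * math.pi * formant_hz / fs_hz
    return r * cmath.exp(1j * w0)


def all_pole_response(freq_hz, poles, fs_hz, gain=1.0):
    """H(z) = A / prod_k (1 - c_k z^-1)(1 - c_k^* z^-1), evaluated on the unit circle."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs_hz)
    denom = 1.0 + 0.0j
    for c in poles:
        denom *= (1.0 - c * z_inv) * (1.0 - c.conjugate() * z_inv)
    return gain / denom


fs = 8000.0  # hypothetical sampling rate in Hz
poles = [pole_from_formant(500.0, 80.0, fs), pole_from_formant(1500.0, 100.0, fs)]
magnitudes = {f: abs(all_pole_response(f, poles, fs))
              for f in (300.0, 500.0, 1000.0, 1500.0, 2500.0)}
```

The magnitude is largest near 500 Hz and 1500 Hz, the angles of the two pole pairs; moving a pole closer to the unit circle (smaller bandwidth) sharpens its peak.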

Spectral Shaping
The formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
In general, the formant frequencies decrease as the vocal tract length increases:
  Male speakers tend to have lower formants than female speakers.
  Female speakers tend to have lower formants than children.
Under the vocal tract's linearity and time-invariance assumption, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
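The dependence of formant frequencies on vocal tract length can be illustrated with an idealized model that is an assumption here, not part of the slide: a lossless uniform tube, closed at the glottis and open at the lips, whose resonances are the odd quarter-wavelength frequencies F_k = (2k - 1) c / (4 L). The speed of sound (34000 cm/s) and the shorter 14 cm tract are hypothetical values for the comparison.

```python
def uniform_tube_formants(length_cm, count=3, speed_of_sound_cm_s=34000.0):
    """Resonances (Hz) of a lossless uniform tube closed at one end and open at the other."""
    return [(2 * k - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
            for k in range(1, count + 1)]


adult_male = uniform_tube_formants(17.0)     # 17 cm average length quoted for an adult male
shorter_tract = uniform_tube_formants(14.0)  # hypothetical shorter vocal tract
```

Under these assumptions the 17 cm tube resonates at about 500, 1500, and 2500 Hz, and every formant of the shorter tube is higher, consistent with the trend described for male, female, and child speakers.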

Vowels


Example 3.2
Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is

$$x[n] = h[n] * (g[n] * p[n])$$

A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment, i.e. its frequency-domain representation, is obtained.

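Example 3.2

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \left[ \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, \delta(\omega - \omega_k) \right]$$

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\circledast$ denotes periodic convolution over $\omega$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).
Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).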
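The harmonic sampling of the spectral envelope can be checked numerically without a window. The sketch below is a simplified, assumed setup: it uses a length-N discrete Fourier transform over an integer number of pitch periods, circular convolution in place of linear convolution, and a hypothetical four-sample glottal pulse. In that setting the spectrum of g[n] * p[n] is nonzero only at the harmonic bins k = 0, N/P, 2N/P, ..., where it equals (N/P) times the envelope G.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences."""
    n_pts = len(a)
    return [sum(a[m] * b[(n - m) % n_pts] for m in range(n_pts)) for n in range(n_pts)]


N, P = 64, 8                                         # hypothetical record length and pitch period
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # impulse train with spacing P
g = [1.0, 0.8, 0.5, 0.2] + [0.0] * (N - 4)           # hypothetical glottal pulse
u = circular_convolve(g, p)                          # periodic glottal flow

U = dft(u)
G = dft(g)
harmonic_bins = [k for k in range(N) if abs(U[k]) > 1e-9]
```

Raising the pitch (a smaller P) spreads the harmonic bins further apart, so the envelope is sampled more sparsely; this is the effect the example describes for high-pitched speech.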

Example 3.2 (Figure 3.11)

Example 3.2
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by
  The nature of the glottal flow waveform over a cycle, e.g. a gradual or abrupt closing
  The manner in which formant tails add
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because the source harmonics sample the spectral envelope |H(ω)G(ω)| sparsely. This is especially the case for high-pitched speech.

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
  A formant corresponds to a vocal tract pole (resonant frequency).
  Harmonics arise from the periodicity of the glottal source (pitch).
In signal processing algorithms that require formants, the sparsity of spectral information can be a detriment to formant estimation.
On the other hand, the spectral sampling at the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping
The nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping
Because the nasal cavity, unlike the oral tract, is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g. a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage
  Introduction of anti-resonances (i.e. zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation
The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract, with the tongue pressing against the palate. This closure is required when making an impulsive sound (a plosive):
  Build-up of pressure behind the palate
  Abrupt release of pressure

Source Generation: Plosives, "Drop"
[Figure annotations: VOT, Aspiration]

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g. fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e. an impulse or a noise source):
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by |H(ω)|.
Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation
Another class of source is generated within the vocal tract, but it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives - sounds generated from the friction of moving air against an oral tract constriction. Example: "thin".
  Plosives - created with an impulsive source within the oral tract. Example: "top".
  Whispers - a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
However, these sound classes do not relate exclusively to an unvoiced source: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples:
  "zebra" vs. "sheba" - fricatives
  "bin" vs. "pin" - plosives

Categorization of Sound by Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
Examples 3.1 and 3.2 introduced the concept of a sliding (analysis) window.
The window w[n, τ] is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n - \tau \le N_w - 1$$

The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window position τ.
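The windowing and transform steps above can be sketched directly from the definitions. This is a minimal standard-library illustration under assumptions not in the slides: τ is taken as the first sample of the window, frequencies are sampled at ω_k = 2πk/n_freq, the frame interval is the parameter hop, and the test signal is a hypothetical cosine centered on bin 8. A practical implementation would use an FFT instead of the direct sum.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2 pi n / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft_frame(x, tau, window, n_freq):
    """X(omega_k, tau) at omega_k = 2*pi*k/n_freq for a window starting at sample tau."""
    segment = [(tau + i, window[i] * x[tau + i]) for i in range(len(window))]
    return [sum(value * cmath.exp(-2j * math.pi * k * n / n_freq) for n, value in segment)
            for k in range(n_freq)]


def spectrogram(x, window, hop, n_freq):
    """S(omega_k, tau) = |X(omega_k, tau)|^2 for tau = 0, hop, 2*hop, ..."""
    frames = []
    for tau in range(0, len(x) - len(window) + 1, hop):
        frames.append([abs(v) ** 2 for v in stft_frame(x, tau, window, n_freq)])
    return frames


# Hypothetical test signal: a cosine whose frequency falls exactly on bin 8 of 64.
x = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(192)]
S = spectrogram(x, hamming(64), hop=32, n_freq=64)
peak_bins = [max(range(33), key=lambda k: frame[k]) for frame in S]
```

Each frame peaks at the bin of the cosine; shortening the window widens that peak, which is the trade-off between the narrowband and wideband displays.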

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

S(ω, τ) is a two-dimensional representation of the "energy density" of the signal.
For each window position τ, one could plot S(ω, τ). A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  Narrowband - gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband - gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n, τ].
Narrowband spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that the main lobes of the shifted window Fourier transforms do not overlap and that the corresponding transform side lobes are negligible, the following approximation follows from the spectrogram expression for voiced speech (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The wider window transforms of neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
  From a temporal perspective, since the window is shorter than a pitch period, it "sees" essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
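The window-length rules of thumb from these slides can be put into a small helper. The sketch below is illustrative and rests on assumptions: the pitch (125 Hz) and sampling rate (16 kHz) are hypothetical, the "intermediate" label for windows between one and two pitch periods is introduced here, and the main-lobe figure uses the textbook approximation of 8π/N_w radians (4/T Hz for a window of T seconds) for a Hamming window.

```python
def window_report(window_ms, f0_hz, fs_hz):
    """Classify an analysis window by how many pitch periods it spans."""
    pitch_period_ms = 1000.0 / f0_hz
    periods_spanned = window_ms / pitch_period_ms
    if periods_spanned >= 2.0:
        kind = "narrowband"      # at least two pitch periods
    elif periods_spanned < 1.0:
        kind = "wideband"        # less than one pitch period
    else:
        kind = "intermediate"    # label assumed here, not from the slides
    return {
        "samples": round(window_ms * fs_hz / 1000.0),
        "periods_spanned": periods_spanned,
        "mainlobe_hz": 4000.0 / window_ms,  # approximate null-to-null Hamming main lobe
        "kind": kind,
    }


narrow = window_report(20.0, f0_hz=125.0, fs_hz=16000.0)  # the 20-ms window of Figure 3.15
wide = window_report(4.0, f0_hz=125.0, fs_hz=16000.0)     # the 4-ms window of Figure 3.15
```

For this hypothetical pitch, the 20-ms window spans 2.5 pitch periods with a main lobe of about 200 Hz, and the 4-ms window spans half a period with a main lobe of about 1000 Hz. Note that the two-period rule of thumb does not by itself guarantee non-overlapping main lobes at a 125 Hz harmonic spacing; a longer window would be needed for that.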

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
  2. The shape of the vocal tract (place and manner of articulation):
       The place of the tongue hump along the oral tract
       The degree of constriction of the hump
       The shape is also determined by the possible connection to the nasal passage by way of the velum
  3. The time-domain waveform, which gives the pressure change with time at the lips output
  4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language
Phoneme - a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
    Syllables contain one or more phonemes.
    Words are formed from one or more syllables.
    Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of
  Vowels
  Consonants
  Diphthongs
  Affricates
  Semi-vowels
This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators; the adjustment varies with dialect and accent.

The articulatory properties are influenced by
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking
  The time-varying nature of the articulators

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception - uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
Vowels
  Source: quasi-periodic.
    Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
  System: each vowel phoneme corresponds to a different vocal tract configuration.
  Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  Articulatory differences between speakers are one cause of allophonic variations:
    The place and degree of constriction of the tongue hump
    The cross-section and length of the vocal tract
  => The vocal tract formants therefore vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System:
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
  • It is dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  • The constriction is narrower than with vowels.

System: the location of the constriction, made by the tongue or lips, determines which sound is produced:
  • Back, center, or front of the oral tract, as well as
  • The teeth or lips

Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.

In a simplified model, the periodic component of the voiced fricative output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

We assume in this simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of nonlinear effects (distributed sources due to traveling vortices).
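The simplified voiced-fricative model of Example 3.4 can be sketched numerically. The following is a minimal illustration, not part of the original slides: the sequences `g`, `h`, and `h_f`, the pitch period, and the choice of Gaussian white noise for $q[n]$ are all made-up assumptions.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def impulse_train(length, period):
    """p[n]: a unit sample every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def voiced_fricative(g, h, h_f, period, length, seed=0):
    """Return (x, u) with x[n] = h*u + h_f*(q.u) and u = g*p, truncated to `length`."""
    rng = random.Random(seed)
    u = convolve(g, impulse_train(length, period))[:length]   # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]          # white noise (assumed Gaussian)
    x_g = convolve(h, u)[:length]                             # periodic component
    modulated = [qn * un for qn, un in zip(q, u)]             # noise modulated by glottal flow
    x_q = convolve(h_f, modulated)[:length]                   # noise component
    return [a + b for a, b in zip(x_g, x_q)], u

# Illustrative (made-up) sequences
g = [0.0, 0.5, 1.0, 0.5]     # glottal flow over one cycle
h = [1.0, 0.6, 0.36]         # toy vocal tract impulse response
h_f = [1.0, -0.5]            # toy front-cavity impulse response
x, u = voiced_fricative(g, h, h_f, period=8, length=32)
```

Because the noise is multiplied by $u[n]$ before filtering, the noise component vanishes wherever the glottal flow is zero, which is the modulation effect the model is meant to capture.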

Elements of a Language: Fricatives

Spectrogram:
  • Unvoiced fricatives are characterized by a "noisy" spectrum.
  • Voiced fricatives often show both noise and harmonics.

Waveform:
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, though brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short duration
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps:
  • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • The delay between the burst and the voicing of the vowel onset is much shorter.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (waveform)

Elements of a Language: Plosives (spectrogram)

Elements of a Language: Plosives

Example 3.5: a time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during the transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$. Here $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\,u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\,\delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
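The generalized convolution of Example 3.5 can be evaluated directly as a double loop. The sketch below is only an illustration under stated assumptions: the responses `h` and `h_f` are hypothetical one-pole decays (the first with a pole that drifts over time), and the glottal excitation is simplified to a bare impulse train.

```python
def time_varying_output(h, h_f, u, burst):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m h_f(n, m) burst[n-m].

    h(n, m) and h_f(n, m) are callables giving the response at time n
    to a unit sample applied m samples earlier; inputs start at time 0.
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):          # causal: only inputs at times n - m >= 0
            acc += h(n, m) * u[n - m] + h_f(n, m) * burst[n - m]
        x.append(acc)
    return x

# Hypothetical responses (not from the slides)
def h(n, m):
    a = 0.5 + 0.4 * min(n, 20) / 20.0   # pole moves from 0.5 to 0.9 over 20 samples
    return a ** m

def h_f(n, m):
    return (-0.3) ** m                  # fixed front-cavity decay

N = 24
burst = [1.0] + [0.0] * (N - 1)                        # delta[n]
u = [1.0 if n % 8 == 0 else 0.0 for n in range(N)]     # simplified periodic excitation
x = time_varying_output(h, h_f, u, burst)
```

When `h(n, m)` does not depend on `n`, the double loop reduces to ordinary convolution, which is a convenient sanity check on the implementation.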

Elements of a Language: Transitional Speech Sounds

Diphthongs:
  • Have a vowel-like nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Are characterized by movement from one vowel target to another.

Examples (word, symbol): "hide" /Y/, "out" /W/, "boy" /O/, "new" /JU/.

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds:
  • Glides (/w/ as in "we" and /y/ as in "you")
  • Liquids (/r/ as in "read" and /l/ as in "let")

Glides, compared with diphthongs, have:
  • Greater constriction of the oral tract during the transition
  • Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference from fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
  • /tS/ as in "chew" (unvoiced)
  • /J/ as in "just" (voiced)

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future. The articulators at any time instant are therefore influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping: Nose
  • Spectral Shaping: Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation: Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation: Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language: Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language: Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language: Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language: Fricatives (2)
  • Elements of a Language: Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language: Plosives
  • Elements of a Language: Plosives (2)
  • Elements of a Language: Plosives (3)
  • Elements of a Language: Plosives (4)
  • Elements of a Language: Transitional Speech Sounds
  • Elements of a Language: Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language: Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody: The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation
Spectral Shaping

Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.

The resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  • The frequency of the formant is $\omega_0$.
  • The bandwidth depends on the distance of the pole from the unit circle (that is, on $r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

where $c_k$ and $c_k^{*}$ are complex-conjugate pole pairs with $|c_k| < 1$.
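To make the pole-to-formant relationship concrete, the sketch below converts a pole $z_0 = r_0 e^{j\omega_0}$ into a formant frequency and an approximate bandwidth in Hz, and evaluates the magnitude of an all-pole $H(z)$ in product form. The sampling rate, the pole location, and the bandwidth approximation $-\ln(r_0)\,f_s/\pi$ (a common rule of thumb, not stated on the slides) are assumptions of this illustration.

```python
import cmath
import math

def pole_to_formant(pole, fs):
    """Return (frequency_hz, bandwidth_hz) for a pole z0 = r0 * exp(j*w0).

    Frequency: w0 * fs / (2*pi). Bandwidth: the approximation
    -ln(r0) * fs / pi, which grows as the pole moves away from the unit circle.
    """
    r0, w0 = abs(pole), cmath.phase(pole)
    return w0 * fs / (2 * math.pi), -math.log(r0) * fs / math.pi

def all_pole_magnitude(poles, freq_hz, fs, gain=1.0):
    """|H| at freq_hz for H(z) = gain / prod_k (1 - c_k z^-1)(1 - c_k* z^-1)."""
    zinv = cmath.exp(-2j * math.pi * freq_hz / fs)
    denom = 1.0 + 0j
    for c in poles:
        denom *= (1 - c * zinv) * (1 - c.conjugate() * zinv)
    return abs(gain / denom)

fs = 8000.0                                            # assumed sampling rate
pole = 0.97 * cmath.exp(1j * 2 * math.pi * 500 / fs)   # hypothetical formant near 500 Hz
print(pole_to_formant(pole, fs))
print(all_pole_magnitude([pole], 500.0, fs), all_pole_magnitude([pole], 1500.0, fs))
```

Moving `r0` closer to 1 narrows the bandwidth and sharpens the spectral peak, which is the dependence on distance from the unit circle described above.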

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
  • Male speakers tend to have lower formants than female speakers.
  • Female speakers tend to have lower formants than children.

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
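The inverse relation between vocal tract length and formant frequency can be illustrated with the textbook model of a lossless uniform tube, closed at the glottis and open at the lips. This model, the tube lengths, and the speed of sound used below are assumptions of the sketch and are not taken from the slides; real vocal tracts are not uniform.

```python
def uniform_tube_formants(length_m, count=3, sound_speed=343.0):
    """Resonances of a lossless uniform tube closed at one end, open at the other.

    F_k = (2k - 1) * c / (4 * L), the quarter-wavelength resonator formula.
    """
    return [(2 * k - 1) * sound_speed / (4.0 * length_m) for k in range(1, count + 1)]

# Hypothetical tract lengths, chosen only to show the trend
for label, length_m in [("17.5 cm tract", 0.175), ("14 cm tract", 0.14)]:
    print(label, [round(f) for f in uniform_tube_formants(length_m)])
```

Under this idealization every formant scales as $1/L$, so a shorter tract raises all formants by the same factor.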

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window $w[n, \tau]$ centered at time $\tau$ is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the multiplication and convolution theorems, we obtain the Fourier transform of the speech segment, that is, its frequency-domain representation:

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, $2\pi/P$ is the fundamental frequency (pitch), and $\circledast$ denotes convolution in frequency.

Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of a glottal contribution).

Example 3.2 (Figure 3.11)

Example 3.2

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  • The nature of the glottal flow waveform over a cycle, e.g., a gradual or abrupt closing
  • The manner in which formant tails add

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
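The sparse-sampling effect can be sketched by evaluating a spectral envelope only at the harmonics $k f_0$. The example below assumes a single hypothetical formant (a conjugate pole pair near 500 Hz) and omits the glottal contribution $G(\omega)$ for brevity; the pitch values and sampling rate are likewise made up.

```python
import cmath
import math

def resonator_magnitude(freq_hz, formant_hz, r, fs):
    """|H| of a single conjugate pole pair at `formant_hz` with pole radius r."""
    zinv = cmath.exp(-2j * math.pi * freq_hz / fs)
    c = r * cmath.exp(2j * math.pi * formant_hz / fs)
    return abs(1.0 / ((1 - c * zinv) * (1 - c.conjugate() * zinv)))

def harmonic_samples(f0_hz, formant_hz, r, fs, fmax_hz):
    """Sample the envelope at the harmonics k*f0 up to fmax_hz."""
    count = int(fmax_hz // f0_hz)
    return [(k * f0_hz, resonator_magnitude(k * f0_hz, formant_hz, r, fs))
            for k in range(1, count + 1)]

def strongest_harmonic(samples):
    """Frequency of the harmonic with the largest envelope value."""
    return max(samples, key=lambda pair: pair[1])[0]

fs, formant, r = 8000.0, 500.0, 0.97     # hypothetical values
low_pitch = harmonic_samples(100.0, formant, r, fs, 2000.0)
high_pitch = harmonic_samples(350.0, formant, r, fs, 2000.0)
print(len(low_pitch), strongest_harmonic(low_pitch))
print(len(high_pitch), strongest_harmonic(high_pitch))
```

With a low pitch, many harmonics fall under the resonance and one lands on the formant itself; with a high pitch, only a few samples of the envelope are available and the strongest harmonic need not coincide with the formant.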

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

Two dominant effects characterize nasalization:
  • Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage
  • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  • Build-up of pressure behind the palate
  • Abrupt release of the pressure

Source Generation: Plosives "Drop"

[Figure annotations: VOT (voice onset time), aspiration]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or noise source):
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives "NASA"

Source Generation

There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  • Plosives: created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these source types do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
  • "zebra" vs. "sheba" (fricatives)
  • "bin" vs. "pin" (plosives)

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
  • The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example, the Hamming window: $w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$.
  • The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window center at time $\tau$.
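The windowing and transform steps can be sketched with a direct DFT in plain Python. This is only an illustration under stated assumptions: the function names, frame interval, DFT length, and the 1 kHz test tone are made up, and each frame's DFT is referenced to the window start rather than to absolute time, which changes only the phase and not the spectrogram.

```python
import cmath
import math

def hamming(nw):
    """w[m] = 0.54 - 0.46 cos(2*pi*m / (nw - 1)) for 0 <= m <= nw - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (nw - 1)) for m in range(nw)]

def stft(x, nw, hop, nfft):
    """Return frames[t][k] for frequencies 2*pi*k/nfft and window starts tau = t*hop.

    Each frame is the DFT of the windowed segment w[n - tau] x[n].
    """
    w = hamming(nw)
    frames = []
    for tau in range(0, len(x) - nw + 1, hop):
        seg = [w[m] * x[tau + m] for m in range(nw)]
        frames.append([sum(seg[m] * cmath.exp(-2j * math.pi * k * m / nfft)
                           for m in range(nw))
                       for k in range(nfft // 2 + 1)])
    return frames

def spectrogram(frames):
    """Squared magnitude of every STFT value."""
    return [[abs(v) ** 2 for v in frame] for frame in frames]

fs = 8000                                                        # assumed sampling rate
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)]  # 1 kHz test tone
S = spectrogram(stft(x, nw=80, hop=40, nfft=80))
peak_bin = max(range(len(S[0])), key=lambda k: S[0][k])
print(peak_bin * fs / 80)   # frequency of the strongest bin in the first frame
```

The `hop` argument plays the role of the frame interval: a smaller hop gives a higher frame rate at the cost of more computation.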

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

  • $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position $\tau$, one could plot $S(\omega, \tau)$ as a function of $\omega$.
  • A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14.

Figure 3.14 also illustrates two kinds of spectrograms:
  • Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.

Narrowband spectrogram:
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms do not overlap and that the corresponding transform side lobes are negligible, the spectrogram expression from the previous slide leads to the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

  • The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • This widening causes the neighboring window transforms, centered at adjacent harmonics, to overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
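The behavior of $E[\tau]$ for short and long windows can be sketched on a toy periodic waveform. The example below is an illustration only: it uses a rectangular window for simplicity (the slides use a tapered window), and the decaying pulse and pitch period are made-up values.

```python
def short_time_energy(x, nw, hop=1):
    """E[tau] = sum of x[n]^2 over nw samples starting at tau (rectangular window)."""
    return [sum(v * v for v in x[tau:tau + nw])
            for tau in range(0, len(x) - nw + 1, hop)]

# Toy "voiced" waveform: a decaying pulse repeated every P samples (hypothetical)
P = 40
pulse = [0.8 ** n for n in range(P)]
x = pulse * 6

short = short_time_energy(x, nw=10)    # shorter than one pitch period
long_ = short_time_energy(x, nw=80)    # two pitch periods

print(max(short) / min(short))   # large: energy fluctuates within each pitch period
print(max(long_) / min(long_))   # close to 1: the fluctuation is averaged out
```

The strongly fluctuating energy under the short window is what modulates the wideband spectrogram in time, once per pitch period.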

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
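The two window durations quoted for Figure 3.15 can be related to frequency resolution with a rough calculation. The sketch assumes a sampling rate and a pitch that are not given on the slides, and uses the standard approximation that a Hamming window of $N$ samples has a main lobe about $8\pi/N$ radians wide, i.e. $4 f_s / N$ Hz.

```python
def hamming_window_resolution(duration_ms, fs):
    """Return (samples, main_lobe_width_hz) for a Hamming window.

    Uses the approximation main-lobe width = 8*pi/N rad = 4*fs/N Hz.
    """
    n = int(round(duration_ms * 1e-3 * fs))
    return n, 4.0 * fs / n

fs = 16000.0    # assumed sampling rate
f0 = 220.0      # assumed pitch, so harmonics are 220 Hz apart
for name, ms in [("narrowband", 20.0), ("wideband", 4.0)]:
    n, width = hamming_window_resolution(ms, fs)
    verdict = "harmonics resolved" if width <= f0 else "harmonics smeared"
    print(name, n, width, verdict)
```

With these assumed values the 20-ms window has a main lobe narrower than the harmonic spacing, while the 4-ms window's main lobe spans several harmonics, matching the narrowband and wideband behavior described above.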

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram; axes: Time [s] (0 to 3), Frequency [Hz] (0 to 8000)]

MATLAB

[Figure: waveform (normalized magnitude versus Time [s], 0 to 3) and spectrogram (Frequency [Hz], 0 to 8000, versus Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word labels and phone labels: h sh iy hv ae dcl jh ax-h dcl d aa r kcl k s ux tcl ax-h n gcl g r iy s ix w aa sh w ao dx axr ao l y ih axr h]

MATLAB

[Figure: spectrogram; axes: Time [s] (0 to 3), Frequency [Hz] (0 to 8000)]

MATLAB

[Figure: waveform (normalized magnitude versus Time [s], 0 to 3) and spectrogram (Frequency [Hz], 0 to 8000, versus Time [s]) of the utterance "She had your dark suit in greasy wash water all year", with the same word and phone labels as above]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
  2. The shape of the vocal tract, i.e., the place and manner of articulation:
     • The place of the tongue hump along the oral tract
     • The degree of constriction of the hump
     • The shape is also determined by a possible connection to the nasal passage by way of the velum
  3. The time-domain waveform, which gives the pressure change with time at the lips output
  4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two factors (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors (the waveform and the spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Nasals
  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System:
      • The velum is lowered and the air flows mainly through the nasal cavity.
      • Because the oral tract is constricted, the sound is radiated at the nostrils.
      • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      • Dominated by the low resonance of the large volume of the nasal cavity.
      • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
      • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

  • Source:
      • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
      • The constriction is narrower than with vowels.
  • System: the location of the constriction by the tongue or lips determines which sound is produced:
      • Back, center, or front of the oral tract, as well as
      • The teeth or lips
  • Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model, the periodic component of the voiced fricative output at the lips is

$x_g[n] = h[n] * (g[n] * p[n])$

where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n] u[n])$

We assume in this simplified model that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  • u[n] is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
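The additive model of Example 3.4 can be sketched numerically. The following is a minimal Python illustration using only the standard library; the toy glottal pulse, the short impulse responses `h` and `hf`, the pitch period of 20 samples, and all function names are hypothetical choices made for illustration and are not taken from the slides.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, h, hf, period, length, seed=0):
    """Return (u, xg, xq, x), each truncated to `length` samples."""
    rng = random.Random(seed)
    p = [1.0 if n % period == 0 else 0.0 for n in range(length)]  # impulse train p[n]
    u = convolve(g, p)[:length]                       # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]  # white noise q[n]
    xg = convolve(h, u)[:length]                      # periodic component h * u
    modulated = [qn * un for qn, un in zip(q, u)]     # q[n] u[n]
    xq = convolve(hf, modulated)[:length]             # noise component hf * (q u)
    x = [a + b for a, b in zip(xg, xq)]               # x[n] = xg[n] + xq[n]
    return u, xg, xq, x


# Hypothetical toy values, chosen only to make the structure visible.
g = [0.2, 0.6, 1.0, 0.6, 0.2]   # glottal flow over one cycle
h = [1.0, 0.5, 0.25]            # vocal tract impulse response
hf = [1.0, -0.8]                # front-cavity impulse response
u, xg, xq, x = voiced_fricative(g, h, hf, period=20, length=100)
```

Because the noise is multiplied by u[n], the noise component in this sketch is nonzero only around the glottal pulses, which is the pitch-synchronous modulation the model describes.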

Elements of a Language: Fricatives

  • Spectrogram:
      • Unvoiced fricatives are characterized by a "noisy" spectrum.
      • Voiced fricatives often show both noise and harmonics.
  • Waveform:
      • An unvoiced fricative contains only noise.
      • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper
  • Forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short time
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all 4 steps.
  • While the oral tract is closed, we hear a low-frequency vibration caused by propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • The delay between the burst and the voicing of the vowel onset is much shorter.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to a following steady vowel.
  • This implies that the vocal tract output cannot be obtained by the convolution operator.
  • The vocal tract output must therefore be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m] u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
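The generalized convolution of Example 3.5 can be sketched as follows. This is a minimal Python illustration that assumes causal inputs starting at n = 0; the one-pole responses `h` and `hf`, their drift rates, and the toy source are hypothetical and stand in for the real time-varying vocal tract and front cavity.

```python
def time_varying_output(u, h):
    """x[n] = sum_m h(n, m) * u[n - m], where h(n, m) is the response at
    time n to a unit sample applied m samples earlier (at time n - m)."""
    return [sum(h(n, m) * u[n - m] for m in range(n + 1)) for n in range(len(u))]


def plosive_model(u, h, hf):
    """Periodic source through h plus an impulse burst at n = 0 through hf."""
    burst = [1.0] + [0.0] * (len(u) - 1)        # idealized burst: delta[n]
    xu = time_varying_output(u, h)              # periodic component
    xb = time_varying_output(burst, hf)         # burst component, equals hf(n, n)
    return [a + b for a, b in zip(xu, xb)]


# Hypothetical one-pole responses; the pole radius of h drifts slowly with n
# to mimic a vocal tract that changes shape after the burst.
def h(n, m):
    return (0.5 + 0.01 * n) ** m


def hf(n, m):
    return 0.3 ** m


P = 8                                                   # toy pitch period in samples
u = [1.0 if n % P == 0 else 0.0 for n in range(32)]     # toy periodic source
x = plosive_model(u, h, hf)
```

When `h(n, m)` does not depend on n, `time_varying_output` reduces to ordinary convolution, which is a quick way to check the indexing convention.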

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
  • Glides (w as in "we" and y as in "you")
  • Liquids (r as in "read" and l as in "let")

Glides, compared to diphthongs, have:
  • Greater constriction of the oral tract during the transition
  • Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: tS as in "chew" (unvoiced) and J as in "just" (voiced).

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different:
  • Meaning
  • Stress
  • Emotion

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Figure 3.10

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant:
  • The frequency of the formant is $\omega_0$.
  • The bandwidth depends on the distance of the pole from the unit circle ($r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^* z^{-1})}$$
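The relation between a pole and its formant can be made concrete with a short Python sketch. It assumes the common narrow-bandwidth approximation $B \approx -\ln(r_0) f_s / \pi$ (in Hz) for a pole close to the unit circle, which is not stated in the slides; the sampling rate, formant values, and function names are hypothetical.

```python
import cmath
import math


def pole_from_formant(freq_hz, bw_hz, fs):
    """Pole z0 = r0 * exp(j*w0) for a formant at freq_hz with bandwidth bw_hz
    (narrow-bandwidth approximation, assumed here)."""
    r0 = math.exp(-math.pi * bw_hz / fs)
    w0 = 2.0 * math.pi * freq_hz / fs
    return cmath.rect(r0, w0)


def formant_from_pole(z0, fs):
    """Inverse mapping: (frequency in Hz, approximate bandwidth in Hz)."""
    r0, w0 = cmath.polar(z0)
    return w0 * fs / (2.0 * math.pi), -math.log(r0) * fs / math.pi


def all_pole_response(poles, w, gain=1.0):
    """|H(e^{jw})| for H(z) = A / prod_k (1 - c_k z^-1)(1 - c_k* z^-1)."""
    zinv = cmath.exp(-1j * w)
    den = 1.0
    for c in poles:
        den *= (1 - c * zinv) * (1 - c.conjugate() * zinv)
    return abs(gain / den)


fs = 8000.0                                   # hypothetical sampling rate
z0 = pole_from_formant(500.0, 80.0, fs)       # hypothetical formant: 500 Hz, 80 Hz wide
freq, bw = formant_from_pole(z0, fs)
peak = all_pole_response([z0], 2.0 * math.pi * 500.0 / fs)
```

Moving the pole closer to the unit circle (larger $r_0$) narrows the bandwidth and sharpens the spectral peak near $\omega_0$.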

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
  • Male speakers tend to have lower formants than female speakers.
  • Female speakers tend to have lower formants than children.

Under the vocal tract's linearity and time-invariance assumption, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
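The dependence of formants on vocal tract length can be illustrated with an idealized model that is not part of these slides: a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of the quarter-wavelength frequency, $F_k = (2k - 1)c / (4L)$. The sketch below assumes a speed of sound of 350 m/s and hypothetical tract lengths.

```python
def uniform_tube_formants(length_m, count=3, c=350.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end and
    open at the other: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, count + 1)]


# Hypothetical tract lengths: a longer tract and a shorter one.
long_tract = uniform_tube_formants(0.175)   # roughly 500, 1500, 2500 Hz
short_tract = uniform_tube_formants(0.14)   # every resonance is higher
```

In this idealization every formant scales inversely with length, which is consistent with the ordering of male, female, and child formants described above.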

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$u[n] = g[n] * p[n]$

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

$x[n] = h[n] * (g[n] * p[n])$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$x[n, \tau] = w[n, \tau] \{ h[n] * (g[n] * p[n]) \}$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment, i.e., its frequency-domain representation, is obtained.

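Example 3.2

$$X(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$$

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k) W(\omega - \omega_k, \tau)$$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only a glottal contribution).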

Example 3.2 (Figure 3.11)

Example 3.2

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  • The nature of the glottal flow waveform over a cycle, e.g., a gradual or abrupt closing
  • The manner in which formant tails add

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information may be detrimental to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
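The effect described in Example 3.3 can be sketched with a single-resonance model. The Python fragment below is only an illustration under stated assumptions: one conjugate pole pair stands in for the whole vocal tract, the bandwidth-to-radius mapping is the usual narrow-bandwidth approximation, and the pitch (700 Hz), formant values (400 Hz and 700 Hz), bandwidth, and sampling rate are hypothetical.

```python
import cmath
import math


def resonance_gain(freq_hz, formant_hz, bw_hz, fs):
    """Magnitude at freq_hz of a single two-pole resonance centered near
    formant_hz (narrow-bandwidth pole placement, assumed here)."""
    r = math.exp(-math.pi * bw_hz / fs)
    c = cmath.rect(r, 2.0 * math.pi * formant_hz / fs)
    zinv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return abs(1.0 / ((1 - c * zinv) * (1 - c.conjugate() * zinv)))


fs = 16000.0
pitch = 700.0                                   # hypothetical soprano fundamental
harmonics = [pitch * k for k in (1, 2, 3)]

# F1 well below the first harmonic: the harmonics sample the resonance skirt.
low_f1 = [resonance_gain(f, 400.0, 80.0, fs) for f in harmonics]

# F1 raised to coincide with the first harmonic.
tuned_f1 = [resonance_gain(f, 700.0, 80.0, fs) for f in harmonics]
```

Comparing `low_f1[0]` with `tuned_f1[0]` shows how much more strongly the first harmonic is passed once the formant is moved onto it.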

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

Two dominant effects characterize nasalization:
  • Broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage
  • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation

The previous section discussed the effect of the vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  • Build-up of pressure behind the palate
  • Abrupt release of pressure

Source Generation: Plosives, "Drop"

[Figure labels: VOT, Aspiration]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the spectrum of the noise was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
      • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
      • Plosives: created with an impulsive source within the oral tract. Example: "top".
      • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these subcategories do not relate exclusively to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples:
  • "zebra" vs. "sheba" (fricatives)
  • "bin" vs. "pin" (plosives)

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  • The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example (Hamming window): $w[n, \tau] = 0.54 - 0.46 \cos[2\pi(n - \tau)/(N_w - 1)]$ for $0 \le n \le N_w - 1$
  • The window does not necessarily move one sample at a time; it typically moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] e^{-j\omega n}$$

where $x[n, \tau] = w[n, \tau] x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
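A direct, unoptimized evaluation of the windowed transform can be sketched in Python as follows. The sketch makes one assumption that should be stated plainly: the Hamming window is taken to start at sample $\tau$ and cover $N_w$ samples, so that its argument is $n - \tau$. The test signal, window length, hop size, and frequency grid are hypothetical, and a practical implementation would use an FFT rather than this explicit sum.

```python
import cmath
import math


def hamming(nw):
    """Hamming window of length nw: 0.54 - 0.46 cos(2*pi*n / (nw - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (nw - 1)) for n in range(nw)]


def stft_frame(x, tau, nw, omega):
    """X(omega, tau) = sum_n w[n - tau] x[n] exp(-j omega n), with the window
    assumed to start at sample tau."""
    w = hamming(nw)
    total = 0j
    for i in range(nw):
        n = tau + i
        if 0 <= n < len(x):
            total += w[i] * x[n] * cmath.exp(-1j * omega * n)
    return total


def spectrogram(x, nw, hop, n_freqs):
    """S(omega_k, tau) = |X(omega_k, tau)|^2 on the grid omega_k = pi*k/n_freqs."""
    frames = []
    for tau in range(0, len(x) - nw + 1, hop):
        frames.append([abs(stft_frame(x, tau, nw, math.pi * k / n_freqs)) ** 2
                       for k in range(n_freqs)])
    return frames


# Hypothetical test signal: a sinusoid at omega = pi/4 radians per sample.
x = [math.cos(math.pi * n / 4.0) for n in range(64)]
S = spectrogram(x, nw=32, hop=16, n_freqs=8)
```

The `hop` argument plays the role of the frame interval: a smaller hop gives a higher frame rate at the cost of more transforms.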

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

  • $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
  • For each window position $\tau$ one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  • Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \}$$

$$x[n, \tau] = w[n, \tau] ( p[n] * \tilde{h}[n] )$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k) W(\omega - \omega_k, \tau) \right|^2$$

where $\tilde{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation in the previous slide leads to the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2 |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta |\tilde{H}(\omega)|^2 E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
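The short-time energy under a window shorter than one pitch period can be sketched in a few lines of Python. The example is a toy under stated assumptions: the "voiced" signal is a decaying pulse repeated every 40 samples, and a 10-sample rectangular window is used for simplicity where the slides use a Hamming window; all names and values are hypothetical.

```python
def short_time_energy(x, w, tau):
    """E[tau] = sum_n (w[n - tau] * x[n])^2, with the window starting at tau."""
    return sum((w[i] * x[tau + i]) ** 2
               for i in range(len(w)) if 0 <= tau + i < len(x))


P = 40                                          # toy pitch period in samples
x = [0.8 ** (n % P) for n in range(3 * P)]      # decaying pulse repeated every P samples
w = [1.0] * 10                                  # short window: a quarter of a pitch period
E = [short_time_energy(x, w, tau) for tau in range(len(x) - len(w) + 1)]
```

`E` is large when the window sits on a pulse onset and small between pulses, and it repeats every `P` samples. With a window this short, the energy therefore fluctuates once per pitch period as the window slides.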

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
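The two window durations quoted for Figure 3.15 can be compared against a pitch period with a small Python sketch. It applies the rule of thumb from these slides (at least two pitch periods for narrowband, less than one for wideband) and the standard approximation that a Hamming window of duration T has a null-to-null main-lobe width of about 4/T Hz. The 125 Hz pitch and the function names are hypothetical.

```python
def hamming_mainlobe_hz(window_ms):
    """Approximate null-to-null main-lobe width (Hz) of a Hamming window of
    the given duration: about 4 / T."""
    return 4.0 / (window_ms * 1e-3)


def classify_window(window_ms, pitch_hz):
    """Label a window by its length in pitch periods, following the slides'
    rule of thumb; lengths between one and two periods are left unlabeled."""
    periods = window_ms * 1e-3 * pitch_hz
    if periods >= 2.0:
        return "narrowband"
    if periods < 1.0:
        return "wideband"
    return "intermediate"


pitch_hz = 125.0                                    # hypothetical pitch (8-ms period)
long_label = classify_window(20.0, pitch_hz)        # 2.5 pitch periods
short_label = classify_window(4.0, pitch_hz)        # 0.5 pitch periods
long_width = hamming_mainlobe_hz(20.0)              # about 200 Hz
short_width = hamming_mainlobe_hz(4.0)              # about 1000 Hz
```

With harmonics spaced 125 Hz apart, the 4-ms window's main lobe spans several harmonics, so they merge into the envelope, while the 20-ms window's main lobe is narrow enough for individual harmonic peaks to stand out.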

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram; horizontal axis Time [s] (0 to 3), vertical axis Frequency [Hz] (0 to 8000)]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s]) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels]

MATLAB

[Figure: spectrogram; horizontal axis Time [s] (0 to 3), vertical axis Frequency [Hz] (0 to 8000)]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s]) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
  2. The shape of the vocal tract (place and manner of articulation):
      • Place of the tongue hump along the oral tract
      • Degree of constriction of the hump
      • The shape is also determined by a possible connection to the nasal passage by way of the velum
  3. The time-domain waveform, which gives the pressure change with time at the lips output
  4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two factors are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two descriptors are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of:
  • Vowels
  • Consonants
  • Diphthongs
  • Affricates
  • Semi-vowels

This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.

The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction formed by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$x_g[n] = h[n] * (g[n] * p[n])$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ and assumed to be white noise. The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\,u[n])$

Example 3.4 (continued)

We assume in this simplified model that the results of the two airflow sources add (with $*$ denoting convolution):

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored: $u[n]$ is modified by the oral cavity; $x_q[n]$ can be influenced by the back cavity; sources of nonlinear effects (distributed sources due to traveling vortices).
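The simplified voiced-fricative model of Example 3.4 can be sketched numerically. The following is a minimal illustration only, written in plain Python; the toy sequences chosen for g[n], h[n], and h_f[n], the pitch period of 10 samples, and the helper names are assumptions made for the example and do not come from the slides.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(num_samples, pitch_period, g, h, hf, seed=0):
    """Return u, x_g, x_q and x = x_g + x_q, each truncated to num_samples."""
    rng = random.Random(seed)
    # p[n]: unit impulse train with spacing P
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    # u[n] = g[n] * p[n]: periodic glottal flow
    u = convolve(g, p)[:num_samples]
    # q[n]: white Gaussian noise standing in for the turbulence source
    q = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    # x_g[n] = h[n] * u[n]: periodic component
    x_g = convolve(h, u)[:num_samples]
    # x_q[n] = h_f[n] * (q[n] u[n]): noise modulated by the glottal flow
    modulated = [qn * un for qn, un in zip(q, u)]
    x_q = convolve(hf, modulated)[:num_samples]
    x = [a + b for a, b in zip(x_g, x_q)]
    return u, x_g, x_q, x


# Toy (assumed) sequences: one short glottal pulse and two short impulse responses.
g = [0.0, 0.5, 1.0, 0.5]
h = [1.0, 0.6, 0.36]
hf = [1.0, -0.5]
u, x_g, x_q, x = voiced_fricative(40, 10, g, h, hf)
```

Because q[n] is multiplied by u[n], the noise component is nonzero only where the glottal flow is nonzero, which is the pitch-synchronous modulation the model describes.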

Elements of a Language: Fricatives

Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants.

Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:

1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. While the oral tract is closed, we hear a low-frequency vibration due to propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the onset of voicing in the vowel.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:

$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\,u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\,\delta[n - m]$

We have assumed that the two outputs can be linearly combined.
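The generalized (time-varying) convolution of Example 3.5 can be illustrated with a short Python sketch. Everything concrete here is an assumption chosen for illustration: the responses are passed in as ordinary functions of (n, m), they are assumed negligible beyond `max_lag` samples, the glottal pulse is taken to be a single sample so that u[n] is just the impulse train, and the two toy responses are hypothetical one-pole decays.

```python
def time_varying_output(u, h, hf, max_lag):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) delta[n-m].

    h(n, m) and hf(n, m) give the response at time n to a unit sample
    applied m samples earlier; both are assumed negligible for m > max_lag.
    """
    x = []
    for n in range(len(u)):
        lags = range(min(n, max_lag) + 1)
        periodic = sum(h(n, m) * u[n - m] for m in lags)
        # delta[n-m] is nonzero only for m = n, so the burst term is hf(n, n)
        burst = hf(n, n) if n <= max_lag else 0.0
        x.append(periodic + burst)
    return x


def drifting_tract(n, m):
    """Hypothetical vocal tract: a one-pole decay whose pole drifts over 50 samples."""
    a = 0.5 + 0.4 * min(n, 50) / 50.0
    return a ** m


def front_cavity(n, m):
    """Hypothetical front cavity: a fixed, fast one-pole decay."""
    return 0.3 ** m


P = 10  # assumed pitch period in samples
u = [1.0 if n % P == 0 else 0.0 for n in range(60)]
x = time_varying_output(u, drifting_tract, front_cavity, max_lag=30)
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is a useful sanity check on the sketch.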

Elements of a Language: Transitional Speech Sounds

Diphthongs

Vowel-like in nature, with vibrating vocal folds.

Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.

Characterized by movement from one vowel target to another.

Examples (phoneme symbol in slashes): "hide" /Y/, "out" /W/, "boy" /O/, "new" /JU/.

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds, glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").

Glides involve a greater constriction of the oral tract during the transition, and a greater speed of oral tract movement, compared to diphthongs.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: /tS/ as in "chew" (unvoiced); /J/ as in "just" (voiced).

Coarticulation

The vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g., "horse" vs. "horseshoe", "sweep" vs. "seep".

Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation

Spectral Shaping

The peaks of the spectrum of the vocal tract response correspond approximately to its formants.

For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant. The frequency of the formant is $\omega_0$. The bandwidth depends on the distance of the pole from the unit circle (that is, on $r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:

$H(z) = \dfrac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$

$H(z) = \sum_{k=1}^{N_i} \dfrac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$
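The link between a pole location and a formant can be made concrete with a small Python sketch. The slide states only that the formant frequency is the pole angle and that the bandwidth depends on the pole radius; the bandwidth formula used below, B = -ln(r0) * fs / pi, is a commonly used approximation for poles close to the unit circle and is an assumption here, as are the function names and the 8 kHz sampling rate.

```python
import cmath
import math


def formant_from_pole(pole, fs):
    """Map a pole z0 = r0 * exp(j*omega0) to (frequency_hz, bandwidth_hz)."""
    r0 = abs(pole)
    omega0 = cmath.phase(pole)                    # radians per sample
    freq_hz = omega0 * fs / (2.0 * math.pi)
    bandwidth_hz = -math.log(r0) * fs / math.pi   # approximation, assumed
    return freq_hz, bandwidth_hz


def pole_from_formant(freq_hz, bandwidth_hz, fs):
    """Inverse mapping under the same approximation."""
    r0 = math.exp(-math.pi * bandwidth_hz / fs)
    omega0 = 2.0 * math.pi * freq_hz / fs
    return cmath.rect(r0, omega0)


fs = 8000.0                                   # assumed sampling rate in Hz
z0 = pole_from_formant(500.0, 80.0, fs)       # hypothetical formant
freq, bw = formant_from_pole(z0, fs)
```

Moving the pole closer to the unit circle (r0 closer to 1) narrows the bandwidth and sharpens the spectral peak, which matches the qualitative statement on the slide.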

Spectral Shaping

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases. Male speakers tend to have lower formants than female speakers, and female speakers have lower formants than children.

Under the assumption that the vocal tract is linear and time-invariant, and when the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$u[n] = g[n] * p[n]$

where $g[n]$ is the airflow over one glottal cycle, $p[n]$ is the unit sample train with spacing $P$, and $*$ denotes convolution. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$x[n] = h[n] * (g[n] * p[n])$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment, its frequency-domain representation, is obtained.

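Example 3.2 (continued)

$X(\omega, \tau) = \dfrac{1}{P}\, W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \right]$

$X(\omega, \tau) = \dfrac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k) G(\omega_k)\, W(\omega - \omega_k, \tau)$

where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\circledast$ denotes convolution in frequency, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency, or pitch.

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consisted only of the glottal contribution).

Example 3.2: Figure 3.11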

Example 3.2 (continued)

The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ only sparsely. This is especially the case for high-pitched speech.
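The sparse-sampling effect can be illustrated with simple arithmetic: harmonics fall at integer multiples of the fundamental, so a higher pitch leaves fewer harmonics to trace the envelope and can leave a formant far from the nearest harmonic. The Python sketch below is only an illustration; the 4 kHz band, the two pitch values, and the 500 Hz formant are assumed numbers, not values from the slides.

```python
def harmonic_frequencies(f0_hz, max_hz):
    """Harmonics k * F0 (k = 1, 2, ...) that do not exceed max_hz."""
    count = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


def nearest_harmonic_gap(f0_hz, formant_hz, max_hz):
    """Distance in Hz from a formant to the closest source harmonic."""
    return min(abs(f - formant_hz) for f in harmonic_frequencies(f0_hz, max_hz))


band_hz = 4000      # assumed analysis band
formant_hz = 500    # hypothetical first formant

for f0 in (100, 400):   # assumed low-pitched and high-pitched voices
    n_harm = len(harmonic_frequencies(f0, band_hz))
    gap = nearest_harmonic_gap(f0, formant_hz, band_hz)
    print(f"F0 = {f0} Hz: {n_harm} harmonics below {band_hz} Hz, "
          f"nearest harmonic is {gap} Hz from the formant")
```

With these assumed numbers, the low-pitched voice places a harmonic exactly on the formant, while the high-pitched voice samples the envelope with a quarter as many harmonics and misses the formant peak by 100 Hz.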

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency.

A formant corresponds to a vocal tract pole (resonant frequency).

Harmonics arise from the periodicity of the glottal source (pitch).

In signal processing algorithms that require formants, the scarcity of spectral information between harmonics can be detrimental to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency ($F_1$) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

Two dominant effects characterize nasalization:

1. Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
2. Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

In the previous section, the effect of vocal tract shape on sound production was discussed.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive): pressure builds up behind the closure at the palate and is then abruptly released.

Source Generation: Plosives, "Drop" (figure; annotated regions: VOT, aspiration)

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded): the constriction generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source). There is no harmonic structure with these types of input: the source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the source spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; it is, however, less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:

Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".

Plosives: created with an impulsive source within the oral tract. Example: "top".

Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these subcategories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the subcategories above may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives).

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced. This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: the Hamming window

$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\dfrac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n \le N_w - 1$

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$

where

$x[n, \tau] = w[n, \tau]\, x[n]$

represents the windowed speech segments as a function of the window center at time $\tau$.
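The windowing and transform steps can be sketched directly from these definitions. The following Python is a slow, minimal illustration only: in this sketch the window position tau marks the first sample under the window, the transform is evaluated on a small uniform grid of frequencies from 0 to pi, and the test signal, window length, and hop size are all assumed values.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2*pi*m / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (n_w - 1))
            for m in range(n_w)]


def stft_frame(x, tau, n_w, n_freqs):
    """X(omega, tau) on n_freqs frequencies in [0, pi]; window starts at tau."""
    w = hamming(n_w)
    frame = []
    for i in range(n_freqs):
        omega = math.pi * i / (n_freqs - 1)
        acc = 0j
        for m in range(n_w):
            n = tau + m
            if 0 <= n < len(x):
                acc += w[m] * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame


def spectrogram(x, n_w, hop, n_freqs):
    """S(omega, tau) = |X(omega, tau)|^2, one row per window position."""
    return [[abs(v) ** 2 for v in stft_frame(x, tau, n_w, n_freqs)]
            for tau in range(0, len(x) - n_w + 1, hop)]


# Assumed test signal: a sinusoid at omega = pi/4 radians per sample.
x = [math.cos(math.pi * n / 4.0) for n in range(128)]
S = spectrogram(x, n_w=32, hop=16, n_freqs=9)
```

The `hop` argument plays the role of the frame interval: the window advances by `hop` samples between successive transforms rather than one sample at a time.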

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$S(\omega, \tau) = |X(\omega, \tau)|^2$

$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.

For each window position $\tau$, one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:

Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.

Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$

$x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$S(\omega, \tau) = \dfrac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$

where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \dfrac{2\pi}{P}k$, and $\dfrac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram

Uses a "long" window, with a duration of typically at least two pitch periods. Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):

$S(\omega, \tau) \approx \dfrac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that closely precede a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window, with a duration of less than one pitch period (see Figure 3.14). Shortening the window widens its Fourier transform (recall the uncertainty principle). The wider window transforms centered at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency. It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
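The rule of thumb from the preceding slides (a narrowband window spans at least two pitch periods, a wideband window less than one) can be checked with a few lines of Python. This is only a sketch: the 100 Hz pitch and the 16 kHz sampling rate are assumed values, and the "intermediate" label for windows between one and two pitch periods is a naming choice made here, not a term from the slides.

```python
def window_samples(window_ms, fs_hz):
    """Window length in samples for a given duration and sampling rate."""
    return round(window_ms * fs_hz / 1000.0)


def classify_window(window_ms, f0_hz):
    """Classify an analysis window relative to the pitch period 1/F0."""
    pitch_period_ms = 1000.0 / f0_hz
    if window_ms >= 2.0 * pitch_period_ms:
        return "narrowband"
    if window_ms < pitch_period_ms:
        return "wideband"
    return "intermediate"


f0_hz = 100.0      # assumed pitch, i.e. a 10-ms pitch period
fs_hz = 16000.0    # assumed sampling rate

for window_ms in (20.0, 4.0):   # the two Hamming window lengths quoted above
    print(window_ms, "ms:", window_samples(window_ms, fs_hz), "samples,",
          classify_window(window_ms, f0_hz))
```

The classification depends on the speaker: for a much higher-pitched voice the pitch period shrinks, so the same 4-ms window may no longer be shorter than one period.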

Figure 3.15

Figure 3.16

MATLAB

[Figure: plot titled SPECTROGRAM; horizontal axis Time [s], vertical axis Frequency [Hz].]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s]) above a plot titled SPECTROGRAM (Frequency [Hz] vs. Time [s]) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr).]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Classification of speech sounds can also be done from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.

If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used, the study is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open.

Tongue position and height: front, central, or back along the palate.

Constriction: partial or complete.

Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

  • There are two broad classes of fricatives: voiced and unvoiced.
  • Source:
      • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
      • The constriction is narrower than with vowels.
  • System: the location of the constriction made by the tongue or lips determines which sound is produced:
      • the back, center, or front of the oral tract, as well as
      • the teeth or lips.
  • Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

The simplified model assumes that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  • u[n] is modified by the oral cavity.
  • x_q[n] can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
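The additive model of Example 3.4 can be sketched numerically. The following is a minimal illustration only, assuming toy values for g[n], h[n], and h_f[n] (none of them come from the slides) and Gaussian white noise for q[n]; the function name `voiced_fricative` is hypothetical.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def voiced_fricative(g, P, n_periods, h, hf, seed=0):
    """Sketch of x[n] = h[n]*u[n] + hf[n]*(q[n] u[n]) with u[n] = g[n]*p[n]."""
    rng = random.Random(seed)
    N = P * n_periods
    p = [1.0 if n % P == 0 else 0.0 for n in range(N)]    # impulse train, period P
    u = convolve(g, p)[:N]                                 # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(N)]            # assumed white noise
    xg = convolve(h, u)[:N]                                # periodic component
    modulated = [qi * ui for qi, ui in zip(q, u)]          # noise gated by u[n]
    xq = convolve(hf, modulated)[:N]                       # noise component
    x = [a + b for a, b in zip(xg, xq)]                    # the two sources add
    return u, xg, xq, x

# Hypothetical toy sequences, chosen only to make the sketch runnable.
g = [0.0, 0.5, 1.0, 0.5]        # glottal pulse over one cycle
h = [1.0, 0.6, 0.36, 0.216]     # vocal tract impulse response
hf = [1.0, -0.5]                # front-cavity impulse response
u, xg, xq, x = voiced_fricative(g, P=8, n_periods=4, h=h, hf=hf)
```

Because q[n] is multiplied by u[n] before filtering, the noise component is nonzero only while the glottal flow is nonzero, which is the modulation the example describes.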

Elements of a Language: Fricatives

  • Spectrogram:
      • Unvoiced fricatives are characterized by a “noisy” spectrum, while
      • voiced fricatives often show both noise and harmonics.
  • Waveform:
      • An unvoiced fricative contains only noise.
      • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

  • Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
  • As with fricatives, plosives can be voiced or unvoiced.
  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  • Sequence of events:
      1. Complete closure of the oral tract and buildup of air pressure.
      2. Release of air pressure and generation of turbulence over a very short time duration.
      3. Generation of aspiration due to turbulence at the open vocal folds.
      4. Onset of the following vowel about 40-50 ms after the burst.
  • With voiced plosives, the vocal folds vibrate for the duration of all 4 steps.
      • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.
      • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
      • There is a much shorter delay between the burst and the onset of voicing in the following vowel.
  • Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

  • A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
  • Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n].
  • The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

    where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
  • Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
  • In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response h_f[n, m].
  • h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
  • Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:

$$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$
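The generalized convolution in Example 3.5 can be evaluated directly from its definition. The sketch below is illustrative only: the responses `h` and `hf` are hypothetical toy functions (a one-pole decay whose pole drifts with time, and a fixed decay), the sums are truncated to M terms, and g[n] is taken to be a unit sample so that u[n] = p[n].

```python
def time_varying_output(u, h, hf, M):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) delta[n-m], with m = 0..M-1.

    h(n, m) and hf(n, m) return the response at time n to a unit sample
    applied m samples earlier; the burst delta[n] fires at n = 0.
    """
    x = []
    for n in range(len(u)):
        # Periodic branch: generalized convolution with the glottal flow u[n].
        periodic = sum(h(n, m) * u[n - m] for m in range(min(M, n + 1)))
        # Burst branch: delta[n - m] is nonzero only when m = n.
        burst = hf(n, n) if n < M else 0.0
        x.append(periodic + burst)
    return x

# Hypothetical toy responses (assumptions, not taken from the slides).
def h(n, m):
    pole = 0.5 + 0.4 * min(n / 32.0, 1.0)   # pole drifts as the tract changes shape
    return pole ** m

def hf(n, m):
    return (-0.5) ** m                      # fixed front-cavity decay

P = 8
u = [1.0 if n % P == 0 else 0.0 for n in range(32)]   # impulse train, g[n] = unit sample
x = time_varying_output(u, h, hf, M=16)
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is a convenient sanity check for any implementation of this model.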

Elements of a Language: Transitional Speech Sounds

  • Diphthongs:
      • Vowel-like in nature, with vibrating vocal folds.
      • Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
      • Characterized by movement from one vowel target to another.
      • Examples: “hide” (Y), “out” (W), “boy” (O), “new” (JU).

Elements of a Language: Transitional Speech Sounds

  • Semi-vowels: two categories of vowel-like sounds:
      • Glides (w as in “we” and y as in “you”), and
      • Liquids (r as in “read” and l as in “let”).
  • Glides, compared to diphthongs, have:
      • greater constriction of the oral tract during the transition, and
      • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

  • Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples:
      • tS as in “chew” (unvoiced)
      • J as in “just” (voiced)

Coarticulation

  • Vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  • Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
      • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time (“horse” vs. “horseshoe”, “sweep” vs. “seep”).
      • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

  • The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
      • Intonation (change in pitch)
      • Amplitude/energy (loudness)
      • Timing (articulation rate or rhythm)
  • These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of “Shop”
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping: Nose
  • Spectral Shaping: Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation: Plosives “Drop”
  • Fricatives
  • Source Generation (2)
  • Source Generation: Fricatives “NASA”
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language: Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language: Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language: Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language: Fricatives (2)
  • Elements of a Language: Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language: Plosives
  • Elements of a Language: Plosives (2)
  • Elements of a Language: Plosives (3)
  • Elements of a Language: Plosives (4)
  • Elements of a Language: Transitional Speech Sounds
  • Elements of a Language: Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language: Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody: The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation
Page 30: Speech Processing

Spectral Shaping

  • Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.
  • In general, the formant frequencies decrease as the vocal tract length increases:
      • Male speakers tend to have lower formants than female speakers.
      • Female speakers tend to have lower formants than children.
  • Under the assumptions that the vocal tract is linear and time-invariant, and that the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
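The dependence of formant frequency on vocal tract length noted above can be illustrated with a standard idealization that is not part of these slides: a lossless uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). The tract lengths and the speed of sound below are assumed round numbers, and real vocal tracts are not uniform tubes, so the values are only indicative.

```python
def uniform_tube_formants(length_m, n_formants=3, c=343.0):
    """Resonances F_k = (2k - 1) c / (4 L) of an idealized uniform tube.

    length_m: tube length in meters (assumed value, not from the slides)
    c: speed of sound in m/s (assumed value)
    """
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n_formants + 1)]

longer = uniform_tube_formants(0.175)    # a longer vocal tract
shorter = uniform_tube_formants(0.145)   # a shorter vocal tract

for label, formants in [("L = 0.175 m", longer), ("L = 0.145 m", shorter)]:
    print(label, [round(f) for f in formants])
```

Under this idealization every formant scales as 1/L, so the longer tube has the lower F1, F2, and F3, consistent with the trend stated in the slide.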

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where g[n] is the airflow over one glottal cycle, p[n] is the unit sample train with spacing P, and * denotes convolution. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window w[n, τ], centered at time τ, is applied to the vocal tract output to obtain the speech segment

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment, i.e. its frequency-domain representation, is obtained as

$$X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], ⊛ denotes convolution in frequency, ω_k = (2π/P)k, and 2π/P is the fundamental frequency (pitch).

Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).

The general upward or downward slope of the spectral envelope, also called the spectral tilt, is influenced by:
  • the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  • the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)|, because of the sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.
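The harmonic sampling of the spectral envelope in Example 3.2 can be checked numerically. This is a minimal sketch under assumed toy values: a three-sample glottal pulse, a single decaying resonance for h[n], a pitch period of P = 16 samples, and a 64-sample Hamming window with a 64-point DFT, so that harmonics fall on every fourth DFT bin. None of these values come from the slides.

```python
import cmath
import math

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def hamming(Nw):
    """Hamming window of length Nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]

def dft_mag(x):
    """Magnitude of the DFT of x for bins 0..N/2 (direct evaluation)."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N // 2 + 1)]

P, Nw = 16, 64                                             # pitch period, window length
p = [1.0 if n % P == 0 else 0.0 for n in range(256)]       # unit sample train
g = [0.5, 1.0, 0.5]                                        # toy glottal pulse (assumed)
h = [0.8 ** n * math.cos(0.25 * math.pi * n) for n in range(64)]  # toy resonance (assumed)
x = convolve(h, convolve(g, p))                            # x[n] = h[n] * (g[n] * p[n])

# Window a steady-state stretch covering exactly four pitch periods.
segment = [w * s for w, s in zip(hamming(Nw), x[128:128 + Nw])]
mag = dft_mag(segment)

# Harmonics omega_k = (2 pi / P) k land on DFT bins that are multiples of Nw / P.
harmonic_bins = [k for k in range(1, Nw // 2) if k % (Nw // P) == 0]
```

In this setup `mag` has local peaks at `harmonic_bins` and is small in between, and the peak heights follow |H(ω_k)G(ω_k)|: the envelope is visible only where a harmonic samples it.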

Spectral Shaping

  • The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
      • A formant corresponds to a vocal tract pole (resonant frequency).
      • Harmonics arise due to the periodicity of the glottal source (pitch).
  • In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.
  • On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

  • A soprano singer often sings a tone whose first harmonic (fundamental frequency ω_1) is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.
  • To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
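The effect of formant tuning in Example 3.3 can be illustrated with a single two-pole resonance standing in for the first formant. This is a rough sketch: the sampling rate, fundamental frequency, formant frequencies, and bandwidth are assumed values chosen for illustration, and a real vocal tract has several formants.

```python
import cmath
import math

def resonator_gain(f_hz, formant_hz, bandwidth_hz, fs):
    """Magnitude response at f_hz of a two-pole resonance centered at formant_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs)         # pole radius from bandwidth
    pole = r * cmath.exp(2j * math.pi * formant_hz / fs)
    z = cmath.exp(2j * math.pi * f_hz / fs)            # point on the unit circle
    return 1.0 / abs((1 - pole / z) * (1 - pole.conjugate() / z))

fs = 16000.0        # sampling rate in Hz (assumed)
f0 = 700.0          # sung fundamental in Hz (assumed)
bandwidth = 80.0    # formant bandwidth in Hz (assumed)

untuned = resonator_gain(f0, formant_hz=350.0, bandwidth_hz=bandwidth, fs=fs)
tuned = resonator_gain(f0, formant_hz=f0, bandwidth_hz=bandwidth, fs=fs)
print("gain at f0, F1 below f0:", round(untuned, 1))
print("gain at f0, F1 raised to f0:", round(tuned, 1))
```

With F1 well below the fundamental, the first harmonic samples the skirt of the resonance; moving F1 up to the fundamental places the harmonic on the resonance peak, which is the louder configuration the example describes.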

Figure 3.12

Nasal Sounds

Spectral Shaping

  • The nasal and oral components of the vocal tract are coupled by the velum.
  • When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
  • The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity, and are appropriately called nasal sounds. Examples: “nose” and “mouse”.

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping

  • Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
  • The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
  • Two dominant effects characterize nasalization:
      • Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
      • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

  • The previous section discussed the effect of vocal tract shape on sound production.
  • Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
      • build-up of pressure behind the palate, and
      • abrupt release of pressure.

Source Generation: Plosives “Drop”

[Figure: the word “drop”, annotated with VOT (voice onset time) and aspiration]

Fricatives

Source Generation

  • Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).
  • As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
      • There is no harmonic structure with these types of inputs.
      • The source spectrum is shaped at all frequencies by |H(ω)|.
  • Note that the spectrum of the noise was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives “NASA”

Source Generation

  • There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
      • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”.
      • Plosives: created with an impulsive source within the oral tract. Example: “top”.
      • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”.
  • However, these source types are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
      • “zebra” vs. “sheba” (fricatives)
      • “bin” vs. “pin” (plosives)

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

  • The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word “to”.
  • A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

  • Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
  • This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example, the Hamming window:

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \qquad 0 \le n \le N_w - 1$$

  • The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
  • The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

    where x[n, τ] = w[n, τ] x[n] represents the windowed speech segments as a function of the window center at time τ.
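The windowing and transform steps above can be sketched in a few lines. The code below is a minimal illustration, not a reference implementation: it uses a direct (slow) DFT, positions each window to start at sample τ rather than centering it there (which changes only the phase of each frame, not the magnitude), and the function names and the `hop` parameter are hypothetical.

```python
import cmath
import math

def hamming(Nw):
    """Hamming window of length Nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]

def stft_frame(x, tau, Nw, n_fft):
    """DFT (bins 0..n_fft/2) of the windowed segment starting at sample tau."""
    w = hamming(Nw)
    seg = [w[n] * x[tau + n] for n in range(Nw)]       # windowed speech segment
    return [sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(Nw))
            for k in range(n_fft // 2 + 1)]

def spectrogram(x, Nw, hop, n_fft):
    """Squared STFT magnitude, one row per frame, with frames spaced hop samples apart."""
    return [[abs(X) ** 2 for X in stft_frame(x, tau, Nw, n_fft)]
            for tau in range(0, len(x) - Nw + 1, hop)]

# A pure tone at DFT bin 8 as a toy input (assumed test signal).
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(128)]
S = spectrogram(x, Nw=64, hop=32, n_fft=64)
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in S]
```

Here `hop` plays the role of the frame interval: a smaller hop yields more frames per second at the cost of more computation.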

Spectrographic Analysis of Speech

  • The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

  • S(ω, τ) is a 2-D (two-dimensional) representation of the “energy density” of the signal.
  • For each window position τ, one could plot S(ω, τ). A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
      • Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
      • Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n - kP]:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where Ĥ(ω) = H(ω)G(ω), ω_k = (2π/P)k, and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech

  • The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window w[n, τ].
  • Narrowband spectrogram:
      • Uses a “long” window, with a duration of typically at least two pitch periods.
      • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation to the spectrogram expression of the previous slide holds (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \left|\hat{H}(\omega_k)\right|^2 \left|W(\omega - \omega_k, \tau)\right|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

  • The wideband spectrogram is defined by a short window, with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, \left|\hat{H}(\omega)\right|^2 E[\tau]$$

    where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} \left|x[n, \tau]\right|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
      • The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
  • Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
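The rule of thumb from these slides (narrowband: window of at least two pitch periods; wideband: window shorter than one pitch period) can be turned into a small helper for choosing window lengths. This is a sketch under assumptions: the 100 Hz fundamental and 16 kHz sampling rate are illustrative values, and the "intermediate" label for windows between one and two pitch periods is a name introduced here.

```python
def classify_window(window_ms, f0_hz):
    """Classify an analysis window by how many pitch periods it spans."""
    pitch_period_ms = 1000.0 / f0_hz
    periods = window_ms / pitch_period_ms
    if periods >= 2.0:
        return "narrowband"      # harmonics resolved, events smeared in time
    if periods < 1.0:
        return "wideband"        # pitch pulses resolved, harmonics smeared
    return "intermediate"        # between the two regimes (label is an assumption)

def window_length_samples(window_ms, fs_hz):
    """Window duration converted to a whole number of samples."""
    return int(round(window_ms * fs_hz / 1000.0))

f0 = 100.0       # assumed fundamental frequency in Hz
fs = 16000.0     # assumed sampling rate in Hz
for window_ms in (20.0, 4.0):
    print(window_ms, "ms:", classify_window(window_ms, f0),
          "with", window_length_samples(window_ms, fs), "samples")
```

For a higher-pitched voice the same 4-ms window spans a larger fraction of a pitch period, so whether a given window behaves as wideband depends on the speaker's pitch as well as on the window length.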

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram; horizontal axis Time [s], 0 to 3; vertical axis Frequency [Hz], 0 to 8000]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance “She had your dark suit in greasy wash water all year”, annotated with phone and word labels]

MATLAB

[Figure: spectrogram; horizontal axis Time [s], 0 to 3; vertical axis Frequency [Hz], 0 to 8000]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance “She had your dark suit in greasy wash water all year”, annotated with phone and word labels]

Categorization of Speech Sounds

  • A sound source can be created with either the vocal folds or a constriction in the vocal tract.
  • Speech sounds can also be classified from the following perspectives:
      1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
      2. The shape of the vocal tract (place and manner of articulation):
           • the place of the tongue hump along the oral tract, and
           • the degree of constriction of the hump.
           • The shape is also determined by a possible connection to the nasal passage by way of the velum.
      3. The time-domain waveform, which gives the pressure change with time at the lips output.
      4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

  • Phoneme: a fundamental distinctive unit of a language.
      • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
      • Different languages contain different phoneme sets.
      • Syllables contain one or more phonemes.
      • Words are formed from one or more syllables.
      • Phrases are concatenations of words.
  • If the first two factors are used to study speech sounds, this is referred to as articulatory phonetics.
  • If the last two descriptors are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

  • One broad classification for the English language is done in terms of:
      • Vowels
      • Consonants
      • Diphthongs
      • Affricates
      • Semi-vowels
  • In the next slide this classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

  • Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
  • Articulatory features (corresponding to the first 2 category descriptors) include:
      • Vocal fold state: vibrating or open.
      • Tongue position and height: front, central, or back along the palate.
      • Constriction: partial or complete.
      • Velum state: nasal sound or not a nasal sound.

Elements of a Language

  • In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
      • 11 in Polynesian
      • 141 in the “click” language of Khoisan
  • The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
  • The articulatory properties are influenced by:
      • adjacent phonemes,
      • speaking rate,
      • emphasis in speaking, and
      • the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: “butter”, “but”, and “to”, where the t in each word is somewhat different.
  • Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

x_g[n] = h[n] * (g[n] * p[n])

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response h_f[n]:

x_q[n] = h_f[n] * (q[n] u[n])

Example 3.4 (cont.)

We assume in this simplified model that the results of the two airflow sources add:

x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
u[n] is modified by the oral cavity.
x_q[n] can be influenced by the back cavity.
Sources of non-linear effects (distributed sources due to traveling vortices).
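The additive two-source model of Example 3.4 can be sketched numerically. The following is a minimal illustration in plain Python, not the author's implementation: the sequences g, h, and hf are short made-up responses, the pitch period P = 8 is arbitrary, and the helper names convolve and voiced_fricative are hypothetical.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def voiced_fricative(g, h, hf, P, n_periods, seed=0):
    """Return (u, xg, xq, x) for x[n] = h*u + hf*(q u), with u = g*p."""
    N = P * n_periods
    rng = random.Random(seed)
    p = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # impulse train, period P
    u = convolve(g, p)[:N]                               # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(N)]          # white-noise source (assumed)
    xg = convolve(h, u)[:N]                              # periodic component
    modulated = [qn * un for qn, un in zip(q, u)]        # glottal flow modulates the noise
    xq = convolve(hf, modulated)[:N]                     # noise component via front cavity
    x = [a + b for a, b in zip(xg, xq)]                  # the two components add
    return u, xg, xq, x

# Toy, hypothetical responses (not measured data)
g = [0.0, 0.5, 1.0, 0.5]     # glottal flow over one cycle
h = [1.0, 0.5]               # vocal tract impulse response
hf = [1.0, -0.9]             # front-cavity impulse response
u, xg, xq, x = voiced_fricative(g, h, hf, P=8, n_periods=4)
```

Because the noise is multiplied by u[n], the noise component vanishes wherever the glottal flow is zero, which is why the noise in a voiced fricative appears in bursts synchronized with the pitch period.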

Elements of a Language: Fricatives

Spectrogram: Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: An unvoiced fricative contains only noise. A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants.

Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all 4 steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar." After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have present a periodic source throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n, m].

h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$

We have assumed that the two outputs can be linearly combined.
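The generalized convolution of Example 3.5 can be sketched as follows. This is only an illustration under stated assumptions: the time-varying responses h(n, m) and hf(n, m) are hypothetical one-pole decays whose rate drifts with n, the sum over m is truncated to M terms, and the input is taken to be zero before n = 0. The names tv_filter and plosive_output are made up for this sketch.

```python
def tv_filter(h, x, M):
    """y[n] = sum_{m=0}^{M-1} h(n, m) x[n - m], with x[n] = 0 for n < 0.

    h(n, m) is the response at time n to a unit sample applied m samples earlier.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for m in range(min(M, n + 1)):
            acc += h(n, m) * x[n - m]
        y.append(acc)
    return y

def plosive_output(h, hf, u, M):
    """Sum of the periodic path (h driven by u) and the burst path (hf driven by an impulse)."""
    delta = [1.0] + [0.0] * (len(u) - 1)          # idealized burst at n = 0
    periodic = tv_filter(h, u, M)
    burst = tv_filter(hf, delta, M)
    return [a + b for a, b in zip(periodic, burst)]

# Hypothetical slowly varying responses (illustrative only)
def h(n, m):
    return (0.5 + 0.01 * n) ** m

def hf(n, m):
    return (0.9 - 0.02 * n) ** m

u = [1.0 if n % 5 == 0 else 0.0 for n in range(20)]   # toy glottal pulses
x = plosive_output(h, hf, u, M=8)
```

When h(n, m) does not depend on n, tv_filter reduces to ordinary convolution, which is a quick way to check the sketch against the time-invariant case.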

Elements of a Language: Transitional Speech Sounds

Diphthongs
Vowel-like nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds.
Glides (w as in "we" and y as in "you") and
Liquids (r as in "read" and l as in "let").

Glides: Greater constriction of the oral tract during the transition, and greater speed of the oral tract movement, compared to diphthongs.

Figure 3.28 – Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
tS as in "chew" (unvoiced)
J as in "just" (voiced)

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

Local – the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe"; "sweep" vs. "seep."

Global – the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/energy (loudness)
Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 – Global Coarticulation

Page 31: Speech Processing

Vowels

Example 3.2

Consider a periodic glottal flow source of the form

u[n] = g[n] * p[n]

where g[n] is the airflow over one glottal cycle, p[n] is the unit sample train with spacing P, and * denotes convolution. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

x[n] = h[n] * (g[n] * p[n])

A window centered at time τ, w[n, τ], is applied to the vocal tract output to obtain the speech segment

x[n, τ] = w[n, τ]{h[n] * (g[n] * p[n])}

Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained.

Example 3.2 (cont.)

$$X(\omega,\tau) = \frac{1}{P}\,W(\omega,\tau) \circledast \Big[H(\omega)G(\omega)\sum_{k=-\infty}^{\infty}\delta(\omega-\omega_k)\Big] = \frac{1}{P}\sum_{k=-\infty}^{\infty} H(\omega_k)G(\omega_k)\,W(\omega-\omega_k,\tau)$$

where W(ω, τ) is the Fourier transform of w[n, τ], ω_k = (2π/P)k, and 2π/P is the fundamental frequency (pitch). Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics ω_1, ω_2, ..., ω_N is determined by the spectral envelope |H(ω)G(ω)|, consisting of glottal and vocal tract contributions (unlike Example 3.1, which consisted only of the glottal contribution).

Example 3.2 (cont.)

Example 3.2 (cont.)

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by
the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing) and by
the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude |X(ω, τ)| because of sparse sampling of the spectral envelope |H(ω)G(ω)| by the source harmonics. This is especially the case for high-pitched speech.
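The harmonic sampling of the envelope can be checked numerically. The sketch below is an illustration under simplifying assumptions, not part of the original example: it uses a rectangular window spanning a whole number of pitch periods (so W(0) = N), a steady-state periodic signal, and a made-up combined response h_hat standing in for g[n] * h[n]. Under those assumptions the relation X(ω_k, τ) = (1/P) H(ω_k)G(ω_k) W(0, τ) holds exactly at the harmonics, and the spectrum is zero midway between them.

```python
import cmath
import math

def dtft(seq, w):
    """Fourier transform of a finite sequence at frequency w (radians/sample)."""
    return sum(s * cmath.exp(-1j * w * n) for n, s in enumerate(seq))

def steady_state(h_hat, P, N):
    """x[n] = sum_r h_hat[n - rP] for n = 0..N-1 (response to an endless impulse train)."""
    return [sum(h_hat[m] for m in range(n % P, len(h_hat), P)) for n in range(N)]

P, N = 8, 64          # pitch period and window length in samples (N is a multiple of P)
h_hat = [1.0, 0.8, 0.5, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005]   # toy g[n] * h[n]
x = steady_state(h_hat, P, N)   # rectangular window: N unweighted samples

rows = []
for k in range(P // 2 + 1):
    w_k = 2 * math.pi * k / P
    at_harmonic = abs(dtft(x, w_k))                    # |X| at the k-th harmonic
    envelope = (N / P) * abs(dtft(h_hat, w_k))         # (1/P) |H(w_k) G(w_k)| W(0)
    midway = abs(dtft(x, w_k + math.pi / P))           # halfway to the next harmonic
    rows.append((k, at_harmonic, envelope, midway))
    print(f"k={k}  |X|={at_harmonic:.4f}  envelope sample={envelope:.4f}  midway={midway:.1e}")
```

Raising the pitch (smaller P) leaves fewer harmonics below any given frequency, so fewer samples of the envelope are available to locate a formant.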

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency.
A formant corresponds to a vocal tract pole (resonant frequency).
Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) ω_1 is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
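The effect of moving F1 onto the first harmonic can be illustrated with a single two-pole resonance standing in for the first formant. All numbers below (a 900 Hz sung pitch, formants at 500 Hz and 900 Hz, a 100 Hz bandwidth, a 16 kHz sampling rate) are assumptions chosen for illustration, and resonance_gain is a hypothetical helper, not a model of any particular singer.

```python
import cmath
import math

def resonance_gain(f_hz, formant_hz, bandwidth_hz=100.0, fs=16000.0):
    """Magnitude response at f_hz of a two-pole resonator centered at formant_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs)               # pole radius from bandwidth
    pole = r * cmath.exp(1j * 2 * math.pi * formant_hz / fs)
    z = cmath.exp(1j * 2 * math.pi * f_hz / fs)              # point on the unit circle
    return 1.0 / abs((1 - pole / z) * (1 - pole.conjugate() / z))

f0 = 900.0                                        # first harmonic of the sung tone (assumed)
untuned = resonance_gain(f0, formant_hz=500.0)    # F1 well below the harmonic
tuned = resonance_gain(f0, formant_hz=900.0)      # F1 raised to match the harmonic
print(f"gain at f0, F1 = 500 Hz: {untuned:.1f}")
print(f"gain at f0, F1 = 900 Hz: {tuned:.1f}")
```

With these values the harmonic is amplified several times more when the formant coincides with it, which is the mechanism the example attributes to the widened jaw.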

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse."

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:
Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage.
Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
Build-up of pressure behind the palate, and
Abrupt release of pressure.

Source Generation: Plosives, "Drop"

[Figure annotations: VOT, Aspiration]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); this constriction generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source). There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by |H(ω)|.

Note that the spectrum of the noise was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.

Categorization of Sound by Source

Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
Fricatives – sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin."
Plosives – created with an impulsive source within the oral tract. Example: "top."
Whispers – a barrier made at the vocal folds by partially closing them, but without oscillations. Example: "he."

However, these subcategories do not relate exclusively to unvoiced sounds. That is, the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives).

Categorization of Sound by Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to." A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window. This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: Hamming window
w[n, τ] = 0.54 - 0.46 cos[2π(n - τ)/(N_w - 1)] for 0 ≤ n ≤ N_w - 1

The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\,e^{-j\omega n}$$

where x[n, τ] = w[n, τ] x[n] represents the windowed speech segments as a function of the window center at time τ.
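A direct, unoptimized sketch of the short-time Fourier transform with a Hamming window is given below. It is meant only to make the definitions concrete: it assumes the window occupies samples τ through τ + N_w - 1 (τ marks the window start here rather than its center), evaluates the sum by brute force rather than with an FFT, and uses hypothetical names (hamming, stft_frame, spectrogram) and toy parameters.

```python
import cmath
import math

def hamming(Nw):
    """Hamming window of length Nw: 0.54 - 0.46 cos(2 pi n / (Nw - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]

def stft_frame(x, tau, Nw, omegas):
    """X(w, tau) = sum_n w[n - tau] x[n] e^{-jwn} over tau <= n <= tau + Nw - 1."""
    w = hamming(Nw)
    frame = []
    for om in omegas:
        acc = 0j
        for n in range(tau, min(tau + Nw, len(x))):
            acc += w[n - tau] * x[n] * cmath.exp(-1j * om * n)
        frame.append(acc)
    return frame

def spectrogram(x, Nw, hop, n_bins):
    """|X(w, tau)|^2 on n_bins frequencies from 0 to pi, one row per frame."""
    omegas = [math.pi * k / (n_bins - 1) for k in range(n_bins)]
    taus = range(0, len(x) - Nw + 1, hop)          # hop = frame interval in samples
    return [[abs(X) ** 2 for X in stft_frame(x, tau, Nw, omegas)] for tau in taus]

# Toy input: a sinusoid at frequency 2*pi*4/32 = pi/4 radians/sample
x = [math.cos(2 * math.pi * 4 * n / 32) for n in range(96)]
S = spectrogram(x, Nw=32, hop=16, n_bins=17)
peak_bins = [max(range(17), key=row.__getitem__) for row in S]
```

For this stationary sinusoid every frame peaks at the same frequency bin; for speech, the peak locations change from frame to frame, which is what the spectrogram is designed to show.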

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

S(ω, τ) = |X(ω, τ)|^2

S(ω, τ) is a 2-D (two-dimensional) representation of the "energy density" of the signal.

For each window position τ, one could plot S(ω, τ). A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
Narrowband – gives good spectral resolution: a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband – gives good temporal resolution: a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n - kP]:

x[n, τ] = w[n, τ]{(p[n] * g[n]) * h[n]}
x[n, τ] = w[n, τ](p[n] * ĥ[n])

where the glottal waveform over a cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty}\hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2$$

where Ĥ(ω) = H(ω)G(ω), ω_k = (2π/P)k, and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n, τ].

Narrowband spectrogram: uses a "long" window with a duration of typically at least two pitch periods. Under the conditions that
the main lobes of the shifted window Fourier transforms are non-overlapping and
the corresponding transform side lobes are negligible,
the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty}|\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k,\tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
Widening the Fourier transform causes neighboring window transforms at the harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega,\tau) \approx \beta\,|\hat{H}(\omega)|^2\,E[\tau]$$

where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window,

$$E[\tau] = \sum_{n=-\infty}^{\infty}|x[n,\tau]|^2$$
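The energy term E[τ] can be computed directly to see why it fluctuates within a pitch period when the window is short. The sketch below assumes a toy "voiced" waveform made of one decaying exponential per pitch period, a 10-sample Hamming window against a 40-sample pitch period, and a window that starts at τ; all of these choices, and the name short_time_energy, are illustrative assumptions.

```python
import math

def hamming(Nw):
    """Hamming window of length Nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]

def short_time_energy(x, tau, Nw):
    """E[tau] = sum_n |w[n - tau] x[n]|^2 over the Nw samples starting at tau."""
    w = hamming(Nw)
    return sum((w[n - tau] * x[n]) ** 2 for n in range(tau, tau + Nw))

P, a = 40, 0.8                                   # pitch period and decay rate (assumed)
x = [a ** (n % P) for n in range(200)]           # toy pulse-like response every P samples
Nw = 10                                          # window shorter than one pitch period
energy = [short_time_energy(x, tau, Nw) for tau in range(120)]

print(f"E at pulse onset (tau=40):    {energy[40]:.4f}")
print(f"E half a period later (tau=60): {energy[60]:.2e}")
```

The energy repeats every P samples and is far larger when the window sits on a pulse onset than when it sits half a period later; on a wideband spectrogram this periodic rise and fall appears as vertical striations.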

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
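The contrast between the two window lengths can be reproduced on a bare impulse train. In this sketch the pitch period is 40 samples, the "long" window is 100 samples (2.5 pitch periods), and the "short" window is 20 samples (half a pitch period); these numbers are assumptions for illustration and are not the 20-ms and 4-ms settings of the figure. The sketch compares the short-time spectrum magnitude at the first harmonic with the magnitude midway between the first and second harmonics.

```python
import cmath
import math

def hamming(Nw):
    """Hamming window of length Nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (Nw - 1)) for n in range(Nw)]

def stft_mag(x, tau, Nw, omega):
    """|X(omega, tau)| with a Hamming window covering tau <= n <= tau + Nw - 1."""
    w = hamming(Nw)
    return abs(sum(w[n - tau] * x[n] * cmath.exp(-1j * omega * n)
                   for n in range(tau, tau + Nw)))

P = 40                                               # pitch period in samples (assumed)
x = [1.0 if n % P == 0 else 0.0 for n in range(400)] # impulse train
w1 = 2 * math.pi / P                                 # first harmonic
mid = 3 * math.pi / P                                # midway between harmonics 1 and 2

def contrast(Nw, tau):
    """Ratio of the spectrum at a harmonic to the spectrum between harmonics."""
    return stft_mag(x, tau, Nw, w1) / stft_mag(x, tau, Nw, mid)

long_contrast = contrast(Nw=100, tau=80)     # long window: spans three impulses
short_contrast = contrast(Nw=20, tau=110)    # short window: spans one impulse
print(f"long window contrast:  {long_contrast:.2f}")
print(f"short window contrast: {short_contrast:.2f}")
```

With the long window the harmonic stands well above its surroundings (resolved harmonic lines), while with the short window the spectrum is flat across frequency because only one impulse is ever under the window, so no harmonic structure can appear.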

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram; frequency axis 0 to 8000 Hz, time axis 0 to 3 s]

MATLAB

[Figure: normalized-magnitude waveform and spectrogram (frequency 0 to 8000 Hz, time 0 to 3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with word labels and the phone labels h sh iy hv ae dcl jh ax-h dcl d aa r kcl k s ux tcl ax-h n gcl g r iy s ix w aa sh w ao dx axr ao l y ih axr h]

MATLAB

[Figure: spectrogram; frequency axis 0 to 8000 Hz, time axis 0 to 3 s]

MATLAB

[Figure: normalized-magnitude waveform and spectrogram (frequency 0 to 8000 Hz, time 0 to 3 s) of the utterance "She had your dark suit in greasy wash water all year", annotated with word labels and the phone labels h sh iy hv ae dcl jh ax-h dcl d aa r kcl k s ux tcl ax-h n gcl g r iy s ix w aa sh w ao dx axr ao l y ih axr h]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme – a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.

If the first two factors are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

Rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter," "but," and "to," where the t in each word is somewhat different.

Motor theory of perception – uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

Source: Quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.

System: Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variations:
The place and degree of constriction of the tongue hump, and
The cross-section and length of the vocal tract
differ, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

April 22 2023 Veton Keumlpuska 87

Elements of a Language Fricatives Spectrogram

Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while

Voiced fricatives often show both noise and harmonics

Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on

quasi-periodic signal

Elements of a Language Whisper Whisper

Forms a class of its own under general category of Consonants

Turbulent flow is produced at the glottis rather than at the vocal tract constriction

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 324 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 323

April 22 2023 Veton Keumlpuska 91

Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System
  • Constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short duration
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • While the oral tract is closed, we hear a low-frequency vibration caused by vocal fold vibrations propagating through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives
Waveform

Elements of a Language: Plosives
Spectrogram

Elements of a Language: Plosives
Example 3.5 A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:

$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$
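The generalized convolution of Example 3.5 can be sketched as follows. This is a minimal illustration, not the slides' own implementation: the gliding-resonance form chosen for $h[n, m]$, the fixed front-cavity response, the simplification g[n] = $\delta[n]$, and all names and parameter values are assumptions made for the example.

```python
import math


def time_varying_output(u, h, max_lag):
    """x[n] = sum over m of h(n, m) * u[n - m], for lags m = 0 .. max_lag - 1."""
    out = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(max_lag, n + 1)):
            acc += h(n, m) * u[n - m]
        out.append(acc)
    return out


def gliding_tract(n, m, fs=8000.0, r=0.95, f_start=300.0, f_end=700.0, glide_len=200):
    """Hypothetical h[n, m]: a decaying cosine whose resonance depends on time n."""
    frac = min(n, glide_len) / glide_len
    theta = 2 * math.pi * (f_start + frac * (f_end - f_start)) / fs
    return (r ** m) * math.cos(theta * m)


def front_cavity(n, m, fs=8000.0, r=0.85, freq=3000.0):
    """Hypothetical h_f[n, m]; kept independent of n for simplicity."""
    return (r ** m) * math.cos(2 * math.pi * freq * m / fs)


def voiced_plosive(num_samples=240, pitch_period=80, max_lag=64):
    # Burst idealized as an impulse at n = 0
    burst = [1.0] + [0.0] * (num_samples - 1)
    # Periodic source; g[n] is taken as a unit sample here, so u[n] = p[n]
    u = [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    voiced = time_varying_output(u, gliding_tract, max_lag)
    release = time_varying_output(burst, front_cavity, max_lag)
    return [a + b for a, b in zip(voiced, release)]   # linear combination
```

When h(n, m) does not depend on n, `time_varying_output` reduces to ordinary convolution, which is a quick way to sanity-check it.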

Elements of a Language: Transitional Speech Sounds
Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds
Semi-vowels
Two categories of vowel-like sounds:
  • Glides (/w/ as in "we" and /y/ as in "you")
  • Liquids (/r/ as in "read" and /l/ as in "let")
Compared to diphthongs, glides have
  • greater constriction of the oral tract during the transition, and
  • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference as compared to fricatives is that affricates have
  • a fricative portion preceded by a complete constriction of the oral cavity,
  • formed at the same place as for the plosive.
Examples:
  • /tS/ as in "chew" (unvoiced)
  • /J/ as in "just" (voiced)

Coarticulation
  • Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  • Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g., "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
Prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)
These rules are followed to convey different
  • meaning,
  • stress, and
  • emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping: Nose
  • Spectral Shaping: Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation: Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation: Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language: Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language: Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language: Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language: Fricatives (2)
  • Elements of a Language: Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language: Plosives
  • Elements of a Language: Plosives (2)
  • Elements of a Language: Plosives (3)
  • Elements of a Language: Plosives (4)
  • Elements of a Language: Transitional Speech Sounds
  • Elements of a Language: Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language: Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody: The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation
Page 32: Speech Processing

Example 3.2 Consider a periodic glottal flow source of the form

$u[n] = g[n] * p[n]$

where g[n] is the airflow over one glottal cycle and p[n] is the unit sample train with spacing P. When the sequence u[n] is passed through a linear time-invariant vocal tract with impulse response h[n], the vocal tract output is given by

$x[n] = h[n] * (g[n] * p[n])$

A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment

$x[n, \tau] = w[n, \tau] \{ h[n] * (g[n] * p[n]) \}$

Using the multiplication and convolution theorems, the Fourier transform of the speech segment (its frequency-domain representation) is obtained:

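Example 3.2 (cont.)

$X(\omega, \tau) = \frac{1}{P} W(\omega, \tau) \circledast \left[ H(\omega) G(\omega) \sum_{k} \delta(\omega - \omega_k) \right] = \frac{1}{P} \sum_{k} H(\omega_k) G(\omega_k)\, W(\omega - \omega_k, \tau)$

where $\circledast$ denotes convolution in frequency, $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency (pitch).

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega) G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which has only a glottal contribution).

Example 3.2 (cont.): Figure 3.11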

Example 3.2 (cont.)
The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by
  • the nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  • the manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$, because of the sparse sampling of the spectral envelope $|H(\omega) G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
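The sparse sampling of the envelope by the harmonics can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: a single two-pole resonator at 500 Hz stands in for the envelope, the sampling rate is 8 kHz, and the two pitch values (100 Hz and 350 Hz) are arbitrary.

```python
import cmath
import math


def resonator_magnitude(freq_hz, formant_hz=500.0, r=0.97, fs=8000.0):
    """|H| of a two-pole resonator, a toy one-formant spectral envelope."""
    theta = 2 * math.pi * formant_hz / fs
    z_inv = cmath.exp(-1j * 2 * math.pi * freq_hz / fs)
    return abs(1.0 / (1 - 2 * r * math.cos(theta) * z_inv + r * r * z_inv ** 2))


def harmonic_samples(f0_hz, fmax_hz=4000.0):
    """Envelope values at the harmonics k * f0 up to fmax."""
    count = int(fmax_hz // f0_hz)
    return [(k * f0_hz, resonator_magnitude(k * f0_hz)) for k in range(1, count + 1)]


def strongest_harmonic(f0_hz):
    """Frequency of the harmonic with the largest envelope value."""
    return max(harmonic_samples(f0_hz), key=lambda pair: pair[1])[0]
```

With a 100 Hz pitch the strongest harmonic lands on the 500 Hz formant, while with a 350 Hz pitch the nearest harmonics sit at 350 Hz and 700 Hz, so the formant location cannot be read directly from the harmonic peaks.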

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (a resonant frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3 A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12
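The effect described in Example 3.3 can be sketched with a toy one-formant model. The numbers are hypothetical: a 700 Hz sung pitch, an untuned first formant at 400 Hz, a two-pole resonator with pole radius 0.97, and an 8 kHz sampling rate are all assumptions, not values from the slides.

```python
import cmath
import math


def formant_gain(freq_hz, formant_hz, r=0.97, fs=8000.0):
    """Magnitude response at freq_hz of a two-pole resonator centered at formant_hz."""
    theta = 2 * math.pi * formant_hz / fs
    z_inv = cmath.exp(-1j * 2 * math.pi * freq_hz / fs)
    return abs(1.0 / (1 - 2 * r * math.cos(theta) * z_inv + r * r * z_inv ** 2))


def tuning_gain_db(f0_hz=700.0, f1_untuned_hz=400.0):
    """Level change of the first harmonic when F1 is moved onto f0."""
    untuned = formant_gain(f0_hz, f1_untuned_hz)
    tuned = formant_gain(f0_hz, f0_hz)
    return 20 * math.log10(tuned / untuned)
```

Under these assumptions, `tuning_gain_db()` is positive: moving the formant onto the first harmonic raises that harmonic's level, which is the mechanism the example attributes to the widened jaw.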

Nasal Sounds

Spectral Shaping
Nasal and oral components of the vocal tract are coupled by the velum. When
  • the velum is lowered, introducing an opening into the nasal passage, and
  • the oral tract is shut off by the tongue or lips,
sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:
  • Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage
  • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation
The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  • Build-up of pressure behind the palate, and
  • abrupt release of pressure

Source Generation: Plosives "Drop"
[Figure annotations: VOT (voice onset time), aspiration]

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (without completely impeding the flow), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives "NASA"

Source Generation
There is another class of source that is generated within the vocal tract; it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source
  • Voiced: speech sounds generated with a periodic glottal source
  • Unvoiced: speech sounds not generated with a periodic glottal source

There is a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of moving air against an oral tract constriction. Example: "thin"
  • Plosives: created with an impulsive source within the oral tract. Example: "top"
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he"

However, these source types are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
  • Fricatives: "zebra" vs. "sheba"
  • Plosives: "bin" vs. "pin"

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window. This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: the Hamming window

$w[n, \tau] = 0.54 - 0.46 \cos\left[ \frac{2\pi (n - \tau)}{N_w - 1} \right]$ for $0 \le n \le N_w - 1$

The window does not necessarily move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
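A minimal sketch of the windowing and per-frame Fourier transform follows. It is written in pure Python for self-containment, and two choices are assumptions of this sketch rather than the slides' convention: each frame is indexed by its start sample instead of its center, and the transform is evaluated on a DFT grid of `n_fft` points. The function names are hypothetical.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (N_w - 1)), 0 <= n <= N_w - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft_frame(x, start, n_w, n_fft):
    """Magnitude of the DFT of one Hamming-windowed segment beginning at `start`."""
    w = hamming(n_w)
    frame = [w[i] * x[start + i] for i in range(n_w)]
    return [
        abs(sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n_fft) for i in range(n_w)))
        for k in range(n_fft // 2 + 1)
    ]


def stft(x, n_w, hop, n_fft):
    """One magnitude spectrum per frame; `hop` is the frame interval in samples."""
    return [stft_frame(x, s, n_w, n_fft) for s in range(0, len(x) - n_w + 1, hop)]
```

Squaring each magnitude gives the energy-density values that a spectrogram displays; the `hop` argument plays the role of the frame interval mentioned above.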

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as

$S(\omega, \tau) = |X(\omega, \tau)|^2$

  • $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position $\tau$, one could plot $S(\omega, \tau)$ against frequency.
  • A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
  • Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:

$x[n, \tau] = w[n, \tau] \{ (p[n] * g[n]) * h[n] \} = w[n, \tau] ( p[n] * \hat{h}[n] )$

where the glottal waveform over a cycle and the vocal tract impulse response have been combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrogram lies in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide yields the following approximation (Exercise 3.8):

$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$

Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
  • Harmonic lines are "resolved" and appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech
Wideband Spectrogram
  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$E[\tau] = \sum_{n} |x[n, \tau]|^2$

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
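The difference between the two window lengths can be checked on a synthetic impulse train. The sketch below is an illustration under assumptions not taken from the slides: an 8 kHz sampling rate, a 200 Hz pitch (40-sample period), an arbitrary 16-sample offset for the first impulse, and a "harmonic contrast" measure (spectrum magnitude at the first harmonic divided by the magnitude halfway to the second) invented for this example.

```python
import cmath
import math

FS = 8000            # sampling rate in Hz (assumed)
PITCH_PERIOD = 40    # samples, i.e. a 200 Hz pitch (assumed)


def hamming(n_w):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def impulse_train(length, period, offset):
    """Unit impulses at offset, offset + period, offset + 2 * period, ..."""
    return [1.0 if n >= offset and (n - offset) % period == 0 else 0.0
            for n in range(length)]


def dtft_magnitude(frame, freq_hz, fs=FS):
    """|X(omega)| of a finite frame, evaluated at a single frequency."""
    omega = 2 * math.pi * freq_hz / fs
    return abs(sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(frame)))


def harmonic_contrast(window_ms):
    """Magnitude at the first harmonic relative to midway between harmonics."""
    n_w = int(FS * window_ms / 1000)
    x = impulse_train(n_w, PITCH_PERIOD, offset=16)
    frame = [w * v for w, v in zip(hamming(n_w), x)]
    f0 = FS / PITCH_PERIOD
    return dtft_magnitude(frame, f0) / dtft_magnitude(frame, 1.5 * f0)
```

With the 20-ms window (four pitch periods here) the contrast is large, so harmonic lines are resolved. With the 4-ms window only one impulse falls inside the frame, the contrast is 1, and no harmonic structure is visible.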

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram; Frequency [Hz] (0 to 8000) versus Time [s] (0 to 3)]

MATLAB
[Figure: normalized-magnitude waveform and spectrogram (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

MATLAB
[Figure: spectrogram; Frequency [Hz] (0 to 8000) versus Time [s] (0 to 3)]

MATLAB
[Figure: normalized-magnitude waveform and spectrogram (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
  2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output
  4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language
  • Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two factors are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two descriptors are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of
  • Vowels
  • Consonants
  • Diphthongs
  • Affricates
  • Semi-vowels

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the "click" language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).

The articulatory properties are influenced by
  • adjacent phonemes,
  • speaking rate,
  • emphasis in speaking, and
  • the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels
  • Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  • Articulatory differences among speakers are one cause of allophonic variations.
  • The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, will vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source
  • Quasi-periodic airflow puffs from the vibrating vocal folds
System
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram
  • Dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
System
  • The location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
Spectrogram
  • Noise-like; energy is concentrated at higher frequencies.

Example 3.4 A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Voiced fricative: simplified model of the output at the lips

$x_g[n] = h[n] * (g[n] * p[n])$, where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component of the turbulent airflow velocity at the constriction is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\, u[n])$

Example 3.4 (cont.)
We assume in this simplified model that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  • u[n] is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).

Elements of a Language: Fricatives
Spectrogram
  • Unvoiced fricatives are characterized by a "noisy" spectrum, while
  • voiced fricatives often show both noise and harmonics.
Waveform
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  • Forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System
  • Constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short duration
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • While the oral tract is closed, we hear a low-frequency vibration caused by vocal fold vibrations propagating through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives
Waveform

Elements of a Language: Plosives
Spectrogram

Elements of a Language: Plosives
Example 3.5 A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:

$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$

Elements of a Language Transitional Speech Sounds Diphthongs

Vowel like nature with vibrating vocal folds

Do not have a steady vocal tract configuration They are produced by

varying in time the vocal tract smoothly between two vowel configurations

Characterized by movement from one vowel target to another

hide Y out W boy O new JU

April 22 2023 Veton Keumlpuska 96

Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like

sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)

Glides Greater constriction of oral tract during the

transition and Greater speed of the oral tract movement

compared to diphthongs

April 22 2023 Veton Keumlpuska 97

Figure 328 ndash Liquids amp Glides

April 22 2023 Veton Keumlpuska 98

Elements of a Language Transitional Speech Sounds Affricates are the counterpart of

diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that

the affricates have A fricative portion preceded by a complete

constriction of the oral cavity Formed at the same place as for the plosive

Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced

April 22 2023 Veton Keumlpuska 99

Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a

target state or shape often the target is never reached Our speech anatomy cannot move to a desired position

instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and

graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level

Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo

Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes

April 22 2023 Veton Keumlpuska 100

Prosody The Melody of Speech Prosody of a language is defined by the

rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)

These rules are followed to convey different Meaning Stress and Emotion

April 22 2023 Veton Keumlpuska 101

Figure 329 - Prosody

April 22 2023 Veton Keumlpuska 102

Figure 330 ndash Global Coarticulation

Page 33: Speech Processing

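Example 3.2

$$X(\omega,\tau) = \frac{1}{P}\, W(\omega) \circledast \Big[ H(\omega)\,G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega-\omega_k) \Big] = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\,G(\omega_k)\,W(\omega-\omega_k)$$

where $W(\omega)$ is the Fourier transform of w[n], $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency (pitch).

Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, which consists of glottal and vocal tract contributions (unlike Example 3.1, which consisted only of the glottal contribution).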

Example 3.2

Example 3.2

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  • The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  • The manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega,\tau)|$, because the source harmonics sample the spectral envelope $|H(\omega)G(\omega)|$ only sparsely. This is especially the case for high-pitched speech.
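The sparse-sampling remark can be made concrete with a few lines of Python. This is only an illustrative sketch: the 4 kHz band, the 700 Hz formant, and the two pitch values are assumed numbers, not taken from the slides, and the helper names are hypothetical.

```python
import numpy as np

def harmonics_below(f0_hz, fmax_hz):
    """Harmonic frequencies k*f0 (k >= 1) that do not exceed fmax_hz."""
    count = int(np.floor(fmax_hz / f0_hz))
    return f0_hz * np.arange(1, count + 1)

def nearest_harmonic_offset(f0_hz, formant_hz, fmax_hz=4000.0):
    """Distance in Hz from a formant to the closest source harmonic."""
    h = harmonics_below(f0_hz, fmax_hz)
    return float(np.min(np.abs(h - formant_hz)))

# Compare a low-pitched and a high-pitched voice (assumed values).
for f0 in (100.0, 400.0):
    n = len(harmonics_below(f0, 4000.0))
    miss = nearest_harmonic_offset(f0, 700.0)
    print(f"f0 = {f0:.0f} Hz: {n} harmonics below 4 kHz, "
          f"closest harmonic is {miss:.0f} Hz away from a 700 Hz formant")
```

The higher the pitch, the fewer samples of the envelope are available and the farther the nearest harmonic can lie from a formant peak.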

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonant frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).

In signal processing algorithms that require formants, the sparsity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
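As a rough numerical illustration of this formant tuning, the sketch below evaluates the gain of a single two-pole resonator at the sung fundamental, once with F1 well below the fundamental and once with F1 moved onto it. The resonator model, the 16 kHz sampling rate, the 80 Hz bandwidth, and the 800 Hz fundamental are all assumptions chosen for illustration, not values from the slides.

```python
import numpy as np

def resonator_gain(f_hz, formant_hz, bandwidth_hz, fs_hz=16000.0):
    """Magnitude response at f_hz of a two-pole digital resonator."""
    r = np.exp(-np.pi * bandwidth_hz / fs_hz)      # pole radius from bandwidth
    theta = 2.0 * np.pi * formant_hz / fs_hz       # pole angle from formant
    z = np.exp(1j * 2.0 * np.pi * f_hz / fs_hz)    # evaluation point on unit circle
    denom = 1.0 - 2.0 * r * np.cos(theta) / z + (r * r) / (z * z)
    return float(1.0 / np.abs(denom))

f0 = 800.0  # assumed sung fundamental (Hz)
untuned = resonator_gain(f0, formant_hz=400.0, bandwidth_hz=80.0)
tuned = resonator_gain(f0, formant_hz=800.0, bandwidth_hz=80.0)
print(f"gain at f0 with F1 = 400 Hz: {20 * np.log10(untuned):.1f} dB")
print(f"gain at f0 with F1 = 800 Hz: {20 * np.log10(tuned):.1f} dB")
```

With F1 moved onto the first harmonic, the harmonic samples the peak of the resonance instead of its skirt, which is the effect the example describes.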

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum.

When the velum is lowered (introducing an opening into the nasal passage) and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

Two dominant effects characterize nasalization:
  • Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage.
  • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  • Build-up of pressure behind the palate, and
  • Abrupt release of pressure.

Source Generation: Plosives, "Drop"

[Figure: the word "drop", with the VOT and the aspiration marked]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

Another class of source is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.

This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  • Plosives: created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  • "zebra" vs. "sheba": fricatives
  • "bin" vs. "pin": plosives

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".

A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.

This window, w[n], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example, the Hamming window:

$$w[n,\tau] = 0.54 - 0.46\cos\!\left[\frac{2\pi(n-\tau)}{N_w-1}\right] \quad \text{for } 0 \le n \le N_w-1$$

The window typically does not move one sample at a time but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n}$$

where $x[n,\tau] = w[n,\tau]\,x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
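A minimal NumPy sketch of this sliding-window analysis is given below. It is not the slides' MATLAB code: the function name `stft`, the 8 kHz sampling rate, the 256-sample window, the 128-sample frame interval, and the 1 kHz test tone are assumptions made for illustration. Each frame's DFT is referenced to the start of the frame rather than to absolute time, which changes the phase but not the magnitude.

```python
import numpy as np

def stft(x, win_len, hop, n_fft=None):
    """Short-time Fourier transform with a Hamming analysis window.

    Returns an array of shape (n_frames, n_fft // 2 + 1); row i is the
    one-sided DFT of the windowed segment starting at sample i * hop.
    """
    x = np.asarray(x, dtype=float)
    n_fft = n_fft or win_len
    w = np.hamming(win_len)  # 0.54 - 0.46*cos(2*pi*n/(win_len - 1))
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * w
                       for i in range(n_frames)])
    return np.fft.rfft(frames, n=n_fft, axis=1)

fs = 8000                            # assumed sampling rate (Hz)
t = np.arange(fs) / fs               # one second of samples
x = np.sin(2 * np.pi * 1000 * t)     # 1 kHz test tone
X = stft(x, win_len=256, hop=128)    # hop = frame interval in samples
S = np.abs(X) ** 2                   # squared magnitude of each frame
peak_hz = np.argmax(S[0]) * fs / 256 # frequency of the strongest bin
print(S.shape, peak_hz)
```

The `hop` argument plays the role of the frame interval mentioned above: a smaller hop gives a higher frame rate at the cost of more transforms.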

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega,\tau) = |X(\omega,\tau)|^2$$

$S(\omega,\tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.

For each window position $\tau$ one could plot $S(\omega,\tau)$. A more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  • Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$:

$$x[n,\tau] = w[n,\tau]\,\{(p[n]*g[n])*h[n]\} = w[n,\tau]\,(p[n]*\tilde{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n]*h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\,W(\omega-\omega_k)\right|^2$$

where $\tilde{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n].

Narrowband spectrogram:
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\,|W(\omega-\omega_k)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are "resolved" and appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).

Shortening the window widens its Fourier transform (recall the uncertainty principle).

The widened window transforms of neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega,\tau) \approx \beta\,|\tilde{H}(\omega)|^2\,E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency.

Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
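The origin of the vertical striations can be checked on a toy periodic signal. The sketch below measures how much the windowed energy fluctuates as a 20-ms and a 4-ms Hamming window slide along the waveform. The synthetic signal (an impulse train driving one decaying resonance), the 8 kHz sampling rate, and the 5-ms pitch period are assumptions for illustration, not the speech used in Figure 3.15.

```python
import numpy as np

fs = 8000                      # assumed sampling rate (Hz)
P = 40                         # assumed pitch period: 5 ms (200 Hz)
n = np.arange(1600)            # 0.2 s of samples

p = np.zeros(len(n))
p[::P] = 1.0                   # impulse train p[n]
k = np.arange(P)
h = 0.9 ** k * np.cos(2 * np.pi * 800 * k / fs)   # toy decaying resonance
x = np.convolve(p, h)[:len(n)]                    # periodic "voiced" signal

def frame_energy(x, win_len, hop=4):
    """Energy of the Hamming-windowed signal at each window position."""
    w = np.hamming(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    return np.array([np.sum((x[s:s + win_len] * w) ** 2) for s in starts])

def fluctuation(e):
    """Relative fluctuation: standard deviation divided by mean."""
    return float(np.std(e) / np.mean(e))

nb = fluctuation(frame_energy(x, win_len=160))    # 20 ms: spans 4 periods
wb = fluctuation(frame_energy(x, win_len=32))     # 4 ms: shorter than a period
print(f"narrowband energy fluctuation: {nb:.3f}")
print(f"wideband energy fluctuation:   {wb:.3f}")
```

The long window always covers several pitch periods, so its energy is nearly constant from frame to frame; the short window alternately lands on a pulse and between pulses, which is what appears as striations every pitch period.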

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: normalized-magnitude waveform and spectrogram of the utterance "She had your dark suit in greasy wash water all year", annotated with phone-level and word-level labels. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: normalized-magnitude waveform and spectrogram of the utterance "She had your dark suit in greasy wash water all year", annotated with phone-level and word-level labels. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
  • The place of the tongue hump along the oral tract, and
  • The degree of constriction of the hump.
  • The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

When the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

When the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of:
  • Vowels
  • Consonants
  • Diphthongs
  • Affricates
  • Semi-vowels

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features.

Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language

In English, the combinations of features give 40 phonemes. Other languages can have a smaller or larger number: 11 in Polynesian; 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by:
  • Adjacent phonemes
  • Speaking rate
  • Emphasis in speaking
  • The time-varying nature of the articulators

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic.
  • Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences between speakers are one cause of allophonic variations:
  • The place and degree of constriction of the tongue hump, and
  • The cross-section and length of the vocal tract.

Therefore, the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System:
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
  • Is dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
  • Unvoiced fricatives: the vocal folds are relaxed and not vibrating.
  • Voiced fricatives: the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction made by the tongue or lips determines which sound is produced:
  • Back, center, or front of the oral tract, as well as
  • The teeth or lips.

Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n]*p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n]*(g[n]*p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is the turbulent airflow velocity at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n]*(q[n]\,u[n])$$

Example 3.4

In the simplified model, we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n]*u[n] + h_f[n]*(q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  • u[n] is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
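The simplified model of Example 3.4 can be sketched numerically as follows. Everything concrete here is an assumption made for illustration: the Hann-shaped glottal pulse, the decaying-cosine impulse responses standing in for h[n] and h_f[n], the 8 kHz sampling rate, and the 10-ms pitch period. The helper `resonance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed for repeatability
fs, P, N = 8000, 80, 1600                # assumed rate, pitch period, length

p = np.zeros(N)
p[::P] = 1.0                             # impulse train p[n]
g = np.hanning(40)                       # toy glottal pulse g[n] (assumed shape)
u = np.convolve(p, g)[:N]                # u[n] = g[n] * p[n]

def resonance(freq_hz, decay, length=64):
    """Toy impulse response: an exponentially decaying cosine."""
    k = np.arange(length)
    return decay ** k * np.cos(2 * np.pi * freq_hz * k / fs)

h = resonance(500.0, 0.97)               # stands in for the vocal tract h[n]
hf = resonance(3000.0, 0.80)             # stands in for the front cavity h_f[n]

q = rng.standard_normal(N)               # white noise q[n] at the constriction
xg = np.convolve(h, u)[:N]               # x_g[n] = h[n] * u[n]
xq = np.convolve(hf, q * u)[:N]          # x_q[n] = h_f[n] * (q[n] u[n])
x = xg + xq                              # x[n] = x_g[n] + x_q[n]

open_phase = np.sum((q * u)[:40] ** 2)   # noise excitation while flow is non-zero
closed_phase = np.sum((q * u)[40:80] ** 2)
print(open_phase, closed_phase)
```

Because q[n] is multiplied by u[n], the noise excitation vanishes wherever the glottal flow is zero, so the noise component is pitch-synchronous rather than stationary.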

Elements of a Language: Fricatives

Spectrogram:
  • Unvoiced fricatives are characterized by a "noisy" spectrum, while
  • Voiced fricatives often show both noise and harmonics.

Waveform:
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.

As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

$$u[n] = g[n]*p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n,m], while the burst excites a time-varying front cavity beyond a constriction with impulse response $h_f[n,m]$.

h[n,m] and $h_f[n,m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n-m.

The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$
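The generalized convolution sum can be evaluated directly. The sketch below is a minimal illustration under stated assumptions: the response is taken to be causal with finite length M, the function name `time_varying_filter` is hypothetical, and the exponentially decaying responses are toy stand-ins rather than a vocal tract model. When every row of h is identical the result should reduce to ordinary convolution, which gives a simple sanity check.

```python
import numpy as np

def time_varying_filter(u, h):
    """Evaluate x[n] = sum_m h[n, m] * u[n - m] for a causal, finite response.

    h has shape (len(u), M): row n is the impulse response seen at time n,
    and column m weights the input applied m samples earlier.
    """
    N, M = h.shape
    x = np.zeros(N)
    for n in range(N):
        for m in range(min(M, n + 1)):   # u[n - m] is taken as 0 for n - m < 0
            x[n] += h[n, m] * u[n - m]
    return x

N, M = 200, 16
lags = np.arange(M)

u = np.zeros(N)
u[::50] = 1.0                            # toy periodic source

# Time-invariant check: identical rows must reproduce ordinary convolution.
h_fixed = np.tile(0.8 ** lags, (N, 1))
x_fixed = time_varying_filter(u, h_fixed)

# Time-varying case: the decay rate drifts from 0.5 to 0.9 over the signal,
# loosely mimicking a vocal tract that changes shape after the burst.
decay = np.linspace(0.5, 0.9, N)[:, None]
x_varying = time_varying_filter(u, decay ** lags[None, :])
```

The burst term of the model is the same operation applied to a unit impulse with the front-cavity response, so its output is simply the column-by-column read-out $h_f[n,n]$ for n below M.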

Elements of a Language: Transitional Speech Sounds

Diphthongs:
  • Have a vowel-like nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  • Are characterized by movement from one vowel target to another.

Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds:
  • Glides (w as in "we" and y as in "you"), and
  • Liquids (r as in "read" and l as in "let").

Glides, compared to diphthongs, have:
  • Greater constriction of the oral tract during the transition, and
  • Greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference compared to fricatives is that affricates have:
  • A fricative portion preceded by a complete constriction of the oral cavity,
  • Formed at the same place as for the plosive.

Examples: tS as in "chew" (unvoiced); J as in "just" (voiced).

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe"; "sweep" vs. "seep".
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

April 22 2023 Veton Keumlpuska 100

Prosody The Melody of Speech Prosody of a language is defined by the

rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)

These rules are followed to convey different Meaning Stress and Emotion

April 22 2023 Veton Keumlpuska 101

Figure 329 - Prosody

April 22 2023 Veton Keumlpuska 102

Figure 330 ndash Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of “Shop”
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping: Nose
  • Spectral Shaping: Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation: Plosives, “Drop”
  • Fricatives
  • Source Generation (2)
  • Source Generation: Fricatives, “NASA”
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language: Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language: Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language: Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language: Fricatives (2)
  • Elements of a Language: Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language: Plosives
  • Elements of a Language: Plosives (2)
  • Elements of a Language: Plosives (3)
  • Elements of a Language: Plosives (4)
  • Elements of a Language: Transitional Speech Sounds
  • Elements of a Language: Transitional Speech Sounds (2)
  • Figure 3.28 – Liquids & Glides
  • Elements of a Language: Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody: The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 – Global Coarticulation
Page 34: Speech Processing

Example 3.2

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by
  • The nature of the glottal flow waveform over a cycle, e.g., a gradual or abrupt closing, and by
  • The manner in which formant tails add

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.

Spectral Shaping
The previous example is important because it illustrates the difference between a formant (resonance frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise due to the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can perhaps be a detriment to formant estimation.

On the other hand, the spectral sampling by harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3
A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency ($F_1$) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
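The formant-tuning effect in Example 3.3 can be sketched numerically. The following is a minimal illustration, not taken from the slides: it models a single formant as a two-pole digital resonator and compares the gain at the sung fundamental before and after the first formant is moved onto it. The sampling rate, the 800 Hz fundamental, the 400 Hz untuned formant, and the 80 Hz bandwidth are all assumed values chosen only for illustration.

```python
import numpy as np

def resonator_gain(freq_hz, formant_hz, bandwidth_hz, fs=16000.0):
    """Magnitude response of a two-pole digital resonator, evaluated at freq_hz."""
    r = np.exp(-np.pi * bandwidth_hz / fs)      # pole radius set by the bandwidth
    theta = 2.0 * np.pi * formant_hz / fs       # pole angle set by the formant frequency
    z = np.exp(1j * 2.0 * np.pi * freq_hz / fs) # point on the unit circle
    denom = (1.0 - r * np.exp(1j * theta) / z) * (1.0 - r * np.exp(-1j * theta) / z)
    return 1.0 / np.abs(denom)

f0 = 800.0  # assumed soprano fundamental (first harmonic), in Hz

# Untuned: the first formant (assumed 400 Hz) lies well below the fundamental.
gain_untuned = resonator_gain(f0, formant_hz=400.0, bandwidth_hz=80.0)

# Tuned: the first formant is raised to coincide with the fundamental.
gain_tuned = resonator_gain(f0, formant_hz=800.0, bandwidth_hz=80.0)

print(f"gain at f0, untuned formant: {gain_untuned:.1f}")
print(f"gain at f0, tuned formant:   {gain_tuned:.1f}")
```

Under these assumed values the tuned configuration gives a clearly larger gain at the fundamental, which corresponds to the louder sound described in the example.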

Figure 3.12

Nasal Sounds

Spectral Shaping
Nasal and oral components of the vocal tract are coupled by the velum.
  • When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: “nose” and “mouse”.

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). There are two dominant effects that characterize nasalization:
  • Broadening of the formant bandwidth of the oral tract because of loss of energy through the nasal passage
  • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage

Plosives

Source Generation
The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  • Build-up of pressure behind the palate, and
  • Abrupt release of pressure

Source Generation: Plosives, “Drop”
[Figure: the word “drop”, with the intervals labeled VOT and Aspiration marked]

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, “NASA”

Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source
  • Voiced: speech sounds generated with a periodic glottal source
  • Unvoiced: speech sounds not generated with a periodic glottal source

There are a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”
  • Plosives: created with an impulsive source within the oral tract. Example: “top”
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”

However, these sound sources are not exclusive to unvoiced sounds. That is, the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  • “zebra” vs. “sheba”: fricatives
  • “bin” vs. “pin”: plosives

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word “to”. A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, a sliding (analysis) window concept was introduced.
  • This window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example: the Hamming window
$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n \le N_w - 1$$
  • The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
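The sliding-window analysis above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `hamming` and `stft`, the 8 kHz sampling rate, the 20-ms window, the 10-ms frame interval, and the 1 kHz test tone are all hypothetical choices. For simplicity each frame is indexed from its first sample rather than from the window center.

```python
import numpy as np

def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2 pi n / (n_w - 1))."""
    n = np.arange(n_w)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (n_w - 1))

def stft(x, n_w, hop, n_fft):
    """Short-time Fourier transform: one FFT per windowed frame.

    Frames start every `hop` samples (the frame interval); returns an array
    of shape (number of frames, n_fft // 2 + 1).
    """
    w = hamming(n_w)
    starts = range(0, len(x) - n_w + 1, hop)
    frames = np.array([w * x[s:s + n_w] for s in starts])
    return np.fft.rfft(frames, n=n_fft, axis=1)

fs = 8000                                  # assumed sampling rate in Hz
t = np.arange(fs) / fs                     # one second of samples
x = np.sin(2.0 * np.pi * 1000.0 * t)       # assumed test signal: 1 kHz sine

X = stft(x, n_w=160, hop=80, n_fft=256)    # 20-ms window, 10-ms frame interval
S = np.abs(X) ** 2                         # squared magnitude per frame
peak_bin = int(np.argmax(S.mean(axis=0)))  # strongest frequency bin on average
print(X.shape, peak_bin * fs / 256)
```

The frame interval `hop` plays the role of the frame rate discussed above: a smaller value gives more frames per second at higher computational cost.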

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$
  • $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the “energy density” of the signal.
  • For each window position $\tau$ one could plot $S(\omega, \tau)$.
  • A better and more compact time-frequency display of the spectrogram places spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. This figure also illustrates two kinds of spectrograms:
  • Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$$
$$x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$$
where the glottal waveform over a cycle and the vocal tract impulse response were combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
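The harmonic structure in this expression can be checked numerically. The sketch below is illustrative only and makes several assumptions: it uses a flat envelope (the combined impulse response is taken to be a unit sample), a Hamming window, an 8 kHz sampling rate, and a pitch period of 80 samples. The spectrum of the windowed impulse train should then show copies of the window transform centered at multiples of the fundamental frequency.

```python
import numpy as np

fs = 8000            # assumed sampling rate in Hz
P = 80               # assumed pitch period in samples (100 Hz fundamental)
n_w = 400            # assumed window length: five pitch periods

p = np.zeros(n_w)
p[::P] = 1.0                     # impulse train p[n] inside the window
x = np.hamming(n_w) * p          # windowed segment w[n] p[n], flat envelope assumed

# Zero-pad to fs points so that FFT bin k corresponds to k Hz.
magnitude = np.abs(np.fft.rfft(x, n=fs))

at_harmonic = magnitude[100]     # 100 Hz = fs / P, the first harmonic
between = magnitude[150]         # halfway between the first and second harmonics
print(f"|X| at 100 Hz: {at_harmonic:.3f}, at 150 Hz: {between:.3f}")
```

With a non-flat envelope, each of these harmonic peaks would additionally be scaled by the envelope value at that harmonic, which is the "sampling" of the spectral envelope described in Example 3.2.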

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
  • Uses a “long” window with a duration of typically at least two pitch periods.
  • Under the conditions that
      - the main lobes of shifted window Fourier transforms are non-overlapping, and
      - the corresponding transform side lobes are negligible,
    the following approximation follows from the equation on the previous slide (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  • Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened Fourier transform causes the window transforms at neighboring harmonics to overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  • The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
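The trade-off between the two window lengths can be made concrete with a small calculation. The sketch below is a rough illustration, not part of the original slides: it uses the common textbook approximation that the full main-lobe width of a Hamming window of duration T is about 4/T Hz, and it assumes a 200 Hz pitch. The function names are hypothetical.

```python
def hamming_mainlobe_hz(duration_ms):
    """Approximate full main-lobe width (Hz) of a Hamming window (about 4 / T)."""
    return 4000.0 / duration_ms

def resolves_harmonics(duration_ms, f0_hz):
    """True if main lobes centered on adjacent harmonics do not overlap."""
    return hamming_mainlobe_hz(duration_ms) <= f0_hz

f0 = 200.0  # assumed fundamental frequency in Hz (pitch period of 5 ms)

narrowband = resolves_harmonics(20.0, f0)  # 20-ms window spans four pitch periods
wideband = resolves_harmonics(4.0, f0)     # 4-ms window is shorter than one period

print("20 ms:", hamming_mainlobe_hz(20.0), "Hz main lobe, resolves harmonics:", narrowband)
print(" 4 ms:", hamming_mainlobe_hz(4.0), "Hz main lobe, resolves harmonics:", wideband)
```

For a lower-pitched voice the same 20-ms window covers fewer pitch periods, so the harmonics are less cleanly separated; this is why the window length is usually chosen relative to the expected pitch range.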

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram; horizontal axis Time [s] from 0 to 3, vertical axis Frequency [Hz] from 0 to 8000]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above a spectrogram (Frequency [Hz] from 0 to 8000 vs. Time [s] from 0 to 3) of the utterance “She had your dark suit in greasy wash water all year”, annotated with phone-level and word-level labels]

MATLAB
[Figure: spectrogram; horizontal axis Time [s] from 0 to 3, vertical axis Frequency [Hz] from 0 to 8000]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above a spectrogram (Frequency [Hz] from 0 to 8000 vs. Time [s] from 0 to 3) of the utterance “She had your dark suit in greasy wash water all year”, annotated with phone-level and word-level labels]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three
  2. The shape of the vocal tract (place and manner of articulation):
      - the place of the tongue hump along the oral tract,
      - the degree of the constriction of the hump, and
      - a possible connection to the nasal passage by way of the velum
  3. The time-domain waveform, which gives the pressure change with time at the lips output
  4. The time-varying spectral characteristics revealed through the spectrogram

Elements of a Language
Phoneme – a fundamental distinctive unit of a language
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
      - Syllables contain one or more phonemes.
      - Words are formed from one or more syllables.
      - Phrases are concatenations of words.

If the first two factors (the source and the vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two descriptors (the waveform and the spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is done in terms of
  • Vowels,
  • Consonants,
  • Diphthongs,
  • Affricates, and
  • Semi-vowels

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the “click” language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words.
  • In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by
  • Adjacent phonemes,
  • Speaking rate,
  • Emphasis in speaking, and
  • The time-varying nature of the articulators

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  • Example: “butter”, “but”, and “to”, where the /t/ in each word is somewhat different

Motor theory of perception – uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

Elements of a Language: Vowels
  • Source: quasi-periodic
      - Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  • Articulatory differences among speakers are one cause of allophonic variations:
      - the place and degree of constriction of the tongue hump, and
      - the cross-section and length of the vocal tract.
  • Therefore the vocal tract formants will vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
  • Source: quasi-periodic airflow puffs from the vibrating vocal folds
  • System:
      - The velum is lowered and the air flows mainly through the nasal cavity.
      - Because the oral tract is constricted, the sound is radiated at the nostrils.
      - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      - It is dominated by the low resonance of the large volume of the nasal cavity.
      - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
      - These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      - Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
  • Source:
      - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      - Noise is generated by turbulent airflow at some point of constriction along the oral tract.
      - The constriction is narrower than with vowels.
  • System: the location of the constriction by the tongue or lips determines which sound is produced:
      - the back, center, or front of the oral tract, as well as
      - the teeth or lips.
  • Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
  • $g[n]$ is the glottal flow over one cycle
  • $p[n]$ is an impulse train with pitch period $P$

A simplified model of the periodic component of the voiced fricative output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
  • $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$

The noise source component models the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity that has impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

Example 3.4 (cont.)
We assume in the simplified model that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity
  • $x_q[n]$ can be influenced by the back cavity
  • Sources of non-linear effects (distributed sources due to traveling vortices)
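The simplified voiced-fricative model of Example 3.4 can be simulated directly. The following is a minimal sketch under stated assumptions rather than a calibrated synthesizer: the glottal pulse shape (a Hann-shaped bump), both impulse responses (single decaying resonances), the 8 kHz sampling rate, and the 80-sample pitch period are placeholder choices made only to show how the terms combine.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
fs = 8000                        # assumed sampling rate in Hz
P = 80                           # assumed pitch period in samples (100 Hz)
n_samples = 1600                 # 0.2 s of signal

# Periodic glottal flow u[n] = g[n] * p[n]
p = np.zeros(n_samples)
p[::P] = 1.0                     # impulse train with period P
g = np.hanning(40)               # placeholder glottal pulse over one cycle
u = np.convolve(p, g)[:n_samples]

# Placeholder impulse responses: one decaying resonance each
k = np.arange(200)
h = 0.97 ** k * np.cos(2.0 * np.pi * 500.0 * k / fs)       # whole vocal tract
kf = np.arange(50)
h_f = 0.7 ** kf * np.cos(2.0 * np.pi * 3000.0 * kf / fs)   # front oral cavity

q = rng.standard_normal(n_samples)                         # white noise source

x_g = np.convolve(u, h)[:n_samples]          # periodic component  h * u
x_q = np.convolve(q * u, h_f)[:n_samples]    # noise component     h_f * (q u)
x = x_g + x_q                                # the two outputs are assumed to add
```

Because the noise is multiplied by the glottal flow before filtering, the noise component is present only while the glottis is open, which gives the pitch-synchronous noise bursts that the model is meant to capture.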

Elements of a Language: Fricatives
  • Spectrogram:
      - Unvoiced fricatives are characterized by a “noisy” spectrum, while
      - Voiced fricatives often show both noise and harmonics.
  • Waveform:
      - An unvoiced fricative contains only noise.
      - A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow.
  • As with fricatives, plosives can be voiced or unvoiced.
  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure
  2. Release of air pressure and generation of turbulence over a very short time duration
  3. Generation of aspiration due to turbulence at the open vocal folds
  4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
  • $g[n]$ is the glottal flow over one cycle
  • $p[n]$ is an impulse train with pitch period $P$

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
  • This implies that the vocal tract output cannot be obtained by the convolution operator.
  • The vocal tract output thus must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.
  • $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
We have assumed that the two outputs can be linearly combined.
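The generalized convolution in Example 3.5 can be written out as a short loop. This is a sketch under assumptions, not the author's code: the function name `time_varying_response` is hypothetical, the impulse responses are assumed causal and truncated to 20 taps, the vocal tract response is simply interpolated between two made-up exponential decays to mimic a changing tract shape, and the periodic source is reduced to a bare impulse train.

```python
import numpy as np

def time_varying_response(h_nm, x):
    """Compute y[n] = sum_m h[n, m] x[n - m] for a causal h of shape (N, M).

    h_nm[n, m] is the response at time n to a unit sample applied m samples earlier.
    """
    n_samples, n_taps = h_nm.shape
    y = np.zeros(n_samples)
    for n in range(n_samples):
        for m in range(min(n_taps, n + 1)):   # only past and present inputs
            y[n] += h_nm[n, m] * x[n - m]
    return y

n_samples, n_taps = 200, 20
m = np.arange(n_taps)

# Assumed vocal tract response h[n, m]: drifts from a fast decay to a slow decay.
alpha = np.linspace(0.0, 1.0, n_samples)[:, None]
h = (1.0 - alpha) * 0.6 ** m + alpha * 0.9 ** m

# Assumed front-cavity response h_f[n, m]: kept constant in time for simplicity.
h_f = np.tile(0.5 ** m, (n_samples, 1))

u = np.zeros(n_samples)
u[::40] = 1.0                    # periodic source, here a bare impulse train
burst = np.zeros(n_samples)
burst[0] = 1.0                   # burst idealized as a unit sample at n = 0

x = time_varying_response(h, u) + time_varying_response(h_f, burst)
```

When the impulse response does not change with n, the loop reduces to ordinary convolution, which gives a simple sanity check for the implementation.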

Elements of a Language: Transitional Speech Sounds
Diphthongs
  • Vowel-like nature with vibrating vocal folds
  • Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations
  • Characterized by movement from one vowel target to another
  • Examples: /Y/ as in “hide”, /W/ as in “out”, /O/ as in “boy”, /JU/ as in “new”

Elements of a Language: Transitional Speech Sounds
Semi-Vowels: two categories of vowel-like sounds
  • Glides (/w/ as in “we” and /y/ as in “you”), and
  • Liquids (/r/ as in “read” and /l/ as in “let”)

Glides, compared to diphthongs, have
  • Greater constriction of the oral tract during the transition, and
  • Greater speed of the oral tract movement

Figure 3.28 – Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference as compared to fricatives is that the affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: /tS/ as in “chew” (unvoiced), /J/ as in “just” (voiced)

Coarticulation
  • Vocal fold/vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  • Local – the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time (“horse” vs. “horseshoe”, “sweep” vs. “seep”).
  • Global – articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/Energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different
  • Meaning,
  • Stress, and
  • Emotion

Figure 3.29 - Prosody

Figure 3.30 – Global Coarticulation

Page 35: Speech Processing

April 22, 2023 Veton Këpuska 35

Example 3.2

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
  • The nature of the glottal flow waveform over a cycle (e.g., a gradual or abrupt closing), and
  • The manner in which formant tails add.

Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.
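The sparse sampling of the envelope can be made concrete by counting how many harmonics fall below a given frequency. The following is a minimal Python sketch; the 4 kHz limit and the three pitch values are illustrative assumptions, not values taken from the slides.

```python
def harmonic_frequencies(f0_hz, max_hz):
    """Return the harmonic frequencies k*f0 (k >= 1) that do not exceed max_hz."""
    count = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


# Assumed pitch values: a low, a medium, and a high-pitched voice.
for f0 in (100.0, 200.0, 400.0):
    harmonics = harmonic_frequencies(f0, 4000.0)
    # Each harmonic is one "sample" of the spectral envelope |H(w)G(w)|.
    print(f"f0 = {f0:5.0f} Hz -> {len(harmonics)} envelope samples below 4 kHz")
```

The higher the pitch, the fewer points at which the envelope is observed, which is why formant locations become harder to read for high-pitched speech.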

Spectral Shaping

The previous example is important because it illustrates the difference between a formant (a resonant frequency of the vocal tract) and a harmonic frequency:
  • A formant corresponds to a vocal tract pole (resonant frequency).
  • Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation.

On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency $\omega_1$) is much higher than the first formant frequency ($F_1$) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments.

To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.
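The effect of moving the first formant onto the first harmonic can be sketched numerically. The Python fragment below models a single formant as a two-pole digital resonator, which is an assumption made here for illustration; the sampling rate, bandwidth, pitch, and formant values are likewise hypothetical.

```python
import cmath
import math


def resonator_gain(freq_hz, formant_hz, bandwidth_hz, fs_hz):
    """Magnitude response of a two-pole resonator, evaluated at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)   # pole radius set by the bandwidth
    theta = 2.0 * math.pi * formant_hz / fs_hz      # pole angle set by the formant
    z_inv = cmath.exp(-1j * 2.0 * math.pi * freq_hz / fs_hz)
    denom = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r * r) * z_inv * z_inv
    return 1.0 / abs(denom)


fs = 16000.0      # assumed sampling rate
f0 = 800.0        # assumed soprano fundamental (first harmonic)
low_f1 = resonator_gain(f0, 400.0, 80.0, fs)    # F1 well below the first harmonic
tuned_f1 = resonator_gain(f0, 800.0, 80.0, fs)  # F1 raised to match the first harmonic
print(f"gain at f0 with F1 = 400 Hz: {low_f1:.1f}")
print(f"gain at f0 with F1 = 800 Hz: {tuned_f1:.1f}")
```

With these assumed numbers, the gain seen by the first harmonic is larger when the formant is tuned to it, which matches the qualitative argument of the example.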

Figure 3.12

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:
  • Broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage.
  • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  • Build-up of pressure behind the palate, and
  • Abrupt release of pressure.

Source Generation: Plosives, "Drop"

[Figure: the word "drop", annotated with VOT (voice onset time) and aspiration]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise spectrum was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source.

There is a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  • Plosives: sounds created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".

However, these subcategories do not relate exclusively to unvoiced sources. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  • "zebra" vs. "sheba": fricatives
  • "bin" vs. "pin": plosives

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
  • The window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example, the Hamming window: $w[n, \tau] = 0.54 - 0.46\cos\left[\frac{2\pi(n-\tau)}{N_w-1}\right]$ for $0 \le n - \tau \le N_w - 1$.
  • The window typically does not move one sample at a time but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window position at time $\tau$.
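The Hamming window and the windowed segment can be written out directly. This is a minimal standard-library Python sketch; the function names and the convention that the window starts at sample tau are choices made here for illustration.

```python
import math


def hamming(num_samples):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(Nw - 1)) for 0 <= n <= Nw - 1."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def windowed_segment(x, tau, window):
    """Return the samples of x under the window placed at sample tau, tapered by it."""
    return [window[m] * x[tau + m] for m in range(len(window))]


w = hamming(5)
print([round(v, 2) for v in w])                     # tapered at both ends, 1.0 in the middle
print(windowed_segment([1.0] * 10, tau=3, window=w))
```

Sliding tau in steps of the frame interval, rather than one sample at a time, gives the sequence of segments that the short-time Fourier transform operates on.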

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

  • $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position $\tau$, one could plot $S(\omega, \tau)$.
  • A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
  • Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
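A spectrogram in the sense defined here, the squared STFT magnitude at each window position, can be sketched with a direct DFT per frame. The Python code below uses only the standard library and is meant as an illustration under stated assumptions: the test signal, window length, and hop size are hypothetical, and a practical implementation would use an FFT routine instead of the direct sum.

```python
import cmath
import math


def hamming(num_samples):
    """Hamming window of length num_samples (num_samples >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def spectrogram(x, window_len, hop):
    """Return one list per frame holding |X|^2 at DFT bins 0 .. window_len // 2."""
    w = hamming(window_len)
    frames = []
    for tau in range(0, len(x) - window_len + 1, hop):
        seg = [w[m] * x[tau + m] for m in range(window_len)]
        frame = []
        for k in range(window_len // 2 + 1):
            # The phase is referenced to the segment start; |X|^2 is unaffected by this.
            acc = sum(seg[m] * cmath.exp(-2j * math.pi * k * m / window_len)
                      for m in range(window_len))
            frame.append(abs(acc) ** 2)
        frames.append(frame)
    return frames


# Assumed test signal: a sinusoid centered exactly on DFT bin 8 of a 64-point frame.
signal = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
frames = spectrogram(signal, window_len=64, hop=32)
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in frames]
print(len(frames), peak_bins)
```

Choosing a longer `window_len` narrows each spectral peak (narrowband behavior), while a shorter one localizes events in time (wideband behavior).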

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$$

$$x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau)\right|^2$$

where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide yields the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
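The energy under the sliding window is simple to compute, and doing so on a periodic pulse train shows why a short window produces a fluctuation once per pitch period. The Python sketch below is illustrative only: the rectangular window, the period of 8 samples, and the hop of 1 sample are assumptions chosen to keep the numbers easy to check.

```python
def short_time_energy(x, window, hop):
    """Energy of the windowed segment at each window position tau (steps of hop)."""
    energies = []
    for tau in range(0, len(x) - len(window) + 1, hop):
        energies.append(sum((window[m] * x[tau + m]) ** 2
                            for m in range(len(window))))
    return energies


# Assumed excitation: unit pulses every 8 samples (a stand-in for pitch pulses).
pulses = [1 if n % 8 == 0 else 0 for n in range(32)]
# A window of 4 samples is shorter than the 8-sample period.
energy = short_time_energy(pulses, window=[1, 1, 1, 1], hop=1)
print(energy)
```

The energy alternates between 1 (window covers a pulse) and 0 (window sits between pulses), which is the mechanism behind the vertical striations of the wideband spectrogram.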

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  • The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
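The two window durations can be related to the pitch period using the rules of thumb stated on these slides (at least two pitch periods for narrowband, less than one for wideband). The Python sketch below assumes a 100 Hz pitch and a 16 kHz sampling rate purely for illustration; neither value is given in the slides.

```python
def window_regime(window_ms, f0_hz):
    """Classify a window duration relative to the pitch period 1/f0."""
    pitch_period_ms = 1000.0 / f0_hz
    if window_ms >= 2.0 * pitch_period_ms:
        return "narrowband"      # long enough to resolve individual harmonics
    if window_ms < pitch_period_ms:
        return "wideband"        # short enough to resolve individual pitch periods
    return "intermediate"


def window_samples(window_ms, fs_hz):
    """Window length in samples for a duration in milliseconds."""
    return int(round(window_ms * fs_hz / 1000.0))


for duration_ms in (20.0, 4.0):
    print(duration_ms, "ms ->", window_samples(duration_ms, 16000.0), "samples,",
          window_regime(duration_ms, f0_hz=100.0))
```

For a much higher pitch the same 4-ms window may no longer be shorter than a pitch period, so the classification depends on the speaker as well as on the window.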

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) for the sentence "She had your dark suit in greasy wash water all year". Word and phone labels: h; she (sh iy); had (hv ae dcl jh); your (ax-h); dark (dcl d aa r kcl k); suit (s ux tcl); in (ax-h n); greasy (gcl g r iy s ix); wash (w aa sh); water (w ao dx axr); all (ao l); year (y ih axr); h.]

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform and spectrogram of the sentence "She had your dark suit in greasy wash water all year", with the same word and phone labels as above.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
  • The place of the tongue hump along the oral tract, and
  • The degree of constriction of the hump.
  • The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two factors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of:
  • Vowels
  • Consonants
  • Diphthongs
  • Affricates, and
  • Semi-vowels

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the "click" language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by:
  • Adjacent phonemes
  • Speaking rate
  • Emphasis in speaking, and
  • The time-varying nature of the articulators

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source
  • Quasi-periodic.
  • Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.

System
  • Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram
  • The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform
  • Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations:
  • The place and degree of constriction of the tongue hump, and
  • The cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source
  • Quasi-periodic airflow puffs from the vibrating vocal folds.

System
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram
  • Is dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source
  • The vocal folds are relaxed and not vibrating for unvoiced fricatives.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  • The constriction is narrower than with vowels.

System
  • The location of the constriction by the tongue (at the back, center, or front of the oral tract), as well as at the teeth or lips, determines which sound is produced.

Spectrogram
  • Noise-like.
  • Energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Voiced fricative, simplified model of the output at the lips due to the periodic signal $u[n]$:

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract.

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4

We assume in this simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
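The simplified voiced-fricative model can be simulated in a few lines. The Python sketch below follows the two-branch structure of the example (periodic branch plus glottal-flow-modulated noise branch); the particular glottal pulse, impulse responses, pitch period, and Gaussian noise generator are hypothetical choices made only to have something runnable.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, hf, period, length, seed=0):
    """x[n] = h*u + hf*(q.u), with u = g*p and q white Gaussian noise (assumed)."""
    rng = random.Random(seed)
    u = convolve(g, impulse_train(length, period))[:length]   # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]          # turbulence noise
    modulated = [qn * un for qn, un in zip(q, u)]             # noise gated by glottal flow
    xg = convolve(h, u)[:length]                              # periodic component
    xq = convolve(hf, modulated)[:length]                     # noise component
    return [a + b for a, b in zip(xg, xq)]


# Hypothetical pulse shape and impulse responses, for illustration only.
x = voiced_fricative(g=[0.2, 0.6, 1.0, 0.3], h=[1.0, 0.5, 0.25],
                     hf=[1.0, -0.8], period=16, length=64)
print(len(x), [round(v, 3) for v in x[:6]])
```

Because the noise is multiplied by $u[n]$ before filtering, the noise component is present only while glottal flow is non-zero, which is the pitch-synchronous modulation the model describes.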

Elements of a Language: Fricatives

Spectrogram
  • Unvoiced fricatives are characterized by a "noisy" spectrum, while
  • Voiced fricatives often show both noise and harmonics.

Waveform
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper
  • Forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System
  • The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
  • This implies that the vocal tract output cannot be obtained by the convolution operator.
  • The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
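The generalized (time-varying) convolution can be evaluated directly from its definition. In the Python sketch below, the impulse responses are passed as functions of (n, m); this interface, the causal truncation to `max_lag` samples, and the example responses are all assumptions made for illustration.

```python
def time_varying_output(u, burst, h, hf, max_lag):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) burst[n - m].

    h and hf are callables returning the response at time n to a unit sample
    applied m samples earlier. Responses are assumed causal and zero beyond
    max_lag; u and burst must have the same length.
    """
    x = []
    for n in range(len(u)):
        total = 0.0
        for m in range(min(n, max_lag) + 1):
            total += h(n, m) * u[n - m] + hf(n, m) * burst[n - m]
        x.append(total)
    return x


# Hypothetical responses: a decay rate that changes with time n for the tract,
# and a fixed short response for the front cavity.
def tract(n, m):
    return (0.5 + 0.02 * min(n, 10)) ** m


def front_cavity(n, m):
    return 1.0 if m == 0 else (-0.5 if m == 1 else 0.0)


u = [1.0 if n % 8 == 0 else 0.0 for n in range(24)]   # stand-in for g[n] * p[n]
burst = [1.0] + [0.0] * 23                            # idealized burst at n = 0
print([round(v, 3) for v in time_varying_output(u, burst, tract, front_cavity, 12)[:6]])
```

When `h(n, m)` does not depend on `n`, the first sum reduces to an ordinary convolution, which is a useful sanity check on any implementation.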

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds.
  • Glides (/w/ as in "we" and /y/ as in "you"), and
  • Liquids (/r/ as in "read" and /l/ as in "let").

Glides, compared to diphthongs, have:
  • Greater constriction of the oral tract during the transition, and
  • Greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference compared to fricatives is that affricates have:
  • A fricative portion preceded by a complete constriction of the oral cavity,
  • Formed at the same place as for the plosive.

Examples:
  • /tS/ as in "chew": unvoiced
  • /J/ as in "just": voiced

Coarticulation

The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different:
  • Meaning
  • Stress, and
  • Emotion

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 31
  • Introduction (2)
  • Example of ldquoShoprdquo
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 31
  • Example 31 (2)
  • Figure 37
  • Example 31 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 310
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 32
  • Example 32 (2)
  • Example 32 (3)
  • Example 32 (4)
  • Spectral Shaping (4)
  • Example 33
  • Figure 312
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives ldquoDroprdquo
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives ldquoNASArdquo
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 315
  • Figure 316
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 317
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 318
  • Figure 319
  • Elements of a Language Nasals
  • Figure 320
  • Figure 321
  • Elements of a Language Fricatives
  • Example 34
  • Example 34 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 324 - Fricatives
  • Figure 323
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 328 ndash Liquids amp Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 329 - Prosody
  • Figure 330 ndash Global Coarticulation
Page 36: Speech Processing

April 22 2023 Veton Keumlpuska 36

Spectral Shaping Previous example is important because

It illustrates the difference between Formant (resonance frequency of vocal tract) and Harmonic frequency

A formant corresponds to the vocal tract pole (resonant frequency)

Harmonics arise due to the periodicity of glottal source (pitch)

In developing signal processing algorithms that require formants the scarcity of spectral information can perhaps be detriment to formant estimation

On the other hand the spectral sampling harmonics can be exploited to enhance perception of sound (as in singing voice)

April 22 2023 Veton Keumlpuska 37

Example 33 A soprano singer often signs a tone whose first

harmonic (fundamental frequency) (1) much higher than the first formant frequency (F1) of the vowel being sung As shown in the next figure (Figure 312) when the nulls of the vocal tract spectrum are sampled at the harmonics the resulting sound is weak especially in the face of competing instruments

To enhance the sound the singer creates a vocal tract configuration with a widened jaw which increases the first formant frequency (Exercise 34) and can match the frequency of the first harmonic thus generating a louder sound

April 22 2023 Veton Keumlpuska 38

Figure 312

Nasal Sounds

April 22 2023 Veton Keumlpuska 40

Spectral Shaping Nasal and oral components of the vocal tract are coupled

by the velum When the vocal tract velum is lowered ndash introducing

an opening into the nasal passage and Oral tract is shut off by the tongue or lipsSound propagates through the nasal passage and out

through the nose

The resulting sounds have a spectrum that is dominated by low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds Examples ldquonoserdquo and ldquomouserdquo

April 22 2023 Veton Keumlpuska 41

Spectral Shaping Nose

April 22 2023 Veton Keumlpuska 42

Spectral Shaping Mouse

April 22 2023 Veton Keumlpuska 43

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:
1. Broadening of the formant bandwidths of the oral tract, because of the loss of energy through the nasal passage.
2. Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation
The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive): pressure builds up behind the closure and is then abruptly released.

Source Generation: Plosives, "Drop"
[Figure: the word "drop", with the voice onset time (VOT) and the aspiration marked]

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (without completely blocking the airflow), which generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal source, spectral shaping also occurs for either of these inputs (i.e., the impulse or the noise source). These inputs have no harmonic structure, so the source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation
There is another class of source generated within the vocal tract, but it is less well understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, the teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source
Voiced: speech sounds generated with a periodic glottal source.
Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives: created with an impulsive source within the oral tract. Example: "top".
  Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, noisy and impulsive sources are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with an impulsive or noisy source, so the fricative and plosive subcategories also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Consider the word "to": a single Fourier transform of the entire acoustic signal of the word cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window. This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: the Hamming window,
$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n \le N_w - 1$$

The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The discrete-time short-time Fourier transform is
$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] \, e^{-j\omega n}$$
where
$$x[n, \tau] = w[n, \tau] \, x[n]$$
represents the windowed speech segments as a function of the window center at time $\tau$.
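The windowing and short-time transform described on this slide can be sketched in a few lines. This is a minimal illustration rather than the method used to produce the slides' figures: the function names, the toy sinusoid, the window length, and the frame interval are all assumptions, and each frame is transformed with time measured from the start of the frame, which changes only the phase and not the magnitude.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46*cos(2*pi*n/(n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft_frame(x, start, window, n_freq):
    """DFT of the windowed segment x[start : start + len(window)] at the
    n_freq uniformly spaced frequencies omega_k = 2*pi*k/n_freq."""
    seg = [window[i] * x[start + i] for i in range(len(window))]
    return [
        sum(seg[i] * cmath.exp(-2j * math.pi * k * i / n_freq) for i in range(len(seg)))
        for k in range(n_freq)
    ]

def stft(x, window, hop, n_freq):
    """One frame every `hop` samples (the frame interval)."""
    last_start = len(x) - len(window)
    return [stft_frame(x, s, window, n_freq) for s in range(0, last_start + 1, hop)]

# Toy input (assumed): a 1 kHz sinusoid sampled at 8 kHz.
FS = 8000
x = [math.sin(2.0 * math.pi * 1000.0 * n / FS) for n in range(400)]
frames = stft(x, hamming(80), hop=40, n_freq=80)

# Strongest bin in the first frame, searched over 0 .. FS/2.
peak_bin = max(range(40), key=lambda k: abs(frames[0][k]))
```

With an 80-point transform at 8 kHz the bins are 100 Hz apart, so the strongest bin of the first frame corresponds to 1 kHz. The squared magnitude of each frame gives one column of a spectrogram.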

Spectrographic Analysis of Speech
The spectrogram is displayed graphically as
$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.

For each window position $\tau$, one could plot $S(\omega, \tau)$ as a separate curve. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Recall that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$, driven by a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
$$x[n, \tau] = w[n, \tau] \, \{ (p[n] * g[n]) * h[n] \} = w[n, \tau] \, ( p[n] * \hat{h}[n] )$$
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k) \, W(\omega - \omega_k, \tau) \right|^2$$
where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
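As a small worked example of the harmonic frequencies in the expression above, the sketch below converts a pitch period P (in samples) into the harmonic frequencies 2*pi*k/P, in radians per sample and in hertz. The function name, the sampling rate, and the pitch period are assumed values chosen only for illustration.

```python
import math

def harmonic_frequencies(pitch_period, fs_hz, count):
    """Return [(omega_k, f_k)] for k = 1..count, where
    omega_k = 2*pi*k/P (radians per sample) and f_k = k*fs/P (Hz)."""
    return [
        (2.0 * math.pi * k / pitch_period, k * fs_hz / pitch_period)
        for k in range(1, count + 1)
    ]

# Assumed values: 10 kHz sampling rate, pitch period of 100 samples (10 ms).
harmonics = harmonic_frequencies(pitch_period=100, fs_hz=10000.0, count=3)
```

Under these assumptions the fundamental is at 100 Hz and the shifted window transforms in the sum are centered at multiples of 100 Hz.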

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods. Under the conditions that (1) the main lobes of the shifted window Fourier transforms are non-overlapping and (2) the corresponding transform side lobes are negligible, the equation on the previous slide yields the following approximation (Exercise 3.8):
$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2 \, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
The harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a "short" window with a duration of less than one pitch period (see Figure 3.14). Shortening the window widens its Fourier transform (recall the uncertainty principle). This widening causes neighboring window transforms, centered at the harmonics, to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
$$S(\omega, \tau) \approx \kappa \, |\hat{H}(\omega)|^2 \, E[\tau]$$
where $\kappa$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
$$E[\tau] = \sum_{n=-\infty}^{\infty} x[n, \tau]^2$$
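The energy under the sliding window can be computed directly from its definition. The sketch below is illustrative only: it assumes a rectangular window for simplicity, and a toy "voiced" signal built from a decaying resonance that restarts every pitch period. The signal, the window length, and the frame interval are all placeholder choices.

```python
import math

def short_time_energy(x, window, hop):
    """Energy of the windowed segment, sum_i (window[i] * x[tau + i])**2,
    for window start positions tau = 0, hop, 2*hop, ..."""
    n_w = len(window)
    return [
        sum((window[i] * x[tau + i]) ** 2 for i in range(n_w))
        for tau in range(0, len(x) - n_w + 1, hop)
    ]

# Toy "voiced" signal (assumed): a decaying 1 kHz resonance restarted every P samples.
FS, P = 8000, 80                       # 10 ms pitch period at 8 kHz
x = [math.exp(-(n % P) / 10.0) * math.cos(2.0 * math.pi * 1000.0 * (n % P) / FS)
     for n in range(800)]

short_win = [1.0] * 16                 # 2 ms rectangular window, shorter than P
energy = short_time_energy(x, short_win, hop=8)
```

Because the window is shorter than the pitch period, the energy is large when the window sits on the start of a pitch period and small in between, repeating every P samples.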

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
The wideband spectrogram shows the formants of the vocal tract in frequency. It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
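The choice of 20 ms versus 4 ms can be checked with some simple arithmetic. The sketch below relies on a standard approximation that is not stated on the slides: the null-to-null main-lobe width of a Hamming window of duration T seconds is about 4/T Hz. The resolution test used here is a loose one: a harmonic counts as resolved when it falls at or beyond the first null of its neighbor's main lobe, which is the same as requiring the window to span at least two pitch periods. The pitch value is assumed.

```python
def hamming_mainlobe_width_hz(window_ms):
    """Approximate null-to-null main-lobe width of a Hamming window:
    4/T Hz for a window lasting T seconds."""
    return 4.0 / (window_ms * 1e-3)

def resolves_harmonics(window_ms, f0_hz):
    """True when half the main-lobe width is at most the harmonic spacing f0,
    i.e. the window covers at least two pitch periods."""
    return hamming_mainlobe_width_hz(window_ms) / 2.0 <= f0_hz

F0 = 125.0   # assumed pitch in Hz (8 ms pitch period)
narrow = resolves_harmonics(20.0, F0)   # 20 ms window: about 200 Hz main lobe
wide = resolves_harmonics(4.0, F0)      # 4 ms window: about 1000 Hz main lobe
```

For the assumed 125 Hz pitch, the 20-ms window resolves the harmonics while the 4-ms window does not, matching the horizontal and vertical striations described above.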

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB
[Figure: waveform (Normalized Magnitude, -1.5 to 2) above a spectrogram (Frequency [Hz], 0 to 8000), both against Time [s], 0 to 3, for the sentence "She had your dark suit in greasy wash water all year". Word labels with phone labels: h; she (sh iy); had (hv ae dcl jh); your (ax-h); dark (dcl d aa r kcl k); suit (s ux tcl); in (ax-h n); greasy (gcl g r iy s ix); wash (w aa sh); water (w ao dx axr); all (ao l); year (y ih axr); h.]

MATLAB
[Figure: a second spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB
[Figure: a second waveform and spectrogram pair for the sentence "She had your dark suit in greasy wash water all year", with the same axes, word labels, and phone labels as the preceding waveform and spectrogram slide.]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract, described by the place and manner of articulation: the place of the tongue hump along the oral tract and the degree of the constriction of the hump. The shape is also determined by the possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.

Studying speech sounds in terms of the first two descriptors (the source and the vocal tract shape) is referred to as articulatory phonetics.

Studying speech sounds in terms of the last two descriptors (the waveform and the spectrogram) is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.

Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore so do the vocal tract formants.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram: dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.

Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating; for voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. The noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction determines which sound is produced. The constriction can be made by the tongue at the back, center, or front of the oral tract, as well as at the teeth or lips.

Spectrogram: noise-like, with energy concentrated at the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is
$$x_g[n] = h[n] * (g[n] * p[n])$$
where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:
$$x_q[n] = h_f[n] * (q[n] \, u[n])$$

Example 3.4 (cont.)
The simplified model assumes that the results of the two airflow sources add:
$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] \, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
1. $u[n]$ is modified by the oral cavity.
2. $x_q[n]$ can be influenced by the back cavity.
3. Sources of nonlinear effects (distributed sources due to traveling vortices).
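The simplified voiced-fricative model of Example 3.4 can be simulated directly. The sketch below is only an illustration of the equations: the glottal pulse, the two impulse responses, the pitch period, and the noise generator are toy placeholders, not values from the slides or the textbook.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

# All numeric values below are placeholders chosen for illustration.
N, P = 400, 80
g = [0.0, 0.5, 1.0, 0.5]                 # toy glottal pulse g[n]
h = [1.0, 0.6, 0.2]                      # toy vocal tract impulse response h[n]
h_f = [1.0, -0.8]                        # toy front-cavity impulse response h_f[n]

rng = random.Random(0)
u = convolve(g, impulse_train(N, P))[:N]         # u[n] = g[n] * p[n]
q = [rng.gauss(0.0, 1.0) for _ in range(N)]      # white noise q[n]
modulated = [qn * un for qn, un in zip(q, u)]    # q[n] u[n]: noise gated by glottal flow

x_g = convolve(h, u)[:N]                         # periodic component x_g[n]
x_q = convolve(h_f, modulated)[:N]               # noise component x_q[n]
x = [a + b for a, b in zip(x_g, x_q)]            # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by the glottal flow, the noise component is nonzero only while the glottis is open, so in this toy signal the noise bursts recur once per pitch period.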

Elements of a Language: Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants. The turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of the vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have present a periodic source throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
$$u[n] = g[n] * p[n]$$
where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during the transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction, with impulse response denoted by $h_f[n, m]$. Here $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:
$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m] \, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m] \, \delta[n - m]$$
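The generalized convolution of Example 3.5 can be written as a direct double loop. In the sketch below, the two impulse responses are hypothetical: `h_toy` is a one-pole response whose pole drifts over time to mimic a changing vocal tract, and `h_f_toy` is kept time-invariant for simplicity. The truncation at `max_lag` and the toy periodic source are also assumptions made to keep the example short.

```python
def time_varying_output(h, u, n_samples, max_lag):
    """x[n] = sum_m h(n, m) * u[n - m], where h(n, m) is the response at
    time n to a unit sample applied m samples earlier (causal, truncated)."""
    x = []
    for n in range(n_samples):
        acc = 0.0
        for m in range(min(n, max_lag) + 1):
            acc += h(n, m) * u[n - m]
        x.append(acc)
    return x

def h_toy(n, m):
    """Hypothetical vocal tract: one-pole response whose pole moves from 0.5 toward 0.9."""
    pole = 0.5 + 0.4 * min(n, 100) / 100.0
    return pole ** m

def h_f_toy(n, m):
    """Hypothetical front cavity: fixed one-pole response (no dependence on n)."""
    return 0.5 ** m

N = 200
burst = [1.0] + [0.0] * (N - 1)                        # burst idealized as delta[n]
u = [1.0 if n % 50 == 0 else 0.0 for n in range(N)]    # toy periodic source

x_periodic = time_varying_output(h_toy, u, N, max_lag=60)
x_burst = time_varying_output(h_f_toy, burst, N, max_lag=60)
x = [a + b for a, b in zip(x_periodic, x_burst)]       # the two outputs are added
```

When `h(n, m)` does not depend on `n`, as in `h_f_toy`, the loop reduces to ordinary convolution, which is why the burst output here is simply the front-cavity impulse response.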

Elements of a Language: Transitional Speech Sounds
Diphthongs
Have a vowel-like nature, with vibrating vocal folds. They do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations, and are characterized by movement from one vowel target to another.

Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

Elements of a Language: Transitional Speech Sounds
Semi-vowels
There are two categories of these vowel-like sounds: glides (w as in "we" and y as in "you") and liquids (r as in "read" and l as in "let").

Glides, compared to diphthongs, have a greater constriction of the oral tract during the transition and a greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that an affricate has a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).

Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 37: Speech Processing

April 22 2023 Veton Keumlpuska 37

Example 33 A soprano singer often signs a tone whose first

harmonic (fundamental frequency) (1) much higher than the first formant frequency (F1) of the vowel being sung As shown in the next figure (Figure 312) when the nulls of the vocal tract spectrum are sampled at the harmonics the resulting sound is weak especially in the face of competing instruments

To enhance the sound the singer creates a vocal tract configuration with a widened jaw which increases the first formant frequency (Exercise 34) and can match the frequency of the first harmonic thus generating a louder sound

April 22 2023 Veton Keumlpuska 38

Figure 312

Nasal Sounds

April 22 2023 Veton Keumlpuska 40

Spectral Shaping Nasal and oral components of the vocal tract are coupled

by the velum When the vocal tract velum is lowered ndash introducing

an opening into the nasal passage and Oral tract is shut off by the tongue or lipsSound propagates through the nasal passage and out

through the nose

The resulting sounds have a spectrum that is dominated by low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds Examples ldquonoserdquo and ldquomouserdquo

April 22 2023 Veton Keumlpuska 41

Spectral Shaping Nose

April 22 2023 Veton Keumlpuska 42

Spectral Shaping Mouse

April 22 2023 Veton Keumlpuska 43

Spectral Shaping Because the nasal cavity (unlike the oral tract) is

essentially constant characteristics of nasal sounds may be particularly useful in speaker identification

Velum can be lowered even when the vocal tract is open When this coupling occurs the resulting sound is said to be

nasalized (eg nasalized vowel) There are two dominant effects that characterize

nasalization Broadening of the formant bandwidth of oral tract because

of loss of energy through nasal passage Introduction of anti-resonances (ie zeros in the vocal tract

transfer function) due to the absorption of energy at the resonances of the nasal passage

Plosives

April 22 2023 Veton Keumlpuska 45

Source Generation In previous section the effect of vocal tract

shape in the sound production was discussed

In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure

April 22 2023 Veton Keumlpuska 46

Source Generation Plosives ldquoDroprdquo

VOT

Aspiration

Fricatives

April 22 2023 Veton Keumlpuska 48

Source Generation Another sound source is created when the tongue is

very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)

As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of

inputs The source spectrum is shaped at all frequencies by |H()|

Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape

April 22 2023 Veton Keumlpuska 49

Source Generation Fricatives ldquoNASArdquo

April 22 2023 Veton Keumlpuska 50

Source Generation There is another class of the source type that is

generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices

with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract

Vortex can be thought off as a tiny rotational airflow in the oral tract

There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal

source Unvoiced Speech sounds not generated with periodic

glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the

moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral

tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing

the vocal folds but without oscillations Example ldquoherdquo

However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example

ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech Speech waveform consists of a sequence of

different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic

signal of the word ldquotordquo cannot capture this time-varying frequency content

In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window, w[n], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: the Hamming window
  \[ w[n,\tau] = 0.54 - 0.46\cos\left[\frac{2\pi(n-\tau)}{N_w-1}\right] \quad \text{for } 0 \le n-\tau \le N_w-1 \]
The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
  \[ X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n} \]
where
  \[ x[n,\tau] = w[n,\tau]\, x[n] \]
represents the windowed speech segments as a function of the window center at time $\tau$.
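The window and transform definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the lecture: the names `hamming`, `stft_frame`, and `stft`, the 8 kHz sampling rate, the 64-sample window, and the 16-sample frame interval are all chosen here for the example. Each frame is indexed relative to the window start, which differs from the absolute-time sum only by a linear phase factor and leaves the magnitude unchanged.

```python
import cmath
import math


def hamming(nw):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(nw-1)) for 0 <= n <= nw-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


def stft_frame(x, tau, w, n_freq):
    """One STFT frame at frequencies omega_k = 2*pi*k/n_freq.

    The window starts at sample tau (an assumption of this sketch).
    """
    # Windowed segment x[n, tau] = w[n, tau] * x[n]
    seg = [w[i] * x[tau + i] for i in range(len(w))]
    return [
        sum(s * cmath.exp(-2j * math.pi * k * i / n_freq) for i, s in enumerate(seg))
        for k in range(n_freq)
    ]


def stft(x, w, hop, n_freq):
    """Slide the window by `hop` samples (the frame interval) over x."""
    return [stft_frame(x, tau, w, n_freq)
            for tau in range(0, len(x) - len(w) + 1, hop)]


# Hypothetical test signal: a 1000 Hz tone sampled at 8000 Hz.
fs = 8000
x = [math.cos(2 * math.pi * 1000 * n / fs) for n in range(256)]
w = hamming(64)
frames = stft(x, w, hop=16, n_freq=64)

# Spectrogram values S = |X|^2, keeping only the bins from 0 to fs/2.
spectrogram = [[abs(v) ** 2 for v in frame[:33]] for frame in frames]
peak_bins = [row.index(max(row)) for row in spectrogram]
print(len(frames), peak_bins[0] * fs / 64)  # number of frames, peak frequency in Hz
```

A larger `hop` lowers the frame rate and the cost; a smaller one gives a smoother display in time without changing the resolution set by the window length.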

Spectrographic Analysis of Speech
The spectrogram is the graphical display of
  \[ S(\omega,\tau) = |X(\omega,\tau)|^2 \]
$S(\omega,\tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
For each window position $\tau$ one could plot $S(\omega,\tau)$ as a function of $\omega$. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$:
  \[ x[n,\tau] = w[n,\tau]\,\{(p[n]*g[n])*h[n]\} = w[n,\tau]\,(p[n]*\hat{h}[n]) \]
where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n]*h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
  \[ S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty}\hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2 \]
where
  \[ \hat{H}(\omega) = H(\omega)\,G(\omega) \]
and where $\omega_k = \frac{2\pi}{P}k$ and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n].
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side lobes are negligible,
  the equation on the previous slide gives the following approximation (Exercise 3.8):
  \[ S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty}\left|\hat{H}(\omega_k)\right|^2\left|W(\omega-\omega_k,\tau)\right|^2 \]

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive that is closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
  \[ S(\omega,\tau) \approx \beta\,|\hat{H}(\omega)|^2\,E[\tau] \]
where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window,
  \[ E[\tau] = \sum_{n=-\infty}^{\infty}|x[n,\tau]|^2 \]

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
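The trade-off between the two window lengths can be checked numerically on an idealized source. The sketch below is an illustration only: it uses a bare impulse train (no vocal tract shaping), an assumed 8 kHz sampling rate, and an assumed 200 Hz pitch so that the 20-ms window spans four pitch periods and the 4-ms window spans less than one. All names are hypothetical.

```python
import cmath
import math

FS = 8000                   # assumed sampling rate in Hz
P = 40                      # assumed pitch period in samples (5 ms, i.e. 200 Hz)
LONG_N = 20 * FS // 1000    # 20-ms window: 160 samples (four pitch periods here)
SHORT_N = 4 * FS // 1000    # 4-ms window: 32 samples (less than one period)


def hamming(nw):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


# Idealized periodic source: impulse train p[n] with period P.
x = [1.0 if n % P == 0 else 0.0 for n in range(800)]


def windowed(tau, w):
    """Segment w[i] * x[tau + i] for a window starting at sample tau."""
    return [w[i] * x[tau + i] for i in range(len(w))]


def mag_at(seg, f_hz):
    """Magnitude of the Fourier transform of seg at frequency f_hz."""
    return abs(sum(s * cmath.exp(-2j * math.pi * f_hz * n / FS)
                   for n, s in enumerate(seg)))


w_long, w_short = hamming(LONG_N), hamming(SHORT_N)

# Frequency view: compare a harmonic (200 Hz) with the midpoint between
# two harmonics (300 Hz).
nb = windowed(0, w_long)
wb = windowed(24, w_short)          # the impulse at n = 40 sits mid-window
nb_ratio = mag_at(nb, 300) / mag_at(nb, 200)
wb_ratio = mag_at(wb, 300) / mag_at(wb, 200)

# Time view: energy under the window as it slides one sample at a time.
e_long = [sum(v * v for v in windowed(t, w_long)) for t in range(200)]
e_short = [sum(v * v for v in windowed(t, w_short)) for t in range(200)]

print("narrowband mid/harmonic ratio:", nb_ratio)
print("wideband   mid/harmonic ratio:", wb_ratio)
print("min energy, long vs short window:", min(e_long), min(e_short))
```

Under these assumptions the long window shows a deep dip between harmonics (the harmonic lines are resolved) while its energy never drops to zero. The short window shows no dip between harmonics, but its energy falls to zero between pitch pulses, which is what appears as vertical striations.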

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram titled "SPECTROGRAM"; Frequency [Hz] (0 to 8000) versus Time [s] (0 to 3).]

MATLAB
[Figure: waveform (Normalized Magnitude versus Time [s]) above its spectrogram (Frequency [Hz] versus Time [s]) for the sentence "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

MATLAB
[Figure: spectrogram titled "SPECTROGRAM"; same axes as the first MATLAB spectrogram.]

MATLAB
[Figure: waveform and spectrogram of the same sentence, with the same axes and labels as above.]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
       the place of the tongue hump along the oral tract, and
       the degree of constriction of the hump.
     The shape is also determined by the possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
    Syllables contain one or more phonemes.
    Words are formed from one or more syllables.
    Phrases are concatenations of words.
If the first two of the classification perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
If the last two (waveform and spectrogram) are used, the study is referred to as acoustic phonetics.

Elements of a Language
One broad classification of English speech sounds is in terms of
  Vowels,
  Consonants,
  Diphthongs,
  Affricates, and
  Semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open.
  Tongue position and height: front, central, or back along the palate.
  Constriction: partial or complete.
  Velum state: nasal sound or not a nasal sound.

Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words.
  In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels
Source
  Quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
System
  Each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram
  The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform
  Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  Articulatory differences among speakers are one cause of allophonic variations:
    the place and degree of constriction of the tongue hump, and
    the cross-section and length of the vocal tract.
  The vocal tract formants will therefore vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source
  Quasi-periodic airflow puffs from the vibrating vocal folds.
System
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
    The constriction is narrower than with vowels.
System
  The location of the constriction by the tongue or lips determines which sound is produced:
    the back, center, or front of the oral tract, as well as
    the teeth or lips.
Spectrogram
  Noise-like.
  Energy is concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  \[ u[n] = g[n]*p[n] \]
  g[n] is the glottal flow over one cycle.
  p[n] is an impulse train with pitch period P.
A simplified model of the periodic component of the voiced fricative at the lips output is
  \[ x_g[n] = h[n]*(g[n]*p[n]) \]
  h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the modulated noise in turn excites the front oral cavity, which has impulse response $h_f[n]$:
  \[ x_q[n] = h_f[n]*(q[n]\,u[n]) \]

Example 3.4
In the simplified model we assume that the results of the two airflow sources add:
  \[ x[n] = x_g[n] + x_q[n] = h[n]*u[n] + h_f[n]*(q[n]\,u[n]) \]
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
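The additive model of Example 3.4 can be sketched numerically. Everything concrete below is an assumption made for illustration: the raised-cosine glottal pulse, the two decaying-cosine impulse responses standing in for h[n] and h_f[n], the pitch period, and the helper names `convolve` and `resonance`. Convolution outputs are truncated to the input length so that the two components can be added sample by sample.

```python
import math
import random


def convolve(a, b):
    """Causal convolution of a with b, truncated to len(a) samples."""
    return [sum(b[m] * a[n - m] for m in range(min(n, len(b) - 1) + 1))
            for n in range(len(a))]


def resonance(freq, decay, length):
    """Toy impulse response: an exponentially decaying cosine."""
    return [decay ** n * math.cos(2 * math.pi * freq * n) for n in range(length)]


random.seed(0)
N, P = 400, 40                                   # number of samples, pitch period

p = [1.0 if n % P == 0 else 0.0 for n in range(N)]                   # impulse train p[n]
g = [0.5 * (1 - math.cos(2 * math.pi * n / 20)) for n in range(20)]  # one glottal pulse g[n]
h = resonance(0.06, 0.95, 60)       # hypothetical vocal tract response h[n]
h_f = resonance(0.35, 0.80, 30)     # hypothetical front-cavity response h_f[n]
q = [random.gauss(0.0, 1.0) for _ in range(N)]                       # white noise q[n]

u = convolve(p, g)                            # u[n] = g[n] * p[n]
x_g = convolve(u, h)                          # x_g[n] = h[n] * u[n]
noise_src = [q[n] * u[n] for n in range(N)]   # q[n] u[n]: noise modulated by glottal flow
x_q = convolve(noise_src, h_f)                # x_q[n] = h_f[n] * (q[n] u[n])
x = [a + b for a, b in zip(x_g, x_q)]         # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by u[n], the noise source is switched off whenever the glottal flow is zero, so in this sketch the frication noise arrives in bursts at the pitch rate.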

Elements of a Language: Fricatives
Spectrogram
  Unvoiced fricatives are characterized by a "noisy" spectrum, while
  voiced fricatives often show both noise and harmonics.
Waveform
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System
  The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the onset of voicing of the vowel.
Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives
Waveform

Elements of a Language: Plosives
Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by
  \[ u[n] = g[n]*p[n] \]
  g[n] is the glottal flow over one cycle.
  p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to a following steady vowel.
  This implies that the vocal tract output cannot be obtained with the convolution operator.
  The vocal tract output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n,m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n,m]$.
  h[n,m] and $h_f[n,m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n-m.
The output can then be written using a generalization of the convolution operator:
  \[ x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m] \]
We have assumed that the two outputs can be linearly combined.
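The generalized convolution of Example 3.5 can be written directly as a double loop. The sketch below is a toy illustration, not the model from the text: the drifting-resonance responses `h` and `h_f`, the glottal pulse, the tap count, and the helper name `tv_filter` are all assumptions, and the sums over m are limited to a finite number of causal taps.

```python
import math


def tv_filter(src, h, n_taps):
    """y[n] = sum_m h(n, m) * src[n - m], with m limited to 0..n_taps-1."""
    return [sum(h(n, m) * src[n - m] for m in range(min(n, n_taps - 1) + 1))
            for n in range(len(src))]


N, P, TAPS = 200, 40, 50


def h(n, m):
    """Hypothetical vocal tract response whose resonance drifts with time n."""
    freq = 0.05 + 0.10 * min(n, 100) / 100      # cycles/sample, steady after 100 samples
    return 0.95 ** m * math.cos(2 * math.pi * freq * m)


def h_f(n, m):
    """Hypothetical front-cavity response, also allowed to vary with n."""
    return 0.7 ** m * math.cos(2 * math.pi * (0.30 + 0.0005 * n) * m)


p = [1.0 if k % P == 0 else 0.0 for k in range(N)]                   # impulse train p[n]
g = [0.5 * (1 - math.cos(2 * math.pi * k / 20)) for k in range(20)]  # glottal pulse g[n]
# u[n] = g[n] * p[n]: ordinary convolution is the time-invariant special case.
u = tv_filter(p, lambda n, m: g[m], len(g))
burst = [1.0] + [0.0] * (N - 1)                 # delta[n], burst at n = 0

x_periodic = tv_filter(u, h, TAPS)              # sum_m h[n, m] u[n - m]
x_burst = tv_filter(burst, h_f, TAPS)           # sum_m h_f[n, m] delta[n - m]
x = [a + b for a, b in zip(x_periodic, x_burst)]
```

With an impulse input the second sum collapses to the single term h_f[n, n], and when h(n, m) does not depend on n the function reduces to ordinary convolution.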

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds
Semi-Vowels
  Two categories of vowel-like sounds:
    Glides (/w/ as in "we" and /y/ as in "you"), and
    Liquids (/r/ as in "read" and /l/ as in "let").
Glides
  Greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement, compared to diphthongs.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have
  a fricative portion preceded by a complete constriction of the oral cavity,
  formed at the same place as for the plosive.
Examples:
  /tS/ as in "chew": unvoiced
  /J/ as in "just": voiced

Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time.
    "horse" vs. "horseshoe"
    "sweep" vs. "seep"
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different
  meaning,
  stress, and
  emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 38: Speech Processing

Figure 3.12

Nasal Sounds

Spectral Shaping
Nasal and oral components of the vocal tract are coupled by the velum.
When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: "Nose"

Spectral Shaping: "Mouse"

Spectral Shaping
Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
Two dominant effects characterize nasalization:
  Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage.
  Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation
The previous section discussed the effect of the vocal tract shape on sound production.
Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
  Build-up of pressure behind the palate, and
  abrupt release of pressure.

Source Generation: Plosives, "Drop"
[Figure labels: VOT, Aspiration]

Fricatives

Source Generation
Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  There is no harmonic structure with these types of inputs.
  The source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the noise spectrum was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.
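The contrast between a source without harmonic structure and a periodic source can be checked on the idealized impulsive case. The sketch below is an illustration under assumptions made here: a toy decaying-cosine response stands in for h[n], the impulse source has an exactly flat spectrum, and the periodic source is ten pitch pulses with an assumed period of 20 samples. The noise case is not simulated.

```python
import cmath
import math


def convolve_full(a, b):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            y[i + j] += av * bv
    return y


def dtft(seq, omega):
    """Fourier transform of a finite sequence at frequency omega (rad/sample)."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(seq))


P = 20                                                            # assumed pitch period
h = [0.9 ** n * math.cos(0.2 * math.pi * n) for n in range(80)]   # toy response h[n]
impulse = [1.0]                                                   # idealized impulsive source
train = [1.0 if n % P == 0 else 0.0 for n in range(10 * P)]       # ten pitch pulses

x_imp = convolve_full(impulse, h)
x_per = convolve_full(train, h)

on_harmonic = 2 * (2 * math.pi / P)       # second harmonic of the pulse train
off_harmonic = 1.5 * (2 * math.pi / P)    # halfway between two harmonics

for name, om in (("on harmonic", on_harmonic), ("between harmonics", off_harmonic)):
    print(name, abs(dtft(h, om)), abs(dtft(x_imp, om)), abs(dtft(x_per, om)))
```

For the impulse input the output magnitude equals the magnitude of the response at every frequency tested, while for the pulse train it is large on a harmonic and essentially zero between harmonics.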

Source Generation: Fricatives, "NASA"

Source Generation
There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal

source Unvoiced Speech sounds not generated with periodic

glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the

moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral

tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing

the vocal folds but without oscillations Example ldquoherdquo

However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example

ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech Speech waveform consists of a sequence of

different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic

signal of the word ldquotordquo cannot capture this time-varying frequency content

In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding

(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to

avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum

Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1

Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal

wherex[n]= w[n]x[n]

represents the windowed speech segments as function of the window center at time

n

njenxX ][)(

April 22 2023 Veton Keumlpuska 56

Spectrographic Analysis of Speech The spectrogram is graphically displayed as

S() = |X()|2

S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal

For each window position one could plot S() A better and more compact representation of time-frequency

display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page

This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms

Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies

Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the

output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]

x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]

Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as

frequency fundametal theis2 and 2 whereand

)()()(~where

)()(~1)(2

2

Pk

P

GHH

WHP

S

k

kk

k

April 22 2023 Veton Keumlpuska 61

Spectrographic Analysis of Speech Difference of narrowband and wideband

spectrogram is in the length of the (analysis) window w[n]

Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at

least two pitch periods Under the conditions that

The main lobes of shifted window Fourier transforms are non-overlapping and that

Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)

k

kk WHP

S 22

2 )()(~1)(

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech Narrowband Spectrogram (cont)

Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram

Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of SpeechWideband Spectrogram

Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform

(recall the uncertainty principle) Widening of Fourier transform will cause neighboring

harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions

From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)

Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window

][)(~)(2

EHS k

n

nxE 2][][

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather

than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding

through fluctuating energy regions of the speech waveform

Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

April 22 2023 Veton Keumlpuska 66

Figure 315

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr); the utterance is bracketed by the label h at both ends]

MATLAB

[Figure: spectrogram; horizontal axis Time [s], 0 to 3; vertical axis Frequency [Hz], 0 to 8000]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", with the same word and phone labels as on the earlier MATLAB slide]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.

2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.

3. The time-domain waveform, which gives the pressure change with time at the lips output.

4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.

If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification of English speech sounds is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features.

Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open.

Tongue position and height: front, central, or back along the palate.

Constriction: partial or complete.

Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic. Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences among speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.

The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.

Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source: for unvoiced fricatives the vocal folds are relaxed and not vibrating; for voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction formed by the tongue or lips determines which sound is produced: at the back, center, or front of the oral tract, as well as at the teeth or lips.

Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4 (cont.)

In the simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored: u[n] is modified by the oral cavity; x_q[n] can be influenced by the back cavity; sources of non-linear effects (distributed sources due to traveling vortices).
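The simplified model of Example 3.4 can be sketched in a few lines. This is an illustrative sketch only: the glottal pulse g[n] (a raised cosine), the impulse responses h[n] and h_f[n] (damped cosines), and the function name `voiced_fricative` are assumptions made for the demonstration, not values from the slides.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(n_samples, pitch_period, g, h, h_f, seed=0):
    """Return (x, x_g, x_q, u) for x[n] = h*u + h_f*(q u), with u = g*p."""
    rng = random.Random(seed)
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    u = convolve(g, p)[:n_samples]                        # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]   # white noise q[n]
    x_g = convolve(h, u)[:n_samples]                      # periodic component
    modulated = [qn * un for qn, un in zip(q, u)]         # q[n] u[n]
    x_q = convolve(h_f, modulated)[:n_samples]            # noise component
    x = [a + b for a, b in zip(x_g, x_q)]
    return x, x_g, x_q, u


# Assumed toy shapes: 40-sample glottal pulse, two damped resonances.
g = [0.5 * (1 - math.cos(2 * math.pi * k / 40)) for k in range(40)]
h = [math.exp(-k / 15.0) * math.cos(0.3 * k) for k in range(60)]
h_f = [math.exp(-k / 4.0) * math.cos(2.0 * k) for k in range(20)]

x, x_g, x_q, u = voiced_fricative(400, 80, g, h, h_f)
print(len(x), max(abs(v) for v in x_q))
```

Because the noise is multiplied by u[n], the noise component is switched on and off at the pitch rate: it vanishes wherever the glottal flow is zero, before being shaped by the front cavity.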

Elements of a Language: Fricatives

Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:

1. Complete closure of the oral tract and buildup of air pressure.

2. Release of air pressure and generation of turbulence over a very short time duration.

3. Generation of aspiration due to turbulence at the open vocal folds.

4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the onset of voicing of the vowel.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives

Waveform

Elements of a Language: Plosives

Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n, m].

h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
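The generalized convolution of Example 3.5 can be written directly as a double loop. The sketch below is illustrative only: the two time-varying responses (`h` with a gliding resonance frequency, `h_f` with a fixed one), the glottal pulse shape, and the truncation parameter `max_lag` are assumptions chosen for the demonstration.

```python
import math


def time_varying_output(u, burst, h, h_f, max_lag):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) burst[n - m], 0 <= m <= max_lag."""
    x = []
    for n in range(len(u)):
        total = 0.0
        for m in range(min(n, max_lag) + 1):
            total += h(n, m) * u[n - m] + h_f(n, m) * burst[n - m]
        x.append(total)
    return x


def h(n, m):
    """Toy vocal tract: a damped resonance whose frequency glides as n grows."""
    freq = 0.30 - 0.10 * min(n, 200) / 200   # radians per sample, assumed glide
    return math.exp(-m / 15.0) * math.cos(freq * m)


def h_f(n, m):
    """Toy front cavity: a short, fixed high-frequency resonance."""
    return math.exp(-m / 4.0) * math.cos(2.0 * m)


n_samples, pitch_period = 400, 80
g = [0.5 * (1 - math.cos(2 * math.pi * k / 40)) for k in range(40)]   # one glottal cycle
# u[n] = g[n] * p[n]: because g is shorter than the pitch period, it simply repeats.
u = [g[n % pitch_period] if n % pitch_period < len(g) else 0.0 for n in range(n_samples)]
burst = [1.0] + [0.0] * (n_samples - 1)                                # delta[n]

x = time_varying_output(u, burst, h, h_f, max_lag=60)
print(len(x), x[:3])
```

If `h` and `h_f` ignore their first argument, the loop reduces to ordinary convolution; the dependence on n is what lets the response change while the tract moves from the burst toward the vowel.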

Elements of a Language: Transitional Speech Sounds

Diphthongs have a vowel-like nature, with vibrating vocal folds.

They do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.

They are characterized by movement from one vowel target to another.

Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds

Semi-vowels comprise two categories of vowel-like sounds: glides (w as in "we" and y as in "you") and liquids (r as in "read" and l as in "let").

Compared with diphthongs, glides have greater constriction of the oral tract during the transition and greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference from fricatives is that an affricate has a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: tS as in "chew" (unvoiced); J as in "just" (voiced).

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.

Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.

Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, and so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").

Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping: Nose
  • Spectral Shaping: Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation: Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation: Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language: Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language: Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language: Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language: Fricatives (2)
  • Elements of a Language: Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language: Plosives
  • Elements of a Language: Plosives (2)
  • Elements of a Language: Plosives (3)
  • Elements of a Language: Plosives (4)
  • Elements of a Language: Transitional Speech Sounds
  • Elements of a Language: Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language: Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody: The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation
Page 39: Speech Processing

Nasal Sounds

Spectral Shaping

The nasal and oral components of the vocal tract are coupled by the velum.

When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.

The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

Two dominant effects characterize nasalization:

Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.

Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.
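Both effects of nasalization can be seen in a small pole-zero model. The sketch below is a hedged illustration, not a model from the slides: the formant frequency, the two bandwidths, and the anti-resonance frequency are assumed values, and `conjugate_pair` and `gain` are hypothetical helper names.

```python
import cmath
import math


def conjugate_pair(freq_hz, bandwidth_hz, fs):
    """Coefficients [1, a1, a2] of (1 - c z^-1)(1 - conj(c) z^-1)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2 * math.pi * freq_hz / fs
    return [1.0, -2.0 * r * math.cos(theta), r * r]


def gain(zeros, poles, freq_hz, fs):
    """Magnitude of B(e^{jw}) / A(e^{jw}) with numerator `zeros` and denominator `poles`."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    num = sum(c * z_inv ** k for k, c in enumerate(zeros))
    den = sum(c * z_inv ** k for k, c in enumerate(poles))
    return abs(num / den)


fs = 8000                                    # assumed sampling rate in Hz
oral = conjugate_pair(500, 60, fs)           # narrow oral formant (assumed values)
nasalized = conjugate_pair(500, 150, fs)     # same formant with broadened bandwidth
anti = conjugate_pair(1000, 100, fs)         # assumed anti-resonance (zero pair)

print("peak gain, oral formant:     ", gain([1.0], oral, 500, fs))
print("peak gain, broadened formant:", gain([1.0], nasalized, 500, fs))
print("gain at 1000 Hz without zero:", gain([1.0], nasalized, 1000, fs))
print("gain at 1000 Hz with zero:   ", gain(anti, nasalized, 1000, fs))
```

Broadening the bandwidth lowers and flattens the formant peak, and the zero pair carves a dip into the response near its own frequency.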

Plosives

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive): pressure builds up behind the palate and is then abruptly released.

Source Generation: Plosives "Drop"

[Figure labels: VOT (voice onset time), Aspiration]

Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source). There is no harmonic structure with these types of inputs; the source spectrum is shaped at all frequencies by |H(ω)|.

Note that the noise source was idealized as having a flat spectrum. In reality, these sources themselves have a non-flat spectral shape.

Source Generation: Fricatives "NASA"

Source Generation

There is another class of source generated within the vocal tract; it is less understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.

A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:

Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".

Plosives: created with an impulsive source within the oral tract. Example: "top".

Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, the fricative and plosive categories are not exclusively unvoiced: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so these subcategories also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives).

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: the word "to".

A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.

This window, w[n], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: the Hamming window

$$w[n, \tau] = 0.54 - 0.46\cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n - \tau \le N_w - 1$$

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window position at time τ.
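The sliding-window analysis can be sketched with a direct (slow but transparent) DFT. This is a minimal illustration under stated assumptions: an 8-kHz sampling rate, a 20-ms window, and a 10-ms frame interval; the function names `hamming`, `stft`, and `squared_magnitude` are hypothetical, and the linear-phase factor that comes from referencing n to the start of the signal is dropped because it does not affect the magnitude.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w: 0.54 - 0.46 cos(2 pi n / (n_w - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft(x, n_w, hop, n_fft):
    """One list of n_fft // 2 + 1 complex DFT values per window position."""
    w = hamming(n_w)
    frames = []
    for start in range(0, len(x) - n_w + 1, hop):
        seg = [w[n] * x[start + n] for n in range(n_w)]   # windowed segment
        spectrum = []
        for k in range(n_fft // 2 + 1):
            omega = 2 * math.pi * k / n_fft
            spectrum.append(sum(s * cmath.exp(-1j * omega * n) for n, s in enumerate(seg)))
        frames.append(spectrum)
    return frames


def squared_magnitude(frames):
    """|X|^2 for every frequency bin of every frame."""
    return [[abs(v) ** 2 for v in frame] for frame in frames]


fs = 8000   # assumed sampling rate in Hz
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)]   # 1-kHz test tone
S = squared_magnitude(stft(tone, n_w=160, hop=80, n_fft=160))
peak_bin = max(range(len(S[0])), key=lambda k: S[0][k])
print("frames:", len(S), "peak frequency [Hz]:", peak_bin * fs / 160)
```

The `hop` argument plays the role of the frame interval: a smaller hop gives more window positions per second without changing the resolution of any single frame.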

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

S(ω, τ) is a two-dimensional representation of the "energy density" of the signal.

For each window position τ one could plot S(ω, τ). A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:

Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.

Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

The windowed waveform is therefore (with * denoting convolution)

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over one cycle and the vocal tract impulse response have been combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau)\right|^2$$

where

$$\hat{H}(\omega) = H(\omega)\,G(\omega)$$

and where ω_k = (2π/P)k, with 2π/P the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n].

Narrowband Spectrogram

Uses a "long" window, with a duration of typically at least two pitch periods.

Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the equation in the previous slide leads to the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$
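Whether the harmonics are resolved can be checked numerically by comparing the spectral magnitude at a harmonic with the magnitude midway between two harmonics. The sketch below is an illustration under simplifying assumptions: the signal is an ideal impulse train (so there is no vocal tract or glottal shaping), the pitch period is 80 samples, and the function names are hypothetical.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def windowed_dtft_mag(x, n_w, omega):
    """|sum_n w[n] x[n] e^{-j omega n}| for a window starting at n = 0."""
    w = hamming(n_w)
    return abs(sum(w[n] * x[n] * cmath.exp(-1j * omega * n) for n in range(n_w)))


def harmonic_contrast(x, n_w, pitch_period, k):
    """Magnitude at harmonic k divided by the magnitude midway to harmonic k + 1."""
    on_harmonic = 2 * math.pi * k / pitch_period
    between = 2 * math.pi * (k + 0.5) / pitch_period
    return windowed_dtft_mag(x, n_w, on_harmonic) / windowed_dtft_mag(x, n_w, between)


pitch_period = 80   # samples; 100 Hz at an assumed 8-kHz sampling rate
# Idealized voiced excitation: unit impulses at n = 20, 100, 180, 260.
x = [1.0 if n % pitch_period == 20 else 0.0 for n in range(4 * pitch_period)]

print("window of 4 pitch periods:", harmonic_contrast(x, 4 * pitch_period, pitch_period, k=5))
print("window of half a period:  ", harmonic_contrast(x, pitch_period // 2, pitch_period, k=5))
```

With the long window the contrast is large, so the harmonics stand out as separate lines; with the short window only one impulse falls under the window, the spectrum is flat, and the contrast is essentially 1.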

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

A long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).

Shortening the window widens its Fourier transform (recall the uncertainty principle).

The widened window transforms centered on neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.

From the temporal perspective, since the window is shorter than a pitch period, it "sees" essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\,|\hat{H}(\omega)|^2\,E[\tau]$$

where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

The wideband spectrogram shows the formants of the vocal tract in frequency.

It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram; horizontal axis Time [s], 0 to 3; vertical axis Frequency [Hz], 0 to 8000]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr); the utterance is bracketed by the label h at both ends]

MATLAB

[Figure: spectrogram; horizontal axis Time [s], 0 to 3; vertical axis Frequency [Hz], 0 to 8000]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) and spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance "She had your dark suit in greasy wash water all year", with the same word and phone labels as on the earlier MATLAB slide]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.

2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.

3. The time-domain waveform, which gives the pressure change with time at the lips output.

4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.

If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification of English speech sounds is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features.

Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open.

Tongue position and height: front, central, or back along the palate.

Constriction: partial or complete.

Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language Vowels

  • Source: quasi-periodic. Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, vowel characteristics vary widely among speakers. Articulatory differences between speakers are one cause of allophonic variation:

  • the place and degree of constriction of the tongue hump, and
  • the cross-section and length of the vocal tract.

The vocal tract formants therefore vary with the speaker.
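To make the dependence of formants on vocal tract length concrete, here is a rough sketch. It assumes the common idealization of the vocal tract as a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). The function name, the speed of sound, and the two tract lengths are illustrative assumptions, not values taken from the slides.

```python
def uniform_tube_formants(length_m, count=3, speed_of_sound=350.0):
    """Resonances (Hz) of an idealized uniform tube closed at one end.

    F_k = (2k - 1) * c / (4 * L) for k = 1..count. This quarter-wavelength
    model is an assumption; a real vocal tract is not a uniform tube.
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, count + 1)]


# Illustrative (assumed) tract lengths for two different speakers.
longer_tract = uniform_tube_formants(0.175)
shorter_tract = uniform_tube_formants(0.14)
```

Under this idealization, the shorter tract has every resonance shifted upward, which is one simple way to see why formants differ across speakers producing the same vowel.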

Figure 3.18

Figure 3.19

Elements of a Language Nasals

  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      • Dominated by the low resonance of the large volume of the nasal cavity.
      • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

  • Source:
      • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
  • System: the location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
  • Spectrogram: noise-like; energy is concentrated at higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

  $u[n] = g[n] \ast p[n]$

where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $\ast$ denotes convolution.

In a simplified model, the periodic component of the output at the lips is

  $x_g[n] = h[n] \ast (g[n] \ast p[n])$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

  $x_q[n] = h_f[n] \ast (q[n]\,u[n])$

Assuming in this simplified model that the results of the two airflow sources add, the complete output is

  $x[n] = x_g[n] + x_q[n] = h[n] \ast u[n] + h_f[n] \ast (q[n]\,u[n])$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:

  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
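The simplified voiced-fricative model of Example 3.4 can be sketched numerically. The glottal pulse, the two impulse responses, the pitch period, and all function names below are hypothetical toy values chosen only to make the structure visible; they are not taken from the slides.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, hf, period, length, seed=0):
    """Return (u, xg, xq, x) for the additive two-source model."""
    rng = random.Random(seed)
    u = convolve(g, impulse_train(length, period))[:length]   # u = g * p
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]          # white noise q[n]
    xg = convolve(h, u)[:length]                              # periodic part
    modulated = [qn * un for qn, un in zip(q, u)]             # q[n] u[n]
    xq = convolve(hf, modulated)[:length]                     # noise part
    x = [a + b for a, b in zip(xg, xq)]                       # x = xg + xq
    return u, xg, xq, x


# Hypothetical toy values (assumptions for illustration only).
g = [0.0, 0.5, 1.0, 0.5]      # glottal flow over one cycle
h = [1.0, 0.6, 0.2]           # vocal tract impulse response
hf = [1.0, -0.5]              # front-cavity impulse response
u, xg, xq, x = voiced_fricative(g, h, hf, period=8, length=32)
```

Because the noise is multiplied by $u[n]$, the noise component in this sketch is nonzero only around the open phase of each glottal cycle, so the noise arrives in bursts at the pitch rate.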

Elements of a Language Fricatives

  • Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  • Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language Whisper

  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  • Sequence of events:
      1. Complete closure of the oral tract and buildup of air pressure.
      2. Release of air pressure and generation of turbulence over a very short time duration.
      3. Generation of aspiration due to turbulence at the open vocal folds.
      4. Onset of the following vowel about 40-50 ms after the burst.
  • With voiced plosives, the vocal folds vibrate for the duration of all four steps.
      • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
      • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
      • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language Plosives Waveform

Elements of a Language Plosives Spectrogram

Elements of a Language Plosives

Example 3.5: A time-varying system model for the voiced plosive

  • A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

      $u[n] = g[n] \ast p[n]$

    where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
  • Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
  • In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
  • $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.
  • The output can then be written using a generalization of the convolution operator:

      $x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$

    We have assumed that the two outputs can be linearly combined.
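The generalized convolution of Example 3.5 can be sketched as follows. The two response functions and the `memory` truncation are hypothetical choices made for this illustration; the slides do not specify any particular $h[n, m]$ or $h_f[n, m]$.

```python
def time_varying_output(h, hf, u, memory):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) delta[n - m].

    h and hf are callables (n, m) -> float. Both responses are assumed to be
    zero for m >= memory (a truncation introduced for this sketch).
    """
    x = []
    for n in range(len(u)):
        last_m = min(n, memory - 1)      # u[n - m] is taken as 0 for n - m < 0
        periodic = sum(h(n, m) * u[n - m] for m in range(last_m + 1))
        # delta[n - m] is nonzero only at m = n, so the burst term is hf(n, n).
        burst = hf(n, n) if n < memory else 0.0
        x.append(periodic + burst)
    return x


# Hypothetical responses: the decay rate of h drifts with time n,
# mimicking a tract shape that changes after the burst.
def h(n, m):
    return (0.5 + 0.3 * min(n, 20) / 20.0) ** m


def hf(n, m):
    return 0.6 ** m


u = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]   # stand-in for g * p
x = time_varying_output(h, hf, u, memory=16)
```

When `h(n, m)` does not depend on `n`, the first sum reduces to ordinary convolution, which gives a quick sanity check on the implementation.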

Elements of a Language Transitional Speech Sounds

Diphthongs

  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

Elements of a Language Transitional Speech Sounds

Semi-Vowels

  • Two categories of vowel-like sounds:
      • Glides (w as in "we" and y as in "you"), and
      • Liquids (r as in "read" and l as in "let").
  • Glides, compared to diphthongs, have
      • greater constriction of the oral tract during the transition, and
      • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language Transitional Speech Sounds

  • Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  • Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
      • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
      • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody The Melody of Speech

  • The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
      • intonation (change in pitch),
      • amplitude/energy (loudness), and
      • timing (articulation rate or rhythm).
  • These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Spectral Shaping

  • Nasal and oral components of the vocal tract are coupled by the velum.
  • When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
  • The resulting sounds have a spectrum dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Spectral Shaping Nose

Spectral Shaping Mouse

Spectral Shaping

  • Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
  • The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
  • Two dominant effects characterize nasalization:
      • broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage, and
      • introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
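The effect of an anti-resonance (a zero in the transfer function) can be illustrated with a small pole-zero sketch. The pole and zero radii and angles below are arbitrary illustrative assumptions, not measured nasal-tract values, and only the zero effect is shown (not the bandwidth broadening).

```python
import cmath
import math


def second_order_section(omega, radius, angle):
    """Evaluate 1 - 2 r cos(angle) e^{-j omega} + r^2 e^{-2j omega}."""
    z_inv = cmath.exp(-1j * omega)
    return 1 - 2 * radius * math.cos(angle) * z_inv + radius ** 2 * z_inv ** 2


def magnitude(omega, pole=(0.95, 0.3), zero=(0.95, 1.2)):
    """|H(omega)| for one conjugate pole pair and one conjugate zero pair.

    The (radius, angle) pairs are assumed values: a low-frequency resonance
    at 0.3 rad/sample and an anti-resonance at 1.2 rad/sample.
    """
    numerator = second_order_section(omega, *zero)
    denominator = second_order_section(omega, *pole)
    return abs(numerator / denominator)


at_resonance = magnitude(0.3)
at_antiresonance = magnitude(1.2)
```

The magnitude peaks near the pole angle and dips sharply near the zero angle, which is the spectral signature the slide attributes to energy absorbed by the side branch.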

Plosives

Source Generation

  • The previous section discussed the effect of vocal tract shape on sound production.
  • Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive):
      • build-up of pressure behind the palate, and
      • abrupt release of pressure.

Source Generation Plosives "Drop"

[Figure: the word "drop", with the VOT (voice onset time) and aspiration labeled.]

Fricatives

Source Generation

  • Another sound source is created when the tongue is very close to the palate (but does not completely impede the airflow); the constriction generates turbulence and thus a noise source (e.g., fricatives).
  • As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., an impulse or a noise source).
      • There is no harmonic structure with these types of inputs.
      • The source spectrum is shaped at all frequencies by $|H(\omega)|$.
  • Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.
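Spectral shaping of a flat-spectrum source can be checked numerically. The sketch below passes an idealized impulse through a two-pole resonator and compares the output spectrum with the analytic $|H(\omega)|$. The resonator, its radius and angle, and the helper names are assumptions standing in for a vocal tract response.

```python
import cmath
import math


def resonator(x, radius, angle):
    """Two-pole filter: y[n] = x[n] + 2 r cos(angle) y[n-1] - r^2 y[n-2]."""
    a1 = 2 * radius * math.cos(angle)
    a2 = -radius ** 2
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + a1 * y1 + a2 * y2)
    return y


def frequency_response(omega, radius, angle):
    """Analytic H(omega) of the same two-pole filter."""
    z_inv = cmath.exp(-1j * omega)
    return 1 / (1 - 2 * radius * math.cos(angle) * z_inv + radius ** 2 * z_inv ** 2)


def dft_bin(x, k):
    """Single DFT coefficient of x at bin k."""
    total = len(x)
    return sum(v * cmath.exp(-2j * math.pi * k * n / total)
               for n, v in enumerate(x))


N = 256
impulse = [1.0] + [0.0] * (N - 1)     # idealized impulsive source, flat spectrum
output = resonator(impulse, radius=0.9, angle=math.pi / 4)
```

Since the impulse has a flat spectrum, the output spectrum at each bin is essentially $|H(\omega)|$ itself; there is no harmonic structure, only the envelope. A white-noise source would show the same envelope on average.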

Source Generation Fricatives "NASA"

Source Generation

  • There is another class of source generated within the vocal tract, which is less well understood than the noisy and impulsive sources at oral tract constrictions. This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.

Categorization of Sound By Source

  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
      • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
      • Plosives: created with an impulsive source within the oral tract. Example: "top".
      • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".
  • However, these categories do not relate exclusively to the sound source: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives), "bin" vs. "pin" (plosives).
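One simple way to tell a periodic source from a noisy one in a signal is to look for a strong peak in the normalized autocorrelation. This heuristic is not described in the slides; it is a common technique swapped in here for illustration, and the lag range, threshold, and test signals are all assumptions.

```python
import math
import random


def autocorr_peak(x, min_lag, max_lag):
    """Largest normalized autocorrelation value over the given lag range."""
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        best = max(best, r / energy)
    return best


def looks_voiced(x, min_lag=20, max_lag=100, threshold=0.5):
    """Heuristic: treat a frame as voiced if it is strongly periodic."""
    return autocorr_peak(x, min_lag, max_lag) > threshold


rng = random.Random(0)
periodic = [math.sin(2 * math.pi * n / 50) for n in range(400)]   # toy "voiced" frame
noisy = [rng.gauss(0.0, 1.0) for _ in range(400)]                 # toy "unvoiced" frame
```

A voiced fricative such as the one in "zebra" contains both a periodic and a noisy component, so a single threshold like this would not separate all of the subcategories listed above.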

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

  • A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Example: for the word "to", a single Fourier transform of the entire acoustic signal cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

  • Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window. This window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example, the Hamming window:

      $w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$

  • The window typically does not move one sample at a time; it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
  • The discrete-time STFT is

      $X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$

    where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window position $\tau$.

Spectrographic Analysis of Speech

  • The spectrogram is graphically displayed as

      $S(\omega, \tau) = |X(\omega, \tau)|^2$

  • $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
  • For each window position $\tau$, one could plot $S(\omega, \tau)$. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
      • Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
      • Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
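The windowing, STFT, and spectrogram definitions can be put together in a minimal sketch. It evaluates the STFT on a uniform frequency grid and slides the window by a fixed hop; the grid size, hop, test signal, and function names are assumptions made for illustration.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w (the taper used as the analysis window)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1))
            for n in range(n_w)]


def stft_frame(x, tau, window, n_freq):
    """X(omega_k, tau) at omega_k = 2*pi*k/n_freq; the window starts at sample tau."""
    frame = []
    for k in range(n_freq):
        omega = 2 * math.pi * k / n_freq
        frame.append(sum(w * x[tau + i] * cmath.exp(-1j * omega * (tau + i))
                         for i, w in enumerate(window)))
    return frame


def spectrogram(x, window, hop, n_freq):
    """S(omega_k, tau) = |X(omega_k, tau)|^2 for tau = 0, hop, 2*hop, ..."""
    return [[abs(value) ** 2 for value in stft_frame(x, tau, window, n_freq)]
            for tau in range(0, len(x) - len(window) + 1, hop)]


# Toy input: a sinusoid that falls on bin 4 of a 32-point frequency grid.
signal = [math.cos(2 * math.pi * 4 * n / 32) for n in range(128)]
S = spectrogram(signal, hamming(32), hop=16, n_freq=32)
```

Each row of `S` is one window position and each column one frequency, so plotting `S` as an image with intensity for the values gives the two-dimensional display described above.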

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

  $x[n, \tau] = w[n, \tau]\,\{(p[n] \ast g[n]) \ast h[n]\} = w[n, \tau]\,(p[n] \ast \hat{h}[n])$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] \ast h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

  $S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$

where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $W(\omega, \tau)$ is the Fourier transform of the window $w[n, \tau]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram

  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that
      • the main lobes of the shifted window Fourier transforms are non-overlapping, and
      • the corresponding transform side lobes are negligible,
    the following approximation to the equation in the previous slide holds (Exercise 3.8):

      $S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
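The resolved harmonic lines of the narrowband case can be seen on a toy signal. The sketch below uses a bare impulse train as a stand-in for voiced speech and a Hamming window spanning four pitch periods; the pitch period, window length, and names are assumptions.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1))
            for n in range(n_w)]


def windowed_spectrum(x, window, n_freq):
    """|X(omega_k)|^2 for the segment of x under `window`, starting at n = 0."""
    return [abs(sum(w * x[n] * cmath.exp(-2j * math.pi * k * n / n_freq)
                    for n, w in enumerate(window))) ** 2
            for k in range(n_freq)]


P = 20                                                    # pitch period in samples (assumed)
x = [1.0 if n % P == 0 else 0.0 for n in range(200)]      # toy periodic source
narrowband = windowed_spectrum(x, hamming(4 * P), n_freq=4 * P)
```

With an 80-point frequency grid, the harmonics $\omega_k = 2\pi k / P$ fall on every fourth bin, and the spectrum shows a distinct peak at each of them with deep valleys halfway between: the harmonic lines are resolved.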

Spectrographic Analysis of Speech Wideband Spectrogram

  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The wider window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech Wideband Spectrogram (cont.)

  • For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

      $S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$

    where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

      $E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
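The origin of the vertical striations can be sketched by computing the short-time energy $E[\tau]$ with a window shorter than the pitch period. The bare impulse train, the pitch period, and the window length are toy assumptions; real voiced speech has a decaying response after each glottal pulse rather than isolated impulses.

```python
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1))
            for n in range(n_w)]


def short_time_energy(x, window, tau):
    """E[tau]: energy of the windowed segment whose window starts at sample tau."""
    return sum((w * x[tau + i]) ** 2 for i, w in enumerate(window))


P = 20                                                    # pitch period in samples (assumed)
x = [1.0 if n % P == 0 else 0.0 for n in range(200)]      # toy periodic source
window = hamming(10)                                      # shorter than one pitch period
energy = [short_time_energy(x, window, tau) for tau in range(180)]
```

The energy rises whenever the short window passes over a pulse and falls to zero between pulses, repeating every pitch period. Since the wideband spectrogram is roughly $|\hat{H}(\omega)|^2$ scaled by $E[\tau]$, this fluctuation appears as vertical striations.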

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram; axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000).]

MATLAB

[Figure: waveform (normalized magnitude) and spectrogram (Frequency [Hz] versus Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr).]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract. Speech sounds can also be classified from the following perspectives:

  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
      • the place of the tongue hump along the oral tract,
      • the degree of constriction of the hump, and
      • the possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

  • Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
      • Syllables contain one or more phonemes.
      • Words are formed from one or more syllables.
      • Phrases are concatenations of words.
  • If the first two descriptors are used to study speech sounds, this is referred to as articulatory phonetics.
  • If the last two descriptors are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

April 22 2023 Veton Keumlpuska 87

Elements of a Language Fricatives Spectrogram

Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while

Voiced fricatives often show both noise and harmonics

Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on

quasi-periodic signal

Elements of a Language Whisper Whisper

Forms a class of its own under general category of Consonants

Turbulent flow is produced at the glottis rather than at the vocal tract constriction

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 324 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 323

April 22 2023 Veton Keumlpuska 91

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps:
  • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and $*$ denotes convolution.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to a following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$. Here $h[n, m]$ and $h_f[n, m]$ represent the responses at time n to a unit sample applied m samples earlier, at time n - m.

Assuming that the two outputs combine linearly, the output can be written with a generalization of the convolution operator:

$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$
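The generalized convolution in Example 3.5 can be sketched numerically. The following is a minimal Python illustration rather than the model from the example itself: the gliding decaying-cosine responses, the sampling rate, the pitch period, and all function names are assumptions chosen only to make the two sums concrete, and the glottal pulse g[n] is taken to be a unit sample so that u[n] reduces to the impulse train p[n].

```python
import math

FS = 8000.0   # sampling rate in Hz (assumed)
N = 200       # number of output samples (assumed)
P = 50        # pitch period in samples (assumed)

def time_varying_output(source, h, n_len):
    """x[n] = sum_m h(n, m) * source[n - m], where h(n, m) is the response
    at time n to a unit sample applied m samples earlier."""
    out = []
    for n in range(n_len):
        acc = 0.0
        for m in range(n + 1):              # causal: source[n - m] needs n - m >= 0
            acc += h(n, m) * source[n - m]
        out.append(acc)
    return out

def gliding_resonance(f_start, f_end, decay):
    """Hypothetical time-varying impulse response: a decaying cosine whose
    frequency moves linearly from f_start to f_end over the N samples."""
    def h(n, m):
        f = f_start + (f_end - f_start) * n / (N - 1)
        return (decay ** m) * math.cos(2.0 * math.pi * f * m / FS)
    return h

u = [1.0 if n % P == 0 else 0.0 for n in range(N)]    # periodic source (g = unit sample)
burst = [1.0 if n == 0 else 0.0 for n in range(N)]    # idealized burst, delta[n]

h_tract = gliding_resonance(500.0, 700.0, 0.95)       # vocal tract, changing shape
h_front = gliding_resonance(3000.0, 3000.0, 0.70)     # front cavity, held fixed here

x_periodic = time_varying_output(u, h_tract, N)
x_burst = time_varying_output(burst, h_front, N)
x = [a + b for a, b in zip(x_periodic, x_burst)]      # linear combination of the outputs
```

Because the burst is a single impulse at n = 0, its contribution collapses to $h_f[n, n]$, the front-cavity response n samples after the release.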

Elements of a Language: Transitional Speech Sounds

Diphthongs:
  • Have a vowel-like nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Are characterized by movement from one vowel target to another.
  • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds:
  • Glides (/w/ as in "we" and /y/ as in "you"), and
  • Liquids (/r/ as in "read" and /l/ as in "let").

Glides, compared to diphthongs, have:
  • greater constriction of the oral tract during the transition, and
  • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. Compared to fricatives, affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
  • /tS/ as in "chew" - unvoiced
  • /J/ as in "just" - voiced

Coarticulation

  • Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local - articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global - articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Spectral Shaping: Nose

Spectral Shaping: Mouth

Spectral Shaping

  • Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.
  • The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).
  • Two dominant effects characterize nasalization:
      • Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
      • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function), due to the absorption of energy at the resonances of the nasal passage.
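The effect of an anti-resonance on the transfer function can be illustrated with a small pole-zero sketch. This is only an illustration under assumed values: one formant (pole pair) at 500 Hz, one anti-resonance (zero pair) at 1200 Hz, an 8 kHz sampling rate, and hypothetical helper names; none of these numbers come from the slides.

```python
import cmath
import math

FS = 8000.0  # sampling rate in Hz (assumed)

def second_order(f0, bw):
    """Coefficients (c1, c2) of 1 - c1*z^-1 - c2*z^-2 whose roots sit at
    frequency f0 (Hz) with bandwidth bw (Hz)."""
    r = math.exp(-math.pi * bw / FS)
    theta = 2.0 * math.pi * f0 / FS
    return 2.0 * r * math.cos(theta), -r * r

def magnitude(f, pole, zero):
    """|H| at frequency f for H(z) = zero polynomial / pole polynomial."""
    z1 = cmath.exp(-2j * math.pi * f / FS)      # z^-1 on the unit circle
    num = 1.0 - zero[0] * z1 - zero[1] * z1 * z1
    den = 1.0 - pole[0] * z1 - pole[1] * z1 * z1
    return abs(num / den)

formant = second_order(500.0, 100.0)           # oral-tract resonance (illustrative)
anti_resonance = second_order(1200.0, 150.0)   # zero from nasal coupling (illustrative)

# The response peaks at the formant and dips at the anti-resonance.
for f in (500.0, 900.0, 1200.0, 1500.0):
    print(f, round(magnitude(f, formant, anti_resonance), 3))
```

The magnitude at 1200 Hz falls below its neighbors at 900 Hz and 1500 Hz, which is the spectral "hole" that a zero introduces.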

Plosives

Source Generation

  • The previous section discussed the effect of vocal tract shape on sound production.
  • Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
      • build-up of pressure behind the palate, and
      • abrupt release of pressure.

Source Generation: Plosives - "Drop"

[Figure labels: VOT, Aspiration]

Fricatives

Source Generation

  • Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded). The constriction generates turbulence and thus a noise source (e.g., fricatives).
  • As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
      • There is no harmonic structure with these types of inputs.
      • The source spectrum is shaped at all frequencies by $|H(\omega)|$.
  • Note that the noise spectrum was idealized as flat. In reality, these sources will themselves have a non-flat spectral shape.
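Spectral shaping of a noise source can be sketched as white noise passed through a single resonance. The two-pole resonator, its 2500 Hz center frequency and 200 Hz bandwidth, the 8 kHz sampling rate, and the function names are all assumptions made for illustration; a real front cavity would have several resonances.

```python
import cmath
import math
import random

FS = 8000.0  # sampling rate in Hz (assumed)

def resonator(f0, bw):
    """Feedback coefficients (a1, a2) of y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bw / FS)
    return 2.0 * r * math.cos(2.0 * math.pi * f0 / FS), -r * r

def response_magnitude(f, a1, a2):
    """|H(omega)| of the resonator at frequency f in Hz."""
    z1 = cmath.exp(-2j * math.pi * f / FS)
    return abs(1.0 / (1.0 - a1 * z1 - a2 * z1 * z1))

def apply_filter(x, a1, a2):
    """Run the two-pole recursion over the input samples."""
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = sample + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

a1, a2 = resonator(2500.0, 200.0)                     # front-cavity resonance (illustrative)
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]    # idealized flat-spectrum source
frication = apply_filter(noise, a1, a2)               # noise shaped by the resonance

# For a flat source, the expected output power spectrum is proportional to |H|^2.
shaping = {f: response_magnitude(f, a1, a2) ** 2 for f in (500.0, 2500.0, 3500.0)}
```

The output has no harmonic lines, only a broad concentration of energy around the resonance, which is the behavior the slide describes for noise and impulse inputs.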

Source Generation: Fricatives - "NASA"

Source Generation

  • There is another class of source generated within the vocal tract; however, it is less understood than noisy and impulsive sources at oral tract constrictions.
      • This source arises from the interaction of vortices with vocal tract boundaries, such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
      • Fricatives - sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
      • Plosives - created with an impulsive source within the oral tract. Example: "top".
      • Whispers - a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
  • However, these subcategories do not relate exclusively to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples:
      • "zebra" vs. "sheba" - fricatives
      • "bin" vs. "pin" - plosives

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

  • The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
      • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

  • Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
  • The window $w[n]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example - Hamming window at position $\tau$:

    $w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$

  • The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
  • The short-time Fourier transform is

    $X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$

    where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window position $\tau$.
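The windowing and transform steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production STFT: it evaluates the transform by direct summation on a uniform frequency grid, takes $\tau$ as the first sample under the window, and uses hypothetical names (`hamming`, `stft_frame`, `spectrogram`) and a test sinusoid chosen only for the example.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w (n_w >= 2), tapered toward both ends."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft_frame(x, tau, window, n_freq):
    """X(omega_k, tau) at omega_k = 2*pi*k/n_freq for k = 0..n_freq/2,
    with the window covering samples tau .. tau + len(window) - 1."""
    frame = []
    for k in range(n_freq // 2 + 1):
        omega = 2.0 * math.pi * k / n_freq
        acc = 0j
        for i, w in enumerate(window):
            n = tau + i
            if 0 <= n < len(x):
                acc += w * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame

def spectrogram(x, window, hop, n_freq):
    """S(omega_k, tau) = |X(omega_k, tau)|^2 for tau = 0, hop, 2*hop, ..."""
    frames = []
    for tau in range(0, len(x) - len(window) + 1, hop):
        frames.append([abs(v) ** 2 for v in stft_frame(x, tau, window, n_freq)])
    return frames

# Illustrative input: a sinusoid that falls exactly on bin 8 of a 64-point grid.
x = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
S = spectrogram(x, hamming(64), hop=32, n_freq=64)
peak_bins = [frame.index(max(frame)) for frame in S]
```

The `hop` argument plays the role of the frame interval: a hop of 32 samples with a 64-sample window gives 50% overlap between successive frames.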

Spectrographic Analysis of Speech

  • The spectrogram is graphically displayed as

    $S(\omega, \tau) = |X(\omega, \tau)|^2$

  • $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
  • For each window position $\tau$, one could plot $S(\omega, \tau)$ against frequency.
  • A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
      • Narrowband - gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
      • Wideband - gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

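Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:

$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$

where $\hat{H}(\omega) = H(\omega)\, G(\omega)$, $W(\omega, \tau)$ is the Fourier transform of the window $w[n, \tau]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.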
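The step from the windowed waveform to the harmonic sum can be written out as follows. This is a sketch of the standard argument (multiplication in time becomes periodic convolution in frequency), offered as a reading aid rather than as the derivation given in Example 3.2; the sums are taken over the harmonics $\omega_k$ that fall within one $2\pi$ period.

```latex
\begin{align*}
x[n,\tau] &= w[n,\tau]\,\bigl(p[n] * \hat{h}[n]\bigr),
  \qquad p[n] = \sum_{k} \delta[n - kP] \\
p[n] * \hat{h}[n] \;&\longleftrightarrow\;
  \hat{H}(\omega)\,\frac{2\pi}{P}\sum_{k}\delta(\omega - \omega_k)
  = \frac{2\pi}{P}\sum_{k}\hat{H}(\omega_k)\,\delta(\omega - \omega_k),
  \qquad \omega_k = \frac{2\pi}{P}k \\
X(\omega,\tau) &= \frac{1}{2\pi}\int_{-\pi}^{\pi} W(\omega - \theta,\tau)\,
  \frac{2\pi}{P}\sum_{k}\hat{H}(\omega_k)\,\delta(\theta - \omega_k)\,d\theta
  = \frac{1}{P}\sum_{k}\hat{H}(\omega_k)\,W(\omega - \omega_k,\tau) \\
S(\omega,\tau) &= |X(\omega,\tau)|^2
  = \frac{1}{P^2}\Bigl|\sum_{k}\hat{H}(\omega_k)\,W(\omega - \omega_k,\tau)\Bigr|^2
\end{align*}
```

Each harmonic thus contributes a copy of the window transform, centered at $\omega_k$ and weighted by the combined glottal and vocal tract response at that frequency.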

Spectrographic Analysis of Speech

  • The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n]$.
  • Narrowband spectrogram:
      • Uses a "long" window with a duration of typically at least two pitch periods.
      • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the equation in the previous slide leads to the following approximation (Exercise 3.8):

        $S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
      • Shortening the window widens its Fourier transform (recall the uncertainty principle).
      • The wider window transforms centered on neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$S(\omega, \tau) \approx \kappa\, |\hat{H}(\omega)|^2\, E[\tau]$

where $\kappa$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$E[\tau] = \sum_{n} |x[n, \tau]|^2$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
      • Vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
  • Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
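The contrast between the two window lengths can be checked on an idealized voiced excitation. The sketch below assumes an 8 kHz sampling rate and a 100 Hz pitch (neither is stated in the slides), replaces speech with a bare impulse train, and uses hypothetical names; the 20-ms and 4-ms window lengths are the ones quoted for Figure 3.15.

```python
import cmath
import math

FS = 8000   # sampling rate in Hz (assumed)
P = 80      # pitch period in samples, i.e. a 100 Hz pitch (assumed)

def hamming(n_w):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def magnitude(x, start, n_w, freq_hz):
    """|X(omega, tau)| for a Hamming window of n_w samples starting at `start`."""
    w = hamming(n_w)
    omega = 2.0 * math.pi * freq_hz / FS
    return abs(sum(w[i] * x[start + i] * cmath.exp(-1j * omega * (start + i))
                   for i in range(n_w)))

x = [1.0 if n % P == 0 else 0.0 for n in range(800)]   # idealized voiced excitation

NARROW = 160   # 20 ms: two pitch periods
WIDE = 32      # 4 ms: less than one pitch period

# Narrowband: a harmonic (200 Hz) stands out against the gap between harmonics (250 Hz).
nb_on_harmonic = magnitude(x, 200, NARROW, 200.0)
nb_between = magnitude(x, 200, NARROW, 250.0)

# Wideband: no harmonic structure, but the level depends on whether the
# window currently covers a glottal pulse.
wb_pulse_200 = magnitude(x, 230, WIDE, 200.0)
wb_pulse_250 = magnitude(x, 230, WIDE, 250.0)
wb_no_pulse = magnitude(x, 260, WIDE, 200.0)
```

With the long window the spectrum differs sharply between 200 Hz and 250 Hz (horizontal striations), while with the short window the spectrum is the same at both frequencies but vanishes when the window sits between pulses (vertical striations).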

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram; frequency (0-8000 Hz) versus time (0-3 s)]

MATLAB

[Figure: normalized-magnitude waveform and spectrogram (0-8000 Hz, 0-3 s) of the utterance "She had your dark suit in greasy wash water all year", with phone and word labels: h# sh iy (she), hv ae dcl jh (had), ax-h (your), dcl d aa r kcl k (dark), s ux tcl (suit), ax-h n (in), gcl g r iy s ix (greasy), w aa sh (wash), w ao dx axr (water), ao l (all), y ih axr (year), h#]

MATLAB

[Figure: spectrogram; frequency (0-8000 Hz) versus time (0-3 s)]

MATLAB

[Figure: normalized-magnitude waveform and spectrogram (0-8000 Hz, 0-3 s) of the utterance "She had your dark suit in greasy wash water all year", with phone and word labels]

Categorization of Speech Sounds

  • A sound source can be created with either the vocal folds or a constriction in the vocal tract.
  • Speech sounds can also be classified from the following perspectives:
      1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
      2. The shape of the vocal tract (place and manner of articulation):
           • the place of the tongue hump along the oral tract,
           • the degree of constriction of the hump, and
           • a possible connection to the nasal passage by way of the velum.
      3. The time-domain waveform, which gives the pressure change with time at the lips output.
      4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

  • Phoneme - a fundamental distinctive unit of a language.
      • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets:
      • Syllables contain one or more phonemes.
      • Words are formed from one or more syllables.
      • Phrases are concatenations of words.
  • If the first two classification perspectives (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
  • If the last two (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

  • One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
  • This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

  • Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
  • Articulatory features (corresponding to the first two category descriptors) include:
      • Vocal fold state: vibrating or open.
      • Tongue position and height: front, central, or back along the palate.
      • Constriction: partial or complete.
      • Velum state: nasal sound or not a nasal sound.

Elements of a Language

  • In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
      • 11 in Polynesian
      • 141 in the "click" language of Khoisan
  • The rules of a language define which phones can be strung together, and how, to form words.
      • In Italian, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
  • The articulatory properties are influenced by:
      • adjacent phonemes,
      • speaking rate,
      • emphasis in speaking, and
      • the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
      • Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
  • Motor theory of perception - uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

  • Source: quasi-periodic.
      • Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
  • In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
      • Articulatory differences between speakers are one cause of allophonic variations.
      • The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System:
      • The velum is lowered and the air flows mainly through the nasal cavity.
      • Because the oral tract is constricted, the sound is radiated at the nostrils.
      • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      • Is dominated by the low resonance of the large volume of the nasal cavity.
      • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
      • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

  • There are two broad classes of fricatives: voiced and unvoiced.
  • Source:
      • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
      • The constriction is narrower than with vowels.
  • System: the location of the constriction by the tongue or lips determines which sound is produced:
      • back, center, or front of the oral tract, as well as
      • the teeth or lips.
  • Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and $*$ denotes convolution.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$x_g[n] = h[n] * (g[n] * p[n])$

where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\,u[n])$

Example 3.4 (cont.)

In the simplified model, we assume that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  • u[n] is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
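The simplified voiced-fricative model of Example 3.4 can be sketched directly from its equations. Everything numerical here is an assumption made for illustration: the raised-cosine glottal pulse, the decaying-cosine impulse responses, the 8 kHz sampling rate, the pitch period, and the helper names are placeholders, not values from the example.

```python
import math
import random

FS = 8000.0   # sampling rate in Hz (assumed)
P = 80        # pitch period in samples (assumed)
N = 400       # number of samples to synthesize (assumed)

def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        if av == 0.0:
            continue
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def decaying_cosine(f0, decay, length):
    """Placeholder impulse response with a single resonance near f0."""
    return [(decay ** n) * math.cos(2.0 * math.pi * f0 * n / FS) for n in range(length)]

p = [1.0 if n % P == 0 else 0.0 for n in range(N)]                     # impulse train
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 40)) for n in range(40)]  # one glottal cycle
u = convolve(g, p)[:N]                                                 # u[n] = g[n] * p[n]

h = decaying_cosine(500.0, 0.97, 120)      # vocal tract (illustrative)
h_f = decaying_cosine(3000.0, 0.80, 40)    # front cavity (illustrative)

rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(N)]        # white-noise turbulence source
modulated = [qi * ui for qi, ui in zip(q, u)]      # glottal flow modulates the noise

x_g = convolve(h, u)[:N]                   # periodic component
x_q = convolve(h_f, modulated)[:N]         # noise component
x = [a + b for a, b in zip(x_g, x_q)]      # the two outputs are assumed to add
```

Because the noise is multiplied by u[n], it is present only while the glottis is open in this sketch, so the noise bursts recur at the pitch rate.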

  • Figure 330 ndash Global Coarticulation
Page 42: Speech Processing

April 22 2023 Veton Keumlpuska 42

Spectral Shaping Mouse

April 22 2023 Veton Keumlpuska 43

Spectral Shaping

Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel).

There are two dominant effects that characterize nasalization:
  • Broadening of the formant bandwidths of the oral tract, because of loss of energy through the nasal passage.
  • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

April 22 2023 Veton Keumlpuska 45

Source Generation

The previous section discussed the effect of vocal tract shape on sound production.

Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
  • Build-up of pressure behind the palate, and
  • Abrupt release of pressure.

April 22 2023 Veton Keumlpuska 46

Source Generation: Plosives, "Drop"

[Figure: the word "drop", annotated with VOT (voice onset time) and aspiration]

Fricatives

April 22 2023 Veton Keumlpuska 48

Source Generation

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source):
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by |H(ω)|.

Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.

April 22 2023 Veton Keumlpuska 49

Source Generation: Fricatives, "NASA"

April 22 2023 Veton Keumlpuska 50

Source Generation

There is another class of source that is generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.

There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  • Plosives: created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these subcategories are not exclusively unvoiced. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  • "zebra" vs. "sheba": fricatives
  • "bin" vs. "pin": plosives

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech

Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
  • This window is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example: the Hamming window

$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n - \tau \le N_w - 1$$

  • The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] \, e^{-j\omega n}$$

where x[n, τ] = w[n, τ] x[n] represents the windowed speech segments as a function of the window position at time τ.
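The windowing and transform steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the slides' own code: the function names, the test tone, the hop size, and the number of frequency bins are all assumptions chosen for the example, and the DFT is evaluated directly rather than with an FFT.

```python
import cmath
import math


def hamming(num_samples):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(Nw - 1)) for 0 <= n <= Nw - 1."""
    if num_samples < 2:
        return [1.0] * num_samples
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def stft(signal, window, hop, num_bins):
    """One DFT per window position; frame i starts at sample i * hop.

    Time is measured from the start of each frame, which changes only the
    phase (not the magnitude) relative to summing over absolute time n.
    """
    frames = []
    for start in range(0, len(signal) - len(window) + 1, hop):
        segment = [w * signal[start + n] for n, w in enumerate(window)]
        frames.append([
            sum(s * cmath.exp(-2j * math.pi * k * n / num_bins)
                for n, s in enumerate(segment))
            for k in range(num_bins)
        ])
    return frames


# Hypothetical test signal: a cosine centred on bin 4 of a 32-point transform.
tone = [math.cos(2.0 * math.pi * 4 * n / 32) for n in range(64)]
frames = stft(tone, hamming(32), hop=16, num_bins=32)
# Squared magnitude of each frame gives one column of the spectrogram.
spectrogram = [[abs(value) ** 2 for value in frame] for frame in frames]
```

Each inner list of `spectrogram` corresponds to one window position, so stacking the lists side by side gives the time-frequency display discussed in these slides.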

April 22 2023 Veton Keumlpuska 56

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

S(ω, τ) is a 2-D (two-dimensional) representation of the "energy density" of the signal.
  • For each window position τ one could plot S(ω, τ).
  • A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  • Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  • Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over one cycle and the vocal tract impulse response were combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega) G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
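As a rough illustration of the voiced-speech model on this slide, the sketch below builds p[n], g[n], and h[n] and convolves them. The pulse shape, the single decaying resonance, and the pitch period are hypothetical values picked only to make the example concrete; they are not taken from the slides.

```python
import math


def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


PITCH_PERIOD = 50  # P in samples (assumed value)

# p[n]: impulse train with one unit sample every P samples.
p = [1.0 if n % PITCH_PERIOD == 0 else 0.0 for n in range(300)]
# g[n]: hypothetical glottal pulse (one raised-cosine bump).
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 20)) for n in range(20)]
# h[n]: hypothetical vocal tract response (one decaying resonance).
h = [(0.9 ** n) * math.cos(0.3 * math.pi * n) for n in range(40)]

h_hat = convolve(g, h)   # combined response g[n] * h[n]
x = convolve(p, h_hat)   # voiced waveform before windowing
```

Away from the edges of the sequence, `x` repeats every `PITCH_PERIOD` samples, which is the periodicity that produces harmonic lines at multiples of 2π/P.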

April 22 2023 Veton Keumlpuska 61

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrogram is in the length of the (analysis) window w[n].

Narrowband Spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the following approximation follows from the equation in the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2 \, |W(\omega - \omega_k, \tau)|^2$$
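A brief sketch of why the cross terms drop out under the two stated conditions. The shorthand a_k for the harmonic amplitude at ω_k and W_k(ω) for the window transform shifted to ω_k is introduced here only for illustration:

```latex
\begin{aligned}
\Big|\sum_k a_k W_k(\omega)\Big|^2
  &= \sum_k \sum_l a_k a_l^{*}\, W_k(\omega)\, W_l^{*}(\omega) \\
  &= \sum_k |a_k|^2 |W_k(\omega)|^2
     + \sum_k \sum_{l \neq k} a_k a_l^{*}\, W_k(\omega)\, W_l^{*}(\omega) \\
  &\approx \sum_k |a_k|^2 |W_k(\omega)|^2 .
\end{aligned}
```

If the main lobes do not overlap and the side-lobes are negligible, then at any frequency ω at most one of the shifted transforms is appreciably non-zero, so every product with l ≠ k is approximately zero and only the squared-magnitude terms survive.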

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened Fourier transform causes neighboring harmonics to overlap and the neighboring window transforms to add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.

From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].
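The trade-off between window length and main-lobe width can be made concrete with a small calculation. The sketch below assumes the common approximation that a Hamming window of N_w samples has a null-to-null main-lobe width of about 8π/N_w radians per sample, i.e. about 4/T Hz for a window lasting T seconds. The function names and the 100 Hz pitch are illustrative assumptions.

```python
def hamming_main_lobe_hz(window_ms):
    """Approximate null-to-null main-lobe width of a Hamming window, in Hz.

    Uses the approximation 8*pi/Nw rad/sample, which equals 4/T Hz for a
    window of duration T seconds (4000/T when T is given in milliseconds).
    """
    return 4000.0 / window_ms


def harmonics_under_main_lobe(window_ms, pitch_hz):
    """Roughly how many harmonic spacings fit inside one main lobe."""
    return hamming_main_lobe_hz(window_ms) / pitch_hz


PITCH_HZ = 100.0  # assumed fundamental frequency (10 ms pitch period)
for window_ms in (40.0, 20.0, 4.0):
    print(window_ms, "ms window:",
          hamming_main_lobe_hz(window_ms), "Hz main lobe,",
          harmonics_under_main_lobe(window_ms, PITCH_HZ), "harmonic spacings")
```

Under this approximation a 4-ms window spreads each harmonic across many neighbors, which is the smearing described above. A window of exactly two pitch periods still leaves adjacent main lobes partly overlapping, so the strict non-overlap condition used for the narrowband approximation calls for a somewhat longer window.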

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta \, |\hat{H}(\omega)|^2 \, E[\tau]$$

where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
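The contrast between the two window lengths can be checked numerically on an idealized pulse train. This is a minimal sketch under stated assumptions: an 8 kHz sampling rate, a 200 Hz pitch, and a bare impulse train in place of real speech, with helper names invented for the example.

```python
import cmath
import math


def hamming(num_samples):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def frame_magnitude(signal, start, window, omega):
    """|X(omega, tau)| for a window whose first sample sits at `start`."""
    return abs(sum(w * signal[start + n] * cmath.exp(-1j * omega * n)
                   for n, w in enumerate(window)))


def frame_energy(signal, start, window):
    """E[tau]: energy of the windowed segment."""
    return sum((w * signal[start + n]) ** 2 for n, w in enumerate(window))


SAMPLE_RATE = 8000   # Hz (assumed)
PITCH_PERIOD = 40    # samples, a 200 Hz pitch (assumed)
pulses = [1.0 if n % PITCH_PERIOD == 0 else 0.0 for n in range(400)]

narrow = hamming(SAMPLE_RATE * 20 // 1000)   # 20 ms -> 160 samples
wide = hamming(SAMPLE_RATE * 4 // 1000)      # 4 ms  -> 32 samples

on_harmonic = 2.0 * math.pi * 3.0 / PITCH_PERIOD    # third harmonic
off_harmonic = 2.0 * math.pi * 3.5 / PITCH_PERIOD   # halfway to the fourth

# Narrowband: harmonics stand out against the gaps between them.
narrow_contrast = (frame_magnitude(pulses, 0, narrow, on_harmonic)
                   / frame_magnitude(pulses, 0, narrow, off_harmonic))
# Wideband: the same two frequencies look alike (no harmonic lines) ...
wide_contrast = (frame_magnitude(pulses, 24, wide, on_harmonic)
                 / frame_magnitude(pulses, 24, wide, off_harmonic))
# ... but the frame energy rises and falls as the window slides over pulses.
energy_on_pulse = frame_energy(pulses, 24, wide)
energy_between = frame_energy(pulses, 4, wide)
```

A large `narrow_contrast` corresponds to the horizontal striations of the narrowband display, while the swing between `energy_on_pulse` and `energy_between` corresponds to the vertical striations of the wideband display.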

April 22 2023 Veton Keumlpuska 66

Figure 3.15

April 22 2023 Veton Keumlpuska 67

Figure 3.16

MATLAB

April 22 2023 Veton Keumlpuska 68

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

April 22 2023 Veton Keumlpuska 69

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) for the utterance "She had your dark suit in greasy wash water all year", labeled with words and phones.]

MATLAB

April 22 2023 Veton Keumlpuska 70

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

April 22 2023 Veton Keumlpuska 71

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) for the utterance "She had your dark suit in greasy wash water all year", labeled with words and phones.]

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Classification of speech sounds can also be done from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
  • The place of the tongue hump along the oral tract, and
  • The degree of the constriction of the hump.
  • The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips' output.
4. The time-varying spectral characteristics revealed through the spectrogram.

April 22 2023 Veton Keumlpuska 73

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two of the four perspectives on the previous slide (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

April 22 2023 Veton Keumlpuska 74

Elements of a Language

One broad classification for the English language is in terms of:
  • Vowels
  • Consonants
  • Diphthongs
  • Affricates
  • Semi-vowels

This classification is illustrated in Figure 3.17 in the next slide.

April 22 2023 Veton Keumlpuska 75

Figure 3.17

April 22 2023 Veton Keumlpuska 76

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the "click" language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words.
  • In Italian, consonants are generally not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by:
  • Adjacent phonemes
  • Speaking rate
  • Emphasis in speaking
  • The time-varying nature of the articulators

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

April 22 2023 Veton Keumlpuska 78

Elements of a Language: Vowels

Source
  • Quasi-periodic.
  • Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.

System
  • Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram
  • The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform
  • Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  • Articulatory differences among speakers are one cause of allophonic variations.
  • The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ between speakers; therefore the vocal tract formants will vary with the speaker.

April 22 2023 Veton Keumlpuska 79

Figure 3.18

April 22 2023 Veton Keumlpuska 80

Figure 3.19

April 22 2023 Veton Keumlpuska 81

Elements of a Language: Nasals

Source
  • Quasi-periodic airflow puffs from the vibrating vocal folds.

System
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram
  • Is dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

April 22 2023 Veton Keumlpuska 82

Figure 3.20

April 22 2023 Veton Keumlpuska 83

Figure 3.21

April 22 2023 Veton Keumlpuska 84

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source
  • Vocal folds are relaxed and not vibrating for unvoiced fricatives.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  • The constriction is narrower than with vowels.

System
  • The location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram
  • Noise-like.
  • Energy is concentrated in the higher frequencies.

April 22 2023 Veton Keumlpuska 85

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Voiced fricative: simplified model of the output at the lips due to the periodic source:

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity, which has impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

April 22 2023 Veton Keumlpuska 86

Example 3.4 (cont.)

We assume in the simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  • u[n] is modified by the oral cavity.
  • x_q[n] can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
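The simplified voiced-fricative model of Example 3.4 can be sketched directly from its equations. All signal shapes below (glottal pulse, the two impulse responses, pitch period, and sequence length) are hypothetical stand-ins chosen for illustration, and Gaussian samples are used for the white noise q[n].

```python
import math
import random


def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


random.seed(0)        # fixed seed so the sketch is repeatable
PITCH_PERIOD = 50     # P in samples (assumed)
NUM_SAMPLES = 200

p = [1.0 if n % PITCH_PERIOD == 0 else 0.0 for n in range(NUM_SAMPLES)]
# Hypothetical glottal pulse, vocal tract response, and front-cavity response.
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 20)) for n in range(20)]
h = [(0.9 ** n) * math.cos(0.2 * math.pi * n) for n in range(30)]
h_f = [(0.6 ** n) * math.cos(0.8 * math.pi * n) for n in range(30)]

u = convolve(g, p)[:NUM_SAMPLES]                           # u[n] = g[n] * p[n]
q = [random.gauss(0.0, 1.0) for _ in range(NUM_SAMPLES)]   # white noise q[n]
modulated = [q_n * u_n for q_n, u_n in zip(q, u)]          # q[n] u[n]

x_g = convolve(h, u)             # periodic component h[n] * u[n]
x_q = convolve(h_f, modulated)   # noise component h_f[n] * (q[n] u[n])
x = [a + b for a, b in zip(x_g, x_q)]
```

Because the noise is multiplied by u[n], it is switched off whenever the glottal flow is zero, so the noise component arrives in bursts at the pitch rate rather than as a steady hiss.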

April 22 2023 Veton Keumlpuska 87

Elements of a Language: Fricatives

Spectrogram
  • Unvoiced fricatives are characterized by a "noisy" spectrum, while
  • Voiced fricatives often show both noise and harmonics.

Waveform
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper
  • Forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 3.24 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 3.23

April 22 2023 Veton Keumlpuska 91

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System
  • The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

April 22 2023 Veton Keumlpuska 92

Elements of a Language Plosives Waveform

April 22 2023 Veton Keumlpuska 93

Elements of a Language Plosives Spectrogram

April 22 2023 Veton Keumlpuska 94

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
  • This implies that the vocal tract output cannot be obtained by the convolution operator.
  • The vocal tract output must thus be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond the constriction, denoted by h_f[n, m].
  • h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\,u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\,\delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
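The generalized convolution in Example 3.5 can be sketched by passing the time-varying impulse response as a function of (n, m). The two response functions and the short test sequences below are hypothetical; they only show how the sum is evaluated and how it reduces to ordinary convolution when the response does not depend on n.

```python
def time_varying_output(source, impulse_response, max_lag):
    """x[n] = sum over m of h[n, m] * source[n - m], for 0 <= m <= max_lag.

    `impulse_response(n, m)` returns h[n, m]: the response at time n to a
    unit sample applied m samples earlier.
    """
    output = []
    for n in range(len(source)):
        total = 0.0
        for m in range(min(n, max_lag) + 1):
            total += impulse_response(n, m) * source[n - m]
        output.append(total)
    return output


def drifting_decay(n, m):
    """Hypothetical h[n, m]: a decay whose rate drifts as time n advances."""
    pole = min(0.5 + 0.01 * n, 0.9)
    return pole ** m


def fixed_decay(n, m):
    """Time-invariant special case: h[n, m] does not depend on n."""
    return 0.5 ** m


burst = [1.0] + [0.0] * 9                  # delta[n], burst at n = 0
periodic = [1.0, 0.0, 0.0, 0.0, 0.0] * 2   # stand-in for u[n], period 5 (assumed)

# Linear combination of the two outputs, as assumed in the example.
x = [a + b for a, b in zip(time_varying_output(periodic, drifting_decay, 9),
                           time_varying_output(burst, fixed_decay, 9))]
```

Here `drifting_decay` plays the role of h[n, m] and `fixed_decay` stands in for h_f[n, m]; in a fuller model both would change with n as the tract moves from the burst toward the vowel.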

fm

April 22 2023 Veton Keumlpuska 95

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.

Examples:
  • Y as in "hide"
  • W as in "out"
  • O as in "boy"
  • JU as in "new"

April 22 2023 Veton Keumlpuska 96

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds:
  • Glides (/w/ as in "we" and /y/ as in "you"), and
  • Liquids (/r/ as in "read" and /l/ as in "let").

Glides, compared to diphthongs, have:
  • Greater constriction of the oral tract during the transition, and
  • Greater speed of the oral tract movement.

April 22 2023 Veton Keumlpuska 97

Figure 3.28 - Liquids & Glides

April 22 2023 Veton Keumlpuska 98

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference compared to fricatives is that affricates have:
  • A fricative portion preceded by a complete constriction of the oral cavity,
  • Formed at the same place as for the plosive.

Examples:
  • tS as in "chew" (unvoiced)
  • J as in "just" (voiced)

April 22 2023 Veton Keumlpuska 99

Coarticulation

Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, and so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  • Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

April 22 2023 Veton Keumlpuska 100

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

April 22 2023 Veton Keumlpuska 101

Figure 3.29 - Prosody

April 22 2023 Veton Keumlpuska 102

Figure 3.30 - Global Coarticulation

Page 43: Speech Processing

April 22 2023 Veton Keumlpuska 43

Spectral Shaping Because the nasal cavity (unlike the oral tract) is

essentially constant characteristics of nasal sounds may be particularly useful in speaker identification

Velum can be lowered even when the vocal tract is open When this coupling occurs the resulting sound is said to be

nasalized (eg nasalized vowel) There are two dominant effects that characterize

nasalization Broadening of the formant bandwidth of oral tract because

of loss of energy through nasal passage Introduction of anti-resonances (ie zeros in the vocal tract

transfer function) due to the absorption of energy at the resonances of the nasal passage

Plosives

April 22 2023 Veton Keumlpuska 45

Source Generation In previous section the effect of vocal tract

shape in the sound production was discussed

In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure

April 22 2023 Veton Keumlpuska 46

Source Generation Plosives ldquoDroprdquo

VOT

Aspiration

Fricatives

April 22 2023 Veton Keumlpuska 48

Source Generation Another sound source is created when the tongue is

very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)

As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of

inputs The source spectrum is shaped at all frequencies by |H()|

Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape

April 22 2023 Veton Keumlpuska 49

Source Generation Fricatives ldquoNASArdquo

April 22 2023 Veton Keumlpuska 50

Source Generation There is another class of the source type that is

generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices

with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract

Vortex can be thought off as a tiny rotational airflow in the oral tract

There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal

source Unvoiced Speech sounds not generated with periodic

glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the

moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral

tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing

the vocal folds but without oscillations Example ldquoherdquo

However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example

ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.

Example: the word “to”. A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content.

In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.

This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: the Hamming window

$$w[n, \tau] = 0.54 - 0.46\cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$

and zero otherwise.

The window typically does not move one sample at a time, but rather at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window position at time $\tau$.
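The Hamming window and the short-time Fourier transform defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slides: the function names, the cosine test signal, the 64-sample window, the 32-sample frame interval, and the 64 evaluation frequencies are all assumptions chosen for the example, and the window position τ is taken to be the window's first sample.

```python
import cmath
import math


def hamming(num_samples):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(Nw - 1)) for n = 0..Nw-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def stft(signal, window, frame_interval, num_freqs):
    """Short-time Fourier transform on a uniform frequency grid.

    Returns one spectrum per window position tau = 0, frame_interval, ...;
    entry k of a spectrum is X(omega_k, tau) with omega_k = 2*pi*k/num_freqs.
    """
    spectra = []
    last_start = len(signal) - len(window)
    for tau in range(0, last_start + 1, frame_interval):
        # Windowed segment x[n, tau] = w[n, tau] * x[n], nonzero for n = tau..tau+Nw-1.
        segment = [w * signal[tau + i] for i, w in enumerate(window)]
        spectrum = []
        for k in range(num_freqs):
            omega = 2.0 * math.pi * k / num_freqs
            # The exponent uses absolute time n = tau + i, as in the definition.
            spectrum.append(sum(x * cmath.exp(-1j * omega * (tau + i))
                                for i, x in enumerate(segment)))
        spectra.append(spectrum)
    return spectra


# Hypothetical test signal: a cosine at 8 cycles per 64 samples.
signal = [math.cos(2.0 * math.pi * 8 * n / 64) for n in range(256)]
spectra = stft(signal, hamming(64), frame_interval=32, num_freqs=64)

# Location of the spectral peak (bins 0..32) in every frame.
peak_bins = [max(range(33), key=lambda k: abs(frame[k])) for frame in spectra]
print(len(spectra), peak_bins)
```

Because the test signal is stationary, every frame peaks at the same frequency bin; for speech, the peak locations would change from frame to frame, which is what the spectrogram is meant to reveal.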

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a two-dimensional (2-D) representation of the “energy density” of the signal.

For each window position $\tau$, one could plot $S(\omega, \tau)$. A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:

Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.

Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$$

$$x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau)\right|^2$$

where $\hat{H}(\omega) = H(\omega)\,G(\omega)$ and $\omega_k = \frac{2\pi}{P}k$, and where $\frac{2\pi}{P}$ is the fundamental frequency.
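The harmonic-sum expression for the spectrogram of voiced speech can be checked numerically. The sketch below is only an illustration under stated assumptions: the combined response ĥ[n] = 0.7^n truncated to 20 samples, the pitch period P = 8, and the 32-sample Hamming window placed at τ = 0 are all hypothetical, and the sum over k is taken over one period, k = 0..P-1, because discrete-time spectra repeat every 2π.

```python
import cmath
import math

PITCH_PERIOD = 8        # P, in samples (assumed)
WINDOW_LENGTH = 32      # Nw, four pitch periods (assumed)

# Hypothetical combined glottal/vocal-tract response h_hat[n].
h_hat = [0.7 ** n for n in range(20)]
window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (WINDOW_LENGTH - 1))
          for n in range(WINDOW_LENGTH)]


def dtft(sequence, omega):
    """Fourier transform of a finite sequence that starts at n = 0."""
    return sum(x * cmath.exp(-1j * omega * n) for n, x in enumerate(sequence))


# Steady-state voiced waveform p[n] * h_hat[n]: copies of h_hat every P samples.
# Sample n collects every h_hat[m] with m congruent to n modulo P.
voiced = [sum(h_hat[m] for m in range(n % PITCH_PERIOD, len(h_hat), PITCH_PERIOD))
          for n in range(WINDOW_LENGTH)]
segment = [w * x for w, x in zip(window, voiced)]


def spectrogram_direct(omega):
    """S = |X|^2 computed directly from the windowed segment."""
    return abs(dtft(segment, omega)) ** 2


def spectrogram_harmonic(omega):
    """S = (1/P^2) |sum_k H_hat(omega_k) W(omega - omega_k)|^2."""
    total = 0.0
    for k in range(PITCH_PERIOD):
        omega_k = 2.0 * math.pi * k / PITCH_PERIOD
        total += dtft(h_hat, omega_k) * dtft(window, omega - omega_k)
    return abs(total) ** 2 / PITCH_PERIOD ** 2


test_frequencies = [0.0, 0.3, 1.0, 2.5]
differences = [abs(spectrogram_direct(w) - spectrogram_harmonic(w))
               for w in test_frequencies]
print(max(differences))
```

The two computations agree to within floating-point rounding, which shows that the windowed periodic waveform is fully described by samples of Ĥ at the harmonic frequencies, each weighted by a shifted copy of the window transform.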

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram

Uses a “long” window with a duration of typically at least two pitch periods.

Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide leads to the following approximation (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive that is closely followed by a voiced sound is poorly represented).
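How a long window resolves harmonics can be seen on a synthetic example. The sketch below is a toy case under stated assumptions, not an analysis of real speech: it uses a bare impulse train (so every harmonic has equal weight), a hypothetical pitch period of 32 samples, and a Hamming window spanning four pitch periods. It compares the spectral magnitude at the harmonic frequencies with the magnitude midway between them.

```python
import cmath
import math

PITCH_PERIOD = 32                 # P, in samples (assumed)
WINDOW_LENGTH = 4 * PITCH_PERIOD  # "long" window: four pitch periods

window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (WINDOW_LENGTH - 1))
          for n in range(WINDOW_LENGTH)]
# Windowed impulse train: one impulse at the start of every pitch period.
segment = [window[n] if n % PITCH_PERIOD == 0 else 0.0
           for n in range(WINDOW_LENGTH)]


def dft_magnitude(sequence, k):
    """|X| at omega = 2*pi*k/len(sequence)."""
    omega = 2.0 * math.pi * k / len(sequence)
    return abs(sum(x * cmath.exp(-1j * omega * n) for n, x in enumerate(sequence)))


magnitudes = [dft_magnitude(segment, k) for k in range(WINDOW_LENGTH // 2 + 1)]

# With Nw = 4P, harmonic k of the pitch falls on frequency bin 4k.
bins_per_harmonic = WINDOW_LENGTH // PITCH_PERIOD
harmonic_bins = range(0, len(magnitudes), bins_per_harmonic)
midway_bins = range(bins_per_harmonic // 2, len(magnitudes), bins_per_harmonic)

weakest_harmonic = min(magnitudes[k] for k in harmonic_bins)
strongest_midpoint = max(magnitudes[k] for k in midway_bins)
print(weakest_harmonic / strongest_midpoint)
```

The magnitude at every harmonic is far larger than anywhere midway between harmonics, which is the line structure that appears as horizontal striations when successive frames are stacked in time.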

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).

Shortening the window widens its Fourier transform (recall the uncertainty principle).

The widened window transforms centered at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
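The energy under a short sliding window can be computed directly to see why it fluctuates once per pitch period. The following is a rough sketch under stated assumptions: the waveform is a hypothetical decaying response 0.8^n that restarts every 32 samples (it has decayed to nearly zero by the end of each period), and the Hamming window is 8 samples long, a quarter of the pitch period.

```python
import math

PITCH_PERIOD = 32      # P, in samples (assumed)
WINDOW_LENGTH = 8      # short window, a quarter of a pitch period (assumed)
NUM_SAMPLES = 4 * PITCH_PERIOD

window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (WINDOW_LENGTH - 1))
          for n in range(WINDOW_LENGTH)]

# Hypothetical steady-state voiced waveform: the decaying response 0.8**n
# restarts at every pitch period.
waveform = [0.8 ** (n % PITCH_PERIOD) for n in range(NUM_SAMPLES)]


def short_time_energy(signal, win, tau):
    """Energy of the windowed segment whose window starts at sample tau."""
    return sum((w * signal[tau + i]) ** 2 for i, w in enumerate(win))


energy = [short_time_energy(waveform, window, tau)
          for tau in range(NUM_SAMPLES - WINDOW_LENGTH + 1)]

# Window position of the energy maximum within each full pitch period.
peak_positions = [max(range(start, start + PITCH_PERIOD), key=lambda t: energy[t])
                  for start in range(0, len(energy) - PITCH_PERIOD + 1, PITCH_PERIOD)]
print(peak_positions)
print(max(energy) / min(energy))
```

The energy maxima recur exactly one pitch period apart, and the energy varies by orders of magnitude between the start and the tail of each period. In a wideband spectrogram this variation scales the whole spectral slice at each window position, which is one way to picture the vertical striations.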

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency.

Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.

The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
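To get a feel for the two window lengths quoted above, the short calculation below estimates the main-lobe width of each Hamming window. It relies on the common approximation that a Hamming window of duration T has a null-to-null main lobe of about 4/T Hz, and it assumes a 200 Hz pitch purely for illustration; neither number comes from the slides.

```python
def hamming_main_lobe_hz(window_ms):
    """Approximate null-to-null main-lobe width of a Hamming window, in Hz.

    Uses the approximation 8*pi/Nw radians per sample, which equals 4/T Hz
    for a window of duration T seconds (4000/T for T in milliseconds).
    """
    return 4000.0 / window_ms


PITCH_HZ = 200.0   # assumed fundamental frequency (pitch period of 5 ms)

for label, window_ms in (("narrowband", 20.0), ("wideband", 4.0)):
    width_hz = hamming_main_lobe_hz(window_ms)
    pitch_periods = window_ms * PITCH_HZ / 1000.0
    harmonics_spanned = width_hz / PITCH_HZ
    print(f"{label}: {window_ms} ms window = {pitch_periods} pitch periods, "
          f"main lobe ~{width_hz} Hz = {harmonics_spanned} harmonic spacings")
```

Under these assumptions, the 20-ms window covers four pitch periods and its main lobe is about one harmonic spacing wide, so neighboring harmonics stay separate. The 4-ms window is shorter than one pitch period and its main lobe spans about five harmonic spacings, so the harmonics blur into the spectral envelope.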

Figure 3.15

Figure 3.16

MATLAB

Figure: spectrogram. Axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000).

MATLAB

Figure: waveform (Normalized Magnitude, -1.5 to 2) and spectrogram (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) for the utterance “She had your dark suit in greasy wash water all year”, annotated with word and phone labels.

Phone labels by word: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr). The utterance begins and ends with the label h.

MATLAB

Figure: spectrogram. Axes: Time [s] (0 to 3) and Frequency [Hz] (0 to 8000).

MATLAB

Figure: waveform (Normalized Magnitude, -1.5 to 2) and spectrogram (Frequency [Hz], 0 to 8000) versus Time [s] (0 to 3) for the utterance “She had your dark suit in greasy wash water all year”, annotated with word and phone labels.

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.

2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.

3. The time-domain waveform, which gives the pressure change with time at the lips output.

4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.

To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets.

Syllables contain one or more phonemes.

Words are formed from one or more syllables.

Phrases are concatenations of words.

If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two perspectives (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features.

Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open.

Tongue position and height: front, central, or back along the palate.

Constriction: partial or complete.

Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the “click” language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: “butter”, “but”, and “to”, where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features, together with the acoustic temporal and spectral features of the speech waveform, to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences between speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram: dominated by the low resonance of the large volume of the nasal cavity.

The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.

Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source: for unvoiced fricatives, the vocal folds are relaxed and not vibrating. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than for vowels.

System: the location of the constriction, made by the tongue or lips, determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram: noise-like, with energy concentrated at higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity, with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4

In the simplified model, we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:

u[n] is modified by the oral cavity.

$x_q[n]$ can be influenced by the back cavity.

Sources of non-linear effects (distributed sources due to traveling vortices).
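The simplified voiced-fricative model of Example 3.4 can be simulated directly. The sketch below is illustrative only: the glottal pulse shape, both impulse responses, the pitch period, and the Gaussian white-noise source are hypothetical choices, since the slides give only the structure of the model, and the ignored effects listed above remain ignored.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


PITCH_PERIOD = 40    # P, in samples (assumed)
NUM_SAMPLES = 5 * PITCH_PERIOD
OPEN_PHASE = 20      # samples per cycle with nonzero glottal flow (assumed)

# Impulse train p[n] and a hypothetical glottal pulse g[n] (sine-squared shape).
p = [1.0 if n % PITCH_PERIOD == 0 else 0.0 for n in range(NUM_SAMPLES)]
g = [math.sin(math.pi * n / OPEN_PHASE) ** 2 for n in range(OPEN_PHASE)]
u = convolve(g, p)[:NUM_SAMPLES]                  # u[n] = g[n] * p[n]

# Hypothetical impulse responses: a low resonance for the whole vocal tract h[n]
# and a shorter, higher-frequency one for the front cavity h_f[n].
h = [0.9 ** n * math.cos(0.3 * n) for n in range(30)]
h_f = [0.6 ** n * math.cos(2.0 * n) for n in range(15)]

rng = random.Random(0)                            # fixed seed for repeatability
q = [rng.gauss(0.0, 1.0) for _ in range(NUM_SAMPLES)]    # white-noise source q[n]
modulated_noise = [q_n * u_n for q_n, u_n in zip(q, u)]  # q[n] u[n]

x_g = convolve(h, u)                              # periodic component
x_q = convolve(h_f, modulated_noise)              # noise component
length = min(len(x_g), len(x_q))
x = [x_g[n] + x_q[n] for n in range(length)]      # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by the glottal flow, the noise source is switched off whenever the glottis is closed in this model, so the noise in x[n] arrives in bursts at the pitch rate rather than continuously.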

Elements of a Language: Fricatives

Spectrogram: unvoiced fricatives are characterized by a “noisy” spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants.

Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, though brief, and is followed by a burst of flow.

As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:

1. Complete closure of the oral tract and buildup of air pressure.

2. Release of air pressure and generation of turbulence over a very short time duration.

3. Generation of aspiration due to turbulence at the open vocal folds.

4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.

After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.

h[n, m] and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\,u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\,\delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
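The generalized convolution of Example 3.5 can be written out as a direct double loop. The sketch below is a toy illustration under stated assumptions: the one-pole responses used for h[n, m] and h_f[n, m], the pole trajectory, and the two-sample glottal pulse are all hypothetical, and the inputs are taken to be zero before n = 0.

```python
def time_varying_output(u, h, h_f):
    """x[n] = sum_m h[n, m] u[n - m] + sum_m h_f[n, m] delta[n - m].

    h and h_f are functions of (n, m) returning the response at time n to a
    unit sample applied m samples earlier. Inputs are zero for n < 0, so the
    sums run over m = 0..n.
    """
    x = []
    for n in range(len(u)):
        periodic_part = sum(h(n, m) * u[n - m] for m in range(n + 1))
        # delta[n - m] is nonzero only for m = n, so the burst term is h_f[n, n].
        burst_part = h_f(n, n)
        x.append(periodic_part + burst_part)
    return x


def h(n, m):
    """Hypothetical vocal tract: one-pole response whose pole glides from 0.5 to 0.9."""
    pole = 0.5 + 0.4 * min(n, 50) / 50.0
    return pole ** m


def h_f(n, m):
    """Hypothetical front-cavity response to the burst (kept fixed for simplicity)."""
    return 0.3 ** m


# Periodic source u[n] = g[n] * p[n] with pitch period 20 and a two-sample pulse g.
u = [1.0 if n % 20 == 0 else 0.5 if n % 20 == 1 else 0.0 for n in range(100)]
x = time_varying_output(u, h, h_f)
print(x[:3])
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is a convenient sanity check for any implementation of the time-varying form.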

Elements of a Language: Transitional Speech Sounds

Diphthongs

Vowel-like in nature, with vibrating vocal folds.

Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.

Characterized by movement from one vowel target to another.

Examples: “hide” (Y), “out” (W), “boy” (O), “new” (JU).

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds.

Glides: w as in “we” and y as in “you”.

Liquids: r as in “read” and l as in “let”.

Compared to diphthongs, glides have greater constriction of the oral tract during the transition and greater speed of oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the consonant counterpart of diphthongs, consisting of plosive-fricative combinations.

The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: tS as in “chew” (unvoiced); J as in “just” (voiced).

Coarticulation

Vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.

Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.

Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur at different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: “horse” vs. “horseshoe”, “sweep” vs. “seep”.

Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:

Intonation (change in pitch)

Amplitude/energy (loudness)

Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 44: Speech Processing

Plosives

April 22 2023 Veton Keumlpuska 45

Source Generation In previous section the effect of vocal tract

shape in the sound production was discussed

In the Figure 310 (b) a complete closure of the tract (the tongue pressing against the palate) is depicted This closure is required when making an impulsive sound (plosives) Build-up of pressure behind the palate and Abrupt release of pressure

April 22 2023 Veton Keumlpuska 46

Source Generation Plosives ldquoDroprdquo

VOT

Aspiration

Fricatives

April 22 2023 Veton Keumlpuska 48

Source Generation Another sound source is created when the tongue is

very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)

As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of

inputs The source spectrum is shaped at all frequencies by |H()|

Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape

April 22 2023 Veton Keumlpuska 49

Source Generation Fricatives ldquoNASArdquo

April 22 2023 Veton Keumlpuska 50

Source Generation There is another class of the source type that is

generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices

with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract

Vortex can be thought off as a tiny rotational airflow in the oral tract

There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal

source Unvoiced Speech sounds not generated with periodic

glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the

moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral

tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing

the vocal folds but without oscillations Example ldquoherdquo

However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example

ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech Speech waveform consists of a sequence of

different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic

signal of the word ldquotordquo cannot capture this time-varying frequency content

In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding

(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to

avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum

Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1

Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal

wherex[n]= w[n]x[n]

represents the windowed speech segments as function of the window center at time

n

njenxX ][)(

April 22 2023 Veton Keumlpuska 56

Spectrographic Analysis of Speech The spectrogram is graphically displayed as

S() = |X()|2

S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal

For each window position one could plot S() A better and more compact representation of time-frequency

display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page

This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms

Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies

Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the

output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]

x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]

Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as

frequency fundametal theis2 and 2 whereand

)()()(~where

)()(~1)(2

2

Pk

P

GHH

WHP

S

k

kk

k

April 22 2023 Veton Keumlpuska 61

Spectrographic Analysis of Speech Difference of narrowband and wideband

spectrogram is in the length of the (analysis) window w[n]

Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at

least two pitch periods Under the conditions that

The main lobes of shifted window Fourier transforms are non-overlapping and that

Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)

k

kk WHP

S 22

2 )()(~1)(

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech Narrowband Spectrogram (cont)

Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram

Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of SpeechWideband Spectrogram

Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform

(recall the uncertainty principle) Widening of Fourier transform will cause neighboring

harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions

From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)

Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window

][)(~)(2

EHS k

n

nxE 2][][

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather

than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding

through fluctuating energy regions of the speech waveform

Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

Figure 3.15

Figure 3.16

MATLAB

[Figures: MATLAB spectrograms (Frequency [Hz], 0 to 8000, versus Time [s], 0 to 3) and the normalized-magnitude waveform of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr).]

Categorization of Speech Sounds

• A sound source can be created with either the vocal folds or a constriction in the vocal tract.
• Speech sounds can also be classified from the following perspectives:
    1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
    2. The shape of the vocal tract (place and manner of articulation):
        • the place of the tongue hump along the oral tract, and
        • the degree of constriction of the hump.
        • The shape is also determined by a possible connection to the nasal passage by way of the velum.
    3. The time-domain waveform, which gives the pressure change with time at the lips output.
    4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

• Phoneme – a fundamental distinctive unit of a language.
    • To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
• Different languages contain different phoneme sets.
    • Syllables contain one or more phonemes.
    • Words are formed from one or more syllables.
    • Phrases are concatenations of words.
• If the first two perspectives (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
• If the last two perspectives (the waveform and the spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

• One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
• This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

• Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
• Articulatory features (corresponding to the first two category descriptors) include:
    • Vocal fold state: vibrating or open.
    • Tongue position and height: front, central, or back along the palate.
    • Constriction: partial or complete.
    • Velum state: nasal sound or not a nasal sound.

Elements of a Language

• In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
    • 11 in Polynesian.
    • 141 in the "click" language of Khoisan.
• The rules of a language define which phones can be strung together, and how, to form words.
    • In Italian, consonants are not normally allowed at the end of words.
• A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
• The articulatory properties are influenced by:
    • adjacent phonemes,
    • speaking rate,
    • emphasis in speaking, and
    • the time-varying nature of the articulators.
• The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
    • Example: "butter", "but", and "to", where the t in each word is somewhat different.
• Motor theory of perception – uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

• Source: quasi-periodic.
    • Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
• System: each vowel phoneme corresponds to a different vocal tract configuration.
• Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
• Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
• In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
    • Articulatory differences between speakers are one cause of allophonic variations:
        • the place and degree of constriction of the tongue hump, and
        • the cross-section and length of the vocal tract.
    • Therefore the vocal tract formants will vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

• Source: quasi-periodic airflow puffs from the vibrating vocal folds.
• System:
    • The velum is lowered and the air flows mainly through the nasal cavity.
    • Because the oral tract is constricted, the sound is radiated at the nostrils.
    • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
• Spectrogram:
    • Dominated by the low resonance of the large volume of the nasal cavity.
    • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
    • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
    • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

• There are two broad classes of fricatives: voiced and unvoiced.
• Source:
    • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
    • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
    • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
    • The constriction is narrower than with vowels.
• System: the location of the constriction determines which sound is produced:
    • with the tongue at the back, center, or front of the oral tract, as well as
    • at the teeth or lips.
• Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

• A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

  where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.
• Simplified model of the periodic component of the voiced fricative at the lips:

$$x_g[n] = h[n] * (g[n] * p[n])$$

  where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].
• The noise source component is modeled as the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4 (cont.)

• In the simplified model we assume that the results of the two airflow sources add (* denotes convolution):

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

• See Exercise 3.10 for special characteristics of x[n].
• Issues that have been ignored:
    • u[n] is modified by the oral cavity.
    • x_q[n] can be influenced by the back cavity.
    • Sources of non-linear effects (distributed sources due to traveling vortices).
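The two-source model of Example 3.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the glottal pulse shape, the damped-cosine impulse responses standing in for h[n] and h_f[n], the pitch period, and the helper names `convolve`, `resonator`, and `voiced_fricative` are all made up for the example and do not come from the slides.

```python
import math
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def resonator(freq, decay, length):
    """Damped cosine used as a placeholder cavity impulse response (assumed)."""
    return [(decay ** n) * math.cos(freq * n) for n in range(length)]

def voiced_fricative(N, P, seed=0):
    rng = random.Random(seed)
    p = [1.0 if n % P == 0 else 0.0 for n in range(N)]        # impulse train p[n]
    g = [0.5 * (1 - math.cos(2 * math.pi * n / 20))           # one glottal cycle
         for n in range(20)]                                  # (assumed pulse shape)
    u = convolve(g, p)[:N]                                    # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(N)]               # white noise q[n]
    h = resonator(0.3, 0.95, 60)                              # vocal tract h[n] (assumed)
    hf = resonator(2.0, 0.8, 30)                              # front cavity h_f[n] (assumed)
    xg = convolve(h, u)[:N]                                   # periodic component
    xq = convolve(hf, [qi * ui for qi, ui in zip(q, u)])[:N]  # modulated-noise component
    x = [a + b for a, b in zip(xg, xq)]                       # x[n] = x_g[n] + x_q[n]
    return x, xg, xq, u

N, P = 400, 80
x, xg, xq, u = voiced_fricative(N, P)
```

In this sketch the noise is switched on and off by u[n], so x_q[n] appears in bursts locked to the pitch period, while x_g[n] repeats exactly every P samples once the initial transient has passed.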

Elements of a Language: Fricatives

• Spectrogram:
    • Unvoiced fricatives are characterized by a "noisy" spectrum, while
    • voiced fricatives often show both noise and harmonics.
• Waveform:
    • An unvoiced fricative contains only noise.
    • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

• Forms a class of its own under the general category of consonants.
• Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

• Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
• As with fricatives, plosives can be voiced or unvoiced.
• System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
• Sequence of events:
    1. Complete closure of the oral tract and buildup of air pressure.
    2. Release of air pressure and generation of turbulence over a very short time duration.
    3. Generation of aspiration due to turbulence at the open vocal folds.
    4. Onset of the following vowel about 40-50 ms after the burst.
• With voiced plosives, the vocal folds vibrate for the duration of all four steps.
    • While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
    • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
    • There is a much shorter delay between the burst and the voicing of the vowel onset.
• Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

• A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
    • Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n].
    • The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

      where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.
• Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
    • This implies that the vocal tract output cannot be obtained with the convolution operator.
    • The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
• In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response h_f[n, m].
    • h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
• The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n-m]$$

• We have assumed that the two outputs can be linearly combined.
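The generalized convolution of Example 3.5 can be evaluated directly from its definition. The sketch below is a hedged illustration, not the author's implementation: the responses are truncated to M samples, and the particular `h` (a resonance that glides in frequency) and `hf` (a fast-decaying front cavity), the glottal pulse shape, and the function name `time_varying_output` are invented for the example.

```python
import math

def time_varying_output(h, hf, u, M):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) delta[n - m], 0 <= m < M.

    h(n, m) and hf(n, m) return the response at time n to a unit sample
    applied m samples earlier; truncation to M samples is an assumption."""
    x = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m] for m in range(min(M, n + 1)))
        burst = hf(n, n) if n < M else 0.0   # delta[n - m] is nonzero only at m = n
        x.append(periodic + burst)
    return x

# Made-up responses: the tract resonance glides from 0.6 to 0.3 rad/sample
# while the articulators move from the burst toward the steady vowel.
def h(n, m):
    omega = 0.6 - 0.3 * min(n, 200) / 200
    return (0.95 ** m) * math.cos(omega * m)

def hf(n, m):
    return (0.7 ** m) * math.cos(2.0 * m)

P = 80                                                               # assumed pitch period
g = [0.5 * (1 - math.cos(2 * math.pi * k / 20)) for k in range(20)]  # assumed glottal pulse
u = [g[n % P] if n % P < 20 else 0.0 for n in range(400)]            # u[n] = g[n] * p[n]
x = time_varying_output(h, hf, u, M=60)
```

When h(n, m) does not depend on n, the first sum reduces to an ordinary convolution, which gives a simple sanity check for the routine.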

Elements of a Language: Transitional Speech Sounds

• Diphthongs:
    • Vowel-like in nature, with vibrating vocal folds.
    • Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
    • Characterized by movement from one vowel target to another.
    • Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds

• Semi-vowels: two categories of vowel-like sounds:
    • glides (w as in "we" and y as in "you"), and
    • liquids (r as in "read" and l as in "let").
• Glides, compared to diphthongs, have:
    • greater constriction of the oral tract during the transition, and
    • greater speed of the oral tract movement.

Figure 3.28 – Liquids & Glides

Elements of a Language: Transitional Speech Sounds

• Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
• Compared to fricatives, affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
• Examples:
    • tS as in "chew" (unvoiced)
    • J as in "just" (voiced)

Coarticulation

• Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
• Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
• Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
• Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
• Coarticulation can occur on different temporal levels:
    • Local – the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: "horse" vs. "horseshoe", "sweep" vs. "seep".
    • Global – articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

• The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
    • intonation (change in pitch),
    • amplitude/energy (loudness), and
    • timing (articulation rate or rhythm).
• These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 – Global Coarticulation


Source Generation

• The previous section discussed the effect of vocal tract shape on sound production.
• Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate).
    • This closure is required when making an impulsive sound (plosives):
        • build-up of pressure behind the palate, and
        • abrupt release of pressure.

Source Generation: Plosives, "Drop"

[Figure labels: VOT (voice onset time), Aspiration]

Fricatives

Source Generation

• Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), generating turbulence and thus a noise source (e.g., fricatives).
• As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
    • There is no harmonic structure with these types of inputs.
    • The source spectrum is shaped at all frequencies by |H(ω)|.
• Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

• There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
    • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
    • A vortex can be thought of as a tiny rotational airflow in the oral tract.
• There is evidence that sources due to vortices influence the temporal and spectral, and perhaps perceptual, characteristics of speech sounds.

Categorization of Sound By Source

• Voiced: speech sounds generated with a periodic glottal source.
• Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
    • Fricatives – sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
    • Plosives – created with an impulsive source within the oral tract. Example: "top".
    • Whispers – a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".
• However, these sound categories do not relate exclusively to the source state: the vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above may also exist for voiced sounds. Examples:
    • "zebra" vs. "sheba" (fricatives)
    • "bin" vs. "pin" (plosives)

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

• A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
    • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
• In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

• Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
• This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
    • Example: the Hamming window

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right], \quad 0 \le n - \tau \le N_w - 1$$

• The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
• The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

  where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

  represents the windowed speech segments as a function of the window position τ.
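The window and the short-time Fourier transform above translate almost line by line into code. The following Python sketch is illustrative only: it evaluates the sum directly rather than with an FFT, treats τ as the first sample under the window, and uses a made-up test signal, frequency grid, and frame interval. The names `hamming`, `stft_frame`, and `hop` are assumptions of the example.

```python
import cmath
import math

def hamming(Nw):
    """w[m] = 0.54 - 0.46 cos(2 pi m / (Nw - 1)) for 0 <= m <= Nw - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (Nw - 1)) for m in range(Nw)]

def stft_frame(x, tau, w, n_freqs):
    """Samples of X(omega, tau) = sum_n w[n - tau] x[n] e^{-j omega n}
    at omega_k = pi * k / n_freqs, k = 0 .. n_freqs - 1."""
    frame = []
    for k in range(n_freqs):
        omega = math.pi * k / n_freqs
        acc = 0j
        for m, wm in enumerate(w):
            n = tau + m
            if n < len(x):
                acc += wm * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame

# Test signal: a cosine at omega_0 = pi * 8 / 32 (illustrative values).
n_freqs, k0 = 32, 8
x = [math.cos(math.pi * k0 / n_freqs * n) for n in range(256)]
w = hamming(64)
hop = 32    # frame interval in samples (assumed)
frames = [stft_frame(x, tau, w, n_freqs)
          for tau in range(0, len(x) - len(w) + 1, hop)]

# Frequency bin with the largest magnitude in each frame.
peak_bins = [max(range(n_freqs), key=lambda k: abs(f[k])) for f in frames]
print(peak_bins)
```

Choosing `hop` smaller than the window length makes successive frames overlap, which is the usual way to trade computation against how finely the temporal structure is sampled.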

Spectrographic Analysis of Speech

• The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

• S(ω, τ) is a 2-D (two-dimensional) representation of the "energy density" of the signal.
    • For each window position τ one could plot S(ω, τ).
    • A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
• This display is illustrated (as a caricature) in Figure 3.14, which also shows two kinds of spectrograms:
    • Narrowband – gives good spectral resolution: a good view of the frequency content of sine waves with closely spaced frequencies.
    • Wideband – gives good temporal resolution: a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

• For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n − kP]:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

• Here the glottal waveform over a cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n], where * denotes convolution.
• From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

  where

$$\tilde{H}(\omega) = H(\omega)\,G(\omega), \qquad \omega_k = \frac{2\pi}{P}\,k,$$

  and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech

• The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window w[n, τ].
• Narrowband spectrogram:
    • Uses a "long" window with a duration of typically at least two pitch periods.
    • Under the conditions that
        • the main lobes of the shifted window Fourier transforms are non-overlapping, and
        • the corresponding transform side lobes are negligible,
      the following approximation follows from the equation in the previous slide (Exercise 3.8), where ω_k = 2πk/P are the harmonic frequencies:

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

• Harmonic lines are "resolved", appearing as horizontal striations in the time-frequency plane of the spectrogram.
• The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
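The effect of window length on harmonic resolution can be seen with a bare impulse train. The sketch below is a simplified illustration, not the slides' MATLAB code: it assumes a pitch period of 80 samples, omits the vocal tract entirely, and compares the spectrogram value at a harmonic with the value halfway between two harmonics. The function name `spectrogram_value` and all numeric choices are hypothetical.

```python
import cmath
import math

def hamming(Nw):
    """Hamming window of length Nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (Nw - 1)) for m in range(Nw)]

def spectrogram_value(x, tau, Nw, omega):
    """S(omega, tau) = |sum_n w[n - tau] x[n] e^{-j omega n}|^2, Hamming window
    of length Nw whose first sample is at n = tau."""
    w = hamming(Nw)
    X = sum(w[m] * x[tau + m] * cmath.exp(-1j * omega * (tau + m)) for m in range(Nw))
    return abs(X) ** 2

P = 80                                                    # assumed pitch period (samples)
x = [1.0 if n % P == 0 else 0.0 for n in range(10 * P)]   # impulse train p[n]

harmonic = 2 * math.pi * 5 / P      # omega_k for k = 5
between = 2 * math.pi * 5.5 / P     # halfway between harmonics 5 and 6

# Long window (four pitch periods): the harmonic stands out from its surroundings.
long_on = spectrogram_value(x, 100, 4 * P, harmonic)
long_off = spectrogram_value(x, 100, 4 * P, between)

# Short window (half a pitch period) holding one pulse: no harmonic structure.
short_on = spectrogram_value(x, 140, P // 2, harmonic)
short_off = spectrogram_value(x, 140, P // 2, between)

print(long_on, long_off, short_on, short_off)
```

With the long window the value between harmonics is far smaller than the value on a harmonic, while the short window, seeing a single pulse, returns the same value at both frequencies.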

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of SpeechWideband Spectrogram

Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform

(recall the uncertainty principle) Widening of Fourier transform will cause neighboring

harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions

From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)

Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window

][)(~)(2

EHS k

n

nxE 2][][

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather

than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding

through fluctuating energy regions of the speech waveform

Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

April 22 2023 Veton Keumlpuska 66

Figure 315

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  • The constriction is narrower than with vowels.

System
  • The location of the constriction made by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram
  • Noise-like.
  • Energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component is the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

In the simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of nonlinear effects (distributed sources due to traveling vortices).
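The voiced fricative model of Example 3.4 can be sketched numerically. The following is a minimal illustration, not the author's implementation: the raised-cosine glottal pulse, the damped-cosine impulse responses, and all numeric parameters are assumptions chosen only to make the signal flow visible.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def damped_cosine(length, freq, decay):
    """Hypothetical impulse response: a decaying cosine at normalized frequency `freq`."""
    return [(decay ** n) * math.cos(2 * math.pi * freq * n) for n in range(length)]


def voiced_fricative(length=400, period=80, seed=0):
    """Return (u, xg, xq, x) for the additive model x = h*u + hf*(q.u)."""
    rng = random.Random(seed)
    # g[n]: one glottal cycle, here a raised-cosine pulse shorter than the pitch period (assumed shape)
    g = [0.5 * (1.0 - math.cos(2 * math.pi * n / 39)) for n in range(40)]
    # p[n]: impulse train with pitch period P = `period`
    p = [1.0 if n % period == 0 else 0.0 for n in range(length)]
    u = convolve(g, p)[:length]                        # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]   # white noise q[n]
    h = damped_cosine(60, 0.05, 0.95)                  # vocal tract h[n] (assumed)
    hf = damped_cosine(30, 0.30, 0.80)                 # front cavity h_f[n] (assumed)
    xg = convolve(h, u)[:length]                       # periodic component x_g[n]
    modulated = [qn * un for qn, un in zip(q, u)]      # q[n] u[n]
    xq = convolve(hf, modulated)[:length]              # noise component x_q[n]
    x = [a + b for a, b in zip(xg, xq)]
    return u, xg, xq, x
```

Because the noise is multiplied by $u[n]$, it vanishes wherever the glottal flow is zero, so in this sketch the noise excitation arrives in bursts synchronized with the pitch period.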

Elements of a Language: Fricatives

Spectrogram
  • Unvoiced fricatives are characterized by a "noisy" spectrum, while
  • voiced fricatives often show both noise and harmonics.

Waveform
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System
  • The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the onset of voicing of the vowel.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:

$$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$
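The generalized convolution of Example 3.5 can be evaluated directly once the time-varying responses are available as functions of $(n, m)$. The sketch below assumes causal signals starting at $n = 0$; the `gliding_resonance` helper and its parameters are hypothetical stand-ins for a vocal tract whose resonance moves from the burst toward a steady vowel.

```python
import math


def time_varying_output(h, hf, u, length):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) delta[n - m], for n = 0 .. length - 1."""
    x = []
    for n in range(length):
        periodic = sum(h(n, m) * u[n - m] for m in range(n + 1))
        # delta[n - m] is nonzero only at m = n, so the burst sum collapses to hf(n, n)
        burst = hf(n, n)
        x.append(periodic + burst)
    return x


def gliding_resonance(f_start, f_end, glide_len, decay):
    """Hypothetical time-varying response: a damped cosine whose frequency glides with n."""
    def response(n, m):
        frac = min(n, glide_len) / glide_len
        freq = f_start + (f_end - f_start) * frac
        return (decay ** m) * math.cos(2 * math.pi * freq * m)
    return response


# Crude stand-in for u[n] = g[n] * p[n]: an impulse train with pitch period 50
u = [1.0 if n % 50 == 0 else 0.0 for n in range(200)]
x = time_varying_output(gliding_resonance(0.02, 0.06, 100, 0.97),
                        gliding_resonance(0.30, 0.30, 1, 0.80), u, 200)
```

When `h(n, m)` does not depend on `n`, the first sum reduces to ordinary convolution, which gives a simple sanity check for the routine.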

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
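The movement from one vowel target to another can be pictured as a trajectory of formant frequencies. The sketch below is only an illustration under a strong simplifying assumption (linear interpolation of the formants); the formant values in the usage line are hypothetical, not measurements.

```python
def formant_trajectory(start, end, steps):
    """Linearly interpolate each formant frequency (Hz) between two vowel targets."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        tuple(s + (e - s) * i / (steps - 1) for s, e in zip(start, end))
        for i in range(steps)
    ]


# Hypothetical (F1, F2) targets for the start and end of a diphthong
trajectory = formant_trajectory((700.0, 1200.0), (300.0, 2300.0), 5)
```

Real articulator movement is smooth but not linear, so this should be read as a picture of the idea rather than a production model.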

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
  • Glides (w as in "we" and y as in "you"), and
  • Liquids (r as in "read" and l as in "let").

Glides, compared to diphthongs, have
  • greater constriction of the oral tract during the transition, and
  • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey differences in meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation
Page 46: Speech Processing

Source Generation: Plosives, "Drop"

(Figure annotations: VOT, aspiration)

Fricatives

Source Generation

  • Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded); the constriction generates turbulence and thus a noise source (e.g., fricatives).
  • As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
    • There is no harmonic structure with these types of input.
    • The source spectrum is shaped at all frequencies by $|H(\omega)|$.
  • Note that the noise source was idealized as having a flat spectrum. In reality, these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

  • There is another class of source generated within the vocal tract; however, it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source

  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source.

There is a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  • Plosives: created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".

However, these sound classes are not exclusively unvoiced. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the above subcategories may also exist for voiced sounds. Examples:
  • "zebra" vs. "sheba" (fricatives)
  • "bin" vs. "pin" (plosives)

Categorization of Sound by Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

  • A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
    • Example: the word "to".
  • A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

  • In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  • This window $w[n, \tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example, the Hamming window:

$$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n \le N_w - 1$$

  • The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
  • The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window center at time $\tau$.
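The sliding-window analysis above can be sketched in a few lines of standard-library Python. This is an illustrative implementation under stated assumptions, not the deck's MATLAB code: the function names, the `hop` (frame interval) and `nfft` parameters are choices made here, and each frame's phase is referenced to the window start rather than to absolute time, which leaves the magnitudes unchanged.

```python
import cmath
import math


def hamming(nw):
    """Hamming window of length nw: 0.54 - 0.46 cos(2 pi n / (nw - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


def stft(x, nw, hop, nfft):
    """One DFT (bins 0 .. nfft // 2) per window position; positions advance by `hop` samples."""
    w = hamming(nw)
    frames = []
    for tau in range(0, len(x) - nw + 1, hop):
        segment = [w[n] * x[tau + n] for n in range(nw)]   # windowed speech segment
        frames.append([
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / nfft) for n in range(nw))
            for k in range(nfft // 2 + 1)
        ])
    return frames


def spectrogram(x, nw, hop, nfft):
    """Squared magnitude of every STFT frame."""
    return [[abs(v) ** 2 for v in frame] for frame in stft(x, nw, hop, nfft)]
```

A direct DFT like this costs on the order of `nw * nfft` operations per frame; it is written for readability, and a practical implementation would use an FFT.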

Spectrographic Analysis of Speech

  • The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

  • $S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
  • For each window position $\tau$ one could plot $S(\omega, \tau)$.
  • A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
    • Narrowband, which gives good spectral resolution: a good view of the frequency content of sine waves with closely spaced frequencies.
    • Wideband, which gives good temporal resolution: a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech

  • The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that
    • the main lobes of the shifted window Fourier transforms are non-overlapping, and
    • the corresponding transform side lobes are negligible,
    the following approximation follows from the equation in the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \left| \hat{H}(\omega_k) \right|^2 \left| W(\omega - \omega_k, \tau) \right|^2$$

Spectrographic Analysis of Speech

Narrowband Spectrogram (cont.)
  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
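The effect of window length on harmonic resolution can be checked on the simplest periodic source, an impulse train. The sketch below is illustrative only: the pitch period of 80 samples, the window lengths (four pitch periods and half a pitch period), and the choice of harmonic `k = 3` are assumptions made for the demonstration.

```python
import cmath
import math


def hamming(nw):
    """Hamming window of length nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


def dtft_magnitude(segment, omega):
    """|X(omega)| of a finite segment, evaluated directly from the DTFT sum."""
    return abs(sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(segment)))


def harmonic_vs_midpoint(period, nw, start, k=3):
    """Window an impulse train; return |X| at harmonic k and |X| midway to harmonic k + 1."""
    w = hamming(nw)
    segment = [w[n] if (start + n) % period == 0 else 0.0 for n in range(nw)]
    at_harmonic = dtft_magnitude(segment, 2 * math.pi * k / period)
    at_midpoint = dtft_magnitude(segment, 2 * math.pi * (k + 0.5) / period)
    return at_harmonic, at_midpoint


long_h, long_m = harmonic_vs_midpoint(period=80, nw=320, start=0)    # four pitch periods
short_h, short_m = harmonic_vs_midpoint(period=80, nw=40, start=60)  # half a pitch period
```

With the long window the spectrum nearly vanishes between harmonics, so the harmonic lines stand out. With the short window only one pulse falls under the window, so the magnitude is the same at every frequency and no harmonic structure is visible.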

Spectrographic Analysis of Speech

Wideband Spectrogram
  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms of neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)
  • For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta \left| \hat{H}(\omega) \right|^2 E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} \left| x[n, \tau] \right|^2$$

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)
  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  • The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
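The origin of the vertical striations can be illustrated by computing the energy under a sliding window for a short and a long window. This is a rough sketch under stated assumptions: the test signal (a decaying response restarted every 80 samples) is only a crude stand-in for a steady voiced sound, and the window lengths of 32 and 160 samples are arbitrary choices below one pitch period and equal to two pitch periods.

```python
import math


def hamming(nw):
    """Hamming window of length nw."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


def short_time_energy(x, nw):
    """Energy of the windowed segment for every window start tau (hop of one sample)."""
    w2 = [v * v for v in hamming(nw)]
    return [
        sum(w2[n] * x[tau + n] ** 2 for n in range(nw))
        for tau in range(len(x) - nw + 1)
    ]


def fluctuation(energy):
    """Ratio of the largest to the smallest short-time energy."""
    return max(energy) / min(energy)


period = 80
# Crude stand-in for a steady voiced sound: a decaying response restarted every pitch period
x = [0.9 ** (n % period) for n in range(800)]
wide = fluctuation(short_time_energy(x, nw=32))      # window shorter than one pitch period
narrow = fluctuation(short_time_energy(x, nw=160))   # window spanning two pitch periods
```

In this sketch the short-window energy swings by several orders of magnitude within each pitch period, which is what appears as vertical striations, while the long-window energy stays within a small factor.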

Figure 3.15

Figure 3.16

MATLAB
(Figure: spectrogram; axes Time [s], 0 to 3, and Frequency [Hz], 0 to 8000)

MATLAB
(Figure: waveform, normalized magnitude vs. Time [s], and spectrogram, Frequency [Hz] vs. Time [s], of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels)

MATLAB
(Figure: spectrogram; axes Time [s], 0 to 3, and Frequency [Hz], 0 to 8000)

MATLAB
(Figure: waveform and spectrogram of the utterance "She had your dark suit in greasy wash water all year", annotated with phone and word labels)

Categorization of Speech Sounds

  • A sound source can be created with either the vocal folds or a constriction in the vocal tract.
  • Speech sounds can also be classified from the following perspectives:
    1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
    2. The shape of the vocal tract (place and manner of articulation):
       • the place of the tongue hump along the oral tract, and
       • the degree of constriction of the hump.
       • The shape is also determined by a possible connection to the nasal passage by way of the velum.
    3. The time-domain waveform, which gives the pressure change with time at the lips output.
    4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

  • Phoneme: a fundamental distinctive unit of a language.
    • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
    • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

  • Studying speech sounds through the first two descriptors (source and vocal tract shape) is referred to as articulatory phonetics.
  • Studying speech sounds through the last two descriptors (waveform and spectrogram) is referred to as acoustic phonetics.

Elements of a Language

  • One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
  • This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

  • Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
  • Articulatory features (corresponding to the first two category descriptors) include:
    • Vocal fold state: vibrating or open.
    • Tongue position and height: front, central, or back along the palate.
    • Constriction: partial or complete.
    • Velum state: nasal sound or not a nasal sound.

Elements of a Language

  • In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
    • 11 in Polynesian
    • 141 in the "click" language of Khoisan
  • The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
  • The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
  • Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

Source
  • Quasi-periodic.
  • Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

April 22 2023 Veton Keumlpuska 87

Elements of a Language Fricatives Spectrogram

Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while

Voiced fricatives often show both noise and harmonics

Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on

quasi-periodic signal

Elements of a Language Whisper Whisper

Forms a class of its own under general category of Consonants

Turbulent flow is produced at the glottis rather than at the vocal tract constriction

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 324 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 323

April 22 2023 Veton Keumlpuska 91

Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however

brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced

System Constriction can occur at

Front Center or Back of the oral tract (Figure 324)

Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst

With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no

aspiration There is much shorter delay between the burst and the voicing of the vowel onset

Figure 326 compares voicedunvoiced plosive pair

April 22 2023 Veton Keumlpuska 92

Elements of a Language Plosives Waveform

April 22 2023 Veton Keumlpuska 93

Elements of a Language Plosives Spectrogram

April 22 2023 Veton Keumlpuska 94

Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive

Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept

introduced in Chapter 2

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]



Fricatives

Source Generation

Another sound source is created when the tongue is very close to the palate without completely blocking the airflow. The narrow constriction generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal source, spectral shaping also occurs for either type of input (i.e., impulse or noise source).
  • There is no harmonic structure with these types of inputs.
  • The source spectrum is shaped at all frequencies by $|H(\omega)|$.

Note that the noise spectrum was idealized as flat. In reality these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives, "NASA"

Source Generation

There is another class of source generated within the vocal tract; it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  • A vortex can be thought of as a tiny rotational airflow in the oral tract.
  • There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound By Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds:
  • Fricatives: generated from the friction of moving air against an oral tract constriction. Example: "thin".
  • Plosives: created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillation. Example: "he".

However, these source categories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the subcategories above also exist for voiced sounds. Examples:
  • Fricatives: "zebra" (voiced) vs. "sheba" (unvoiced)
  • Plosives: "bin" (voiced) vs. "pin" (unvoiced)

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: for the word "to", a single Fourier transform of the entire acoustic signal cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2 presented earlier, the concept of a sliding (analysis) window was introduced.
  • The window $w[n,\tau]$ is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example, Hamming window:

$$w[n,\tau] = 0.54 - 0.46\cos\left[\frac{2\pi(n-\tau)}{N_w-1}\right] \quad \text{for } 0 \le n-\tau \le N_w-1$$

  • The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n}$$

where $x[n,\tau] = w[n,\tau]\,x[n]$ represents the windowed speech segment as a function of the window position at time $\tau$.
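As a rough illustration of the windowing and transform described above, the following Python sketch evaluates one short-time Fourier transform frame with a Hamming window. The function names (`hamming`, `stft_frame`), the uniform frequency grid, the test signal, and the treatment of samples beyond the signal as zero are assumptions made for this example and do not come from the slides.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w (n_w >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft_frame(x, tau, n_w, n_freq):
    """Evaluate X(omega, tau) at omega = 2*pi*k/n_freq for k = 0..n_freq-1.

    The window starts at sample tau; samples outside x count as zero.
    """
    w = hamming(n_w)
    frame = []
    for k in range(n_freq):
        omega = 2 * math.pi * k / n_freq
        acc = 0j
        for i in range(n_w):
            n = tau + i
            if 0 <= n < len(x):
                # x[n, tau] = w[n - tau] * x[n], weighted by exp(-j*omega*n)
                acc += w[i] * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame


# Usage: a cosine at 1/8 cycle per sample, analyzed with a 64-sample window.
x = [math.cos(2 * math.pi * n / 8) for n in range(256)]
magnitudes = [abs(v) for v in stft_frame(x, tau=32, n_w=64, n_freq=64)]
peak_bin = max(range(33), key=lambda k: magnitudes[k])  # search 0..pi only
```

With 64 frequency samples, the cosine at 1/8 cycle per sample falls on bin 8, which is where the magnitude peaks. Moving `tau` by a chosen frame interval and repeating the call gives the sequence of frames from which a spectrogram is built.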

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$$S(\omega,\tau) = |X(\omega,\tau)|^2$$

  • $S(\omega,\tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position $\tau$ one could plot $S(\omega,\tau)$ as a function of $\omega$. A more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
      Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
      Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$:

$$x[n,\tau] = w[n,\tau]\,\{(p[n]*g[n])*h[n]\} = w[n,\tau]\,(p[n]*\hat{h}[n])$$

where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n]*h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2$$

where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
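The source-filter model of voiced speech above can be sketched numerically, before any windowing is applied. In the Python example below the pitch period, the glottal pulse `g`, and the vocal tract response `h` are toy values chosen only for illustration; they are not taken from the slides.

```python
import math


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


P = 50                                                  # pitch period in samples (assumed)
p = [1.0 if n % P == 0 else 0.0 for n in range(200)]    # impulse train p[n]
g = [0.2, 0.6, 1.0, 0.4, 0.1]                           # toy glottal pulse g[n] (assumed)
h = [0.9 ** n * math.cos(0.6 * n) for n in range(30)]   # toy vocal tract h[n] (assumed)

h_hat = convolve(g, h)                # combined response g[n] * h[n]
x = convolve(convolve(p, g), h)       # (p[n] * g[n]) * h[n]
x_alt = convolve(p, h_hat)            # p[n] * h_hat[n]
```

Both orderings give the same waveform (up to rounding), which is why the glottal pulse and the vocal tract response can be folded into a single response. Because the combined response here is shorter than the pitch period, the output simply repeats it every `P` samples.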

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n,\tau]$.

Narrowband Spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms do not overlap and that the corresponding transform side lobes are negligible, the equation on the previous slide gives the following approximation (Exercise 3.8):

$$S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k,\tau)|^2$$
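The step from the squared magnitude of a sum to a sum of squared magnitudes can be sketched as follows. The shorthand $a_k$ for the harmonic amplitude (the combined glottal and vocal tract response evaluated at $\omega_k$) and $W_k(\omega) = W(\omega-\omega_k,\tau)$ is introduced here only for this sketch.

```latex
\begin{align*}
\left|\sum_k a_k W_k(\omega)\right|^2
  &= \sum_k \sum_l a_k a_l^{*}\, W_k(\omega)\, W_l^{*}(\omega) \\
  &= \sum_k |a_k|^2\, |W_k(\omega)|^2
     + \sum_k \sum_{l \neq k} a_k a_l^{*}\, W_k(\omega)\, W_l^{*}(\omega) \\
  &\approx \sum_k |a_k|^2\, |W_k(\omega)|^2
\end{align*}
```

The cross terms are dropped because, with non-overlapping main lobes and negligible side lobes, $W_k(\omega)\,W_l^{*}(\omega) \approx 0$ for $l \neq k$ at every $\omega$. Multiplying by $1/P^2$ then gives the narrowband approximation.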

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • A long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega,\tau) \approx \beta\,|\hat{H}(\omega)|^2\,E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2$$
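To make the short-time energy term concrete, the Python sketch below computes the energy under a short Hamming window as it slides across a toy periodic waveform. The waveform (a decaying pulse repeated every pitch period), the window length, and the function names are assumptions for illustration only.

```python
import math


def hamming(n_w):
    """Hamming window of length n_w (n_w >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def short_time_energy(x, tau, n_w):
    """Energy of the Hamming-windowed segment that starts at sample tau."""
    w = hamming(n_w)
    return sum((w[i] * x[tau + i]) ** 2 for i in range(n_w) if 0 <= tau + i < len(x))


P = 40                                         # pitch period in samples (assumed)
x = [0.9 ** (n % P) for n in range(4 * P)]     # toy decaying pulse every P samples
energy = [short_time_energy(x, tau, n_w=10) for tau in range(3 * P)]
```

Because the window is shorter than the pitch period, the energy is large when the window covers the start of a pulse and small when it covers the decayed tail, and this pattern repeats every `P` samples. That periodic fluctuation is what shows up as vertical striations in a wideband spectrogram.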

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
  • Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
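The contrast between the two window lengths can be checked on an idealized source. The Python sketch below evaluates the spectrogram of a pure impulse train at a harmonic frequency and halfway between two harmonics, once with a long window and once with a short one. The pitch period, the window lengths (four pitch periods and half a pitch period), and the window positions are assumptions chosen for this example.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w (n_w >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def spectrogram_value(x, tau, n_w, omega):
    """Squared STFT magnitude for a Hamming window starting at sample tau."""
    w = hamming(n_w)
    acc = sum(w[i] * x[tau + i] * cmath.exp(-1j * omega * (tau + i)) for i in range(n_w))
    return abs(acc) ** 2


P = 40                                                  # pitch period in samples (assumed)
x = [1.0 if n % P == 0 else 0.0 for n in range(400)]    # idealized periodic source

on_harmonic = 2 * math.pi * 3 / P      # third harmonic
between = 2 * math.pi * 3.5 / P        # halfway to the fourth harmonic

# Long window (four pitch periods): energy sits on the harmonics.
long_on = spectrogram_value(x, 100, 4 * P, on_harmonic)
long_between = spectrogram_value(x, 100, 4 * P, between)

# Short window (half a pitch period) holding a single pulse: flat in frequency.
short_on = spectrogram_value(x, 110, P // 2, on_harmonic)
short_between = spectrogram_value(x, 110, P // 2, between)
```

With the long window the value on the harmonic is far larger than the value between harmonics, so the harmonic lines are resolved. With the short window the two values are essentially equal, so no harmonic structure is visible at that window position.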

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (normalized magnitude vs. Time [s], 0 to 3) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

MATLAB

[Figure: spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (normalized magnitude vs. Time [s], 0 to 3) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
       the place of the tongue hump along the oral tract,
       the degree of constriction of the hump, and
       the possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

Studying speech sounds in terms of the first two descriptors (source and vocal tract shape) is referred to as articulatory phonetics.

Studying speech sounds in terms of the last two descriptors (waveform and spectrogram) is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language

In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the "click" language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words.
  • In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents). The articulatory properties are influenced by:
  • Adjacent phonemes
  • Speaking rate
  • Emphasis in speaking
  • The time-varying nature of the articulators

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  • Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

  • Source: quasi-periodic.
  • Pitch: not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  • Articulatory differences between speakers are one cause of allophonic variation.
  • The place and degree of constriction of the tongue hump, as well as the cross-section and length of the vocal tract, differ between speakers; the vocal tract formants therefore vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System:
      The velum is lowered and the air flows mainly through the nasal cavity.
      Because the oral tract is constricted, the sound is radiated at the nostrils.
      Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      Dominated by the low resonance of the large volume of the nasal cavity.
      The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
      These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.
  • Source:
      For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
  • System: the location of the constriction made by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
  • Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n]*p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic contribution to the output at the lips is

$$x_g[n] = h[n]*(g[n]*p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component, the turbulent airflow velocity at the constriction, is denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n]*(q[n]\,u[n])$$

Example 3.4 (cont.)

In the simplified model, we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n]*u[n] + h_f[n]*(q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of nonlinear effects (distributed sources due to traveling vortices).
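The simplified voiced-fricative model of Example 3.4 can be sketched in a few lines of Python. All numerical values below (pitch period, glottal pulse, the two impulse responses, and the Gaussian noise with a fixed seed) are toy assumptions for illustration; only the structure of the computation follows the example.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


random.seed(0)                                             # reproducible noise
P = 50                                                     # pitch period (assumed)
p = [1.0 if n % P == 0 else 0.0 for n in range(200)]       # impulse train p[n]
g = [0.0, 0.5, 1.0, 0.5]                                   # toy glottal pulse (assumed)
h = [0.9 ** n * math.cos(0.3 * n) for n in range(30)]      # toy vocal tract (assumed)
h_f = [0.7 ** n * math.cos(2.0 * n) for n in range(15)]    # toy front cavity (assumed)

u = convolve(g, p)                                         # u[n] = g[n] * p[n]
q = [random.gauss(0.0, 1.0) for _ in u]                    # white noise q[n]
modulated = [q_n * u_n for q_n, u_n in zip(q, u)]          # q[n] u[n]

x_g = convolve(h, u)                                       # periodic component
x_q = convolve(h_f, modulated)                             # noise component
n_out = max(len(x_g), len(x_q))
x = [(x_g[n] if n < len(x_g) else 0.0) + (x_q[n] if n < len(x_q) else 0.0)
     for n in range(n_out)]                                # sum of the two components
```

Note that the noise excitation is nonzero only where the glottal flow is nonzero, so the noise component is gated at the pitch rate. This is the modulation effect the example describes.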

Elements of a Language: Fricatives

  • Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  • Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  • Sequence of events:
      1. Complete closure of the oral tract and buildup of air pressure.
      2. Release of air pressure and generation of turbulence over a very short time duration.
      3. Generation of aspiration due to turbulence at the open vocal folds.
      4. Onset of the following vowel about 40-50 ms after the burst.
  • With voiced plosives, the vocal folds vibrate for the duration of all four steps.
      While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
      After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
      The delay between the burst and the voicing of the vowel onset is much shorter.
  • Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n=0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

$$u[n] = g[n]*p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during the transition from the burst to a following steady vowel. The vocal tract output then cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n,m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response $h_f[n,m]$.
  • $h[n,m]$ and $h_f[n,m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n-m$.
  • The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$

  • We have assumed that the two outputs can be linearly combined.
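The generalized convolution of Example 3.5 can be sketched directly from its definition. In the Python example below, the time-varying responses are represented as functions of (n, m); their shapes, the causal truncation to `m_len` terms, and the bare impulse train standing in for the glottal flow are hypothetical choices made only to keep the sketch short.

```python
def time_varying_output(h, h_f, u, n_len, m_len):
    """Sum over m of h(n, m) u[n - m], plus the burst term, for n = 0..n_len-1.

    h and h_f are callables giving the response at time n to a unit sample
    applied m samples earlier; both are taken as causal with m in 0..m_len-1.
    """
    x = []
    for n in range(n_len):
        periodic = sum(h(n, m) * u[n - m] for m in range(m_len) if 0 <= n - m < len(u))
        # delta[n - m] is nonzero only at m = n, so the burst term is h_f(n, n).
        burst = h_f(n, n) if n < m_len else 0.0
        x.append(periodic + burst)
    return x


def h(n, m):
    """Toy vocal tract response whose decay rate drifts with time n (assumed)."""
    return (0.5 + 0.004 * n) ** m


def h_f(n, m):
    """Toy front-cavity response (assumed)."""
    return 0.6 ** m


P = 25                                                   # pitch period (assumed)
u = [1.0 if n % P == 0 else 0.0 for n in range(100)]     # stand-in for the glottal flow
x = time_varying_output(h, h_f, u, n_len=100, m_len=20)
```

When `h(n, m)` does not depend on `n`, the first sum reduces to ordinary convolution, which is a convenient sanity check for the implementation.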

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
  • Glides (w as in "we" and y as in "you")
  • Liquids (r as in "read" and l as in "let")

Compared to diphthongs, glides have
  • greater constriction of the oral tract during the transition, and
  • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • Unlike a plain fricative, an affricate has a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future. The articulators at any time instant are therefore influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey differences in meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


April 22 2023 Veton Keumlpuska 48

Source Generation Another sound source is created when the tongue is

very close to the palette (but not completely impeded) used to generate turbulence and thus noise source (eg fricatives)

As with periodic glottal sound source a spectral shaping can also occur for either type of input (ie impulse or noise source) There is no harmonic structure with these types of

inputs The source spectrum is shaped at all frequencies by |H()|

Note that the spectrum of noise was idealized assuming a flat spectrum In reality these sources will themselves have a non-flat spectral shape

April 22 2023 Veton Keumlpuska 49

Source Generation Fricatives ldquoNASArdquo

April 22 2023 Veton Keumlpuska 50

Source Generation There is another class of the source type that is

generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices

with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract

Vortex can be thought off as a tiny rotational airflow in the oral tract

There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal

source Unvoiced Speech sounds not generated with periodic

glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the

moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral

tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing

the vocal folds but without oscillations Example ldquoherdquo

However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example

ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
Examples 3.1 and 3.2, presented earlier, introduced the concept of a sliding (analysis) window.
This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example: the Hamming window
\[ w[n,\tau] = 0.54 - 0.46\cos\!\left[\frac{2\pi(n-\tau)}{N_w-1}\right] \quad \text{for } 0 \le n-\tau \le N_w-1 \]
The window typically does not move one sample at a time; it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The short-time Fourier transform is
\[ X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\,e^{-j\omega n} \]
where
\[ x[n,\tau] = w[n,\tau]\,x[n] \]
represents the windowed speech segments as a function of the window position at time τ.
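As a minimal sketch of the sliding-window analysis described on this slide, the following pure-Python code applies a Hamming window at successive frame positions and takes a DFT of each windowed segment, returning the squared magnitude per frame. The function names (`hamming`, `stft_power`), the 8 kHz sampling rate, the 64-sample window, the 32-sample frame interval, and the 1 kHz test tone are illustrative assumptions, not values from the slides. Each frame is indexed from the window start, which changes only the phase of the transform, not its magnitude.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w, tapered toward both ends."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def stft_power(x, n_w, hop):
    """Return one list per frame holding |X|^2 for DFT bins k = 0 .. n_w // 2."""
    w = hamming(n_w)
    frames = []
    # The window start slides by `hop` samples (the frame interval).
    for tau in range(0, len(x) - n_w + 1, hop):
        seg = [w[n] * x[tau + n] for n in range(n_w)]  # windowed segment
        frame = []
        for k in range(n_w // 2 + 1):
            X = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_w) for n in range(n_w))
            frame.append(abs(X) ** 2)
        frames.append(frame)
    return frames

# Assumed test signal: a 1 kHz sine sampled at 8 kHz.
fs = 8000
tone_hz = 1000
x = [math.sin(2 * math.pi * tone_hz * n / fs) for n in range(400)]

S = stft_power(x, n_w=64, hop=32)
peak_bins = [frame.index(max(frame)) for frame in S]
peak_hz = [k * fs / 64 for k in peak_bins]  # bin spacing is fs / n_w = 125 Hz
```

For a stationary tone the peak bin is the same in every frame; for speech, the peak locations change from frame to frame, which is the time variation the spectrogram is meant to reveal.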

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as
\[ S(\omega,\tau) = |X(\omega,\tau)|^2 \]
S(ω, τ) is a two-dimensional (2-D) representation of the "energy density" of the signal.
For each window position τ one could plot S(ω, τ) as a function of ω.
A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
  Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n-kP]$:
\[ x[n,\tau] = w[n,\tau]\,\{(p[n]*g[n])*h[n]\} = w[n,\tau]\,(p[n]*\hat{h}[n]) \]
where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n]*h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as
\[ S(\omega,\tau) = \frac{1}{P^2}\left|\sum_{k=-\infty}^{\infty}\hat{H}(\omega_k)\,W(\omega-\omega_k,\tau)\right|^2 \]
where $\hat{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n, τ].
Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side lobes are negligible,
  the equation on the previous slide gives the following approximation (Exercise 3.8):
\[ S(\omega,\tau) \approx \frac{1}{P^2}\sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k,\tau)|^2 \]
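The step from the squared magnitude of a sum to a sum of squared magnitudes can be sketched as follows. The shorthand $a_k$ for the k-th harmonic term is introduced here only for this derivation and does not appear in the slides.

```latex
% Shorthand (assumed notation): a_k = \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau)
\begin{aligned}
\Bigl|\sum_{k} a_k\Bigr|^2
  &= \sum_{k} |a_k|^2 \;+\; \sum_{k}\sum_{l \neq k} a_k\, a_l^{*} \\
  &\approx \sum_{k} |a_k|^2
\end{aligned}
```

Each cross term contains the product $W(\omega-\omega_k,\tau)\,W^{*}(\omega-\omega_l,\tau)$ with $k \neq l$. When the main lobes of the shifted window transforms do not overlap and the side lobes are negligible, at most one of the two factors is appreciably nonzero at any given ω, so the cross terms are approximately zero and only the $|a_k|^2$ terms remain.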

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)
Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram
The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
From a temporal perspective, since the window is shorter than a pitch period, it "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
\[ S(\omega,\tau) \approx \beta\,|\hat{H}(\omega)|^2\,E[\tau] \]
where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:
\[ E[\tau] = \sum_{n=-\infty}^{\infty} |x[n,\tau]|^2 \]

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)
Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
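The contrast between the two window lengths can be illustrated numerically with an idealized voiced source. The sketch below is not the analysis behind Figure 3.15: it assumes an 8 kHz sampling rate, a pure impulse train with a 10-ms pitch period (100 Hz), a long window of 40 ms (four pitch periods, chosen here so the Hamming main lobes separate cleanly), and a short window of 4 ms. The helper name `frame_magnitude` is hypothetical.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_w - 1)) for n in range(n_w)]

def frame_magnitude(x, start, n_w, freq_hz, fs):
    """Magnitude of the transform of one Hamming-windowed frame at a single frequency."""
    w = hamming(n_w)
    omega = 2 * math.pi * freq_hz / fs
    return abs(sum(w[n] * x[start + n] * cmath.exp(-1j * omega * n) for n in range(n_w)))

fs = 8000                     # assumed sampling rate
P = 80                        # assumed pitch period in samples (100 Hz)
x = [1.0 if n % P == 0 else 0.0 for n in range(800)]   # idealized impulse train

# Long (narrowband) window: 320 samples = 40 ms, spanning four pitch periods.
nb_on_harmonic = frame_magnitude(x, 0, 320, 300.0, fs)   # 300 Hz is the 3rd harmonic
nb_between = frame_magnitude(x, 0, 320, 350.0, fs)       # midway between harmonics

# Short (wideband) window: 32 samples = 4 ms, shorter than one pitch period.
wb_on_harmonic = frame_magnitude(x, 64, 32, 300.0, fs)   # frame contains one impulse
wb_between = frame_magnitude(x, 64, 32, 350.0, fs)
wb_gap = frame_magnitude(x, 100, 32, 300.0, fs)          # frame falls between impulses
```

With the long window the magnitude at a harmonic is far larger than midway between harmonics, which corresponds to the horizontal striations. With the short window the two frequencies give the same magnitude, so harmonic structure is not resolved, while the magnitude drops to zero when the frame falls between pulses, which corresponds to the vertical striations.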

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram; axes: Time [s] (0 to 3) vs. Frequency [Hz] (0 to 8000)]

MATLAB
[Figure: waveform (normalized magnitude) and spectrogram (Time [s], 0 to 3, vs. Frequency [Hz], 0 to 8000) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

MATLAB
[Figure: spectrogram; axes: Time [s] (0 to 3) vs. Frequency [Hz] (0 to 8000)]

MATLAB
[Figure: waveform (normalized magnitude) and spectrogram (Time [s], 0 to 3, vs. Frequency [Hz], 0 to 8000) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.
Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   the place of the tongue hump along the oral tract, and the degree of constriction of the hump;
   the shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.
If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences among speakers are one cause of allophonic variation:
    the place and degree of constriction of the tongue hump, and
    the cross-section and length of the vocal tract.
  Therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract.
  Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System: the location of the constriction by the tongue or lips determines which sound is produced:
  the back, center, or front of the oral tract, as well as
  the teeth or lips.
Spectrogram: noise-like; energy is concentrated at higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
\[ u[n] = g[n] * p[n] \]
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
A simplified model of the voiced fricative output at the lips due to the periodic source is
\[ x_g[n] = h[n] * (g[n] * p[n]) \]
where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].
The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the modulated noise in turn excites the front oral cavity, which has impulse response $h_f[n]$:
\[ x_q[n] = h_f[n] * (q[n]\,u[n]) \]

Example 3.4
In the simplified model we assume that the results of the two airflow sources add:
\[ x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n]) \]
See Exercise 3.10 for special characteristics of x[n].
Issues that have been ignored:
  u[n] is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of nonlinear effects (distributed sources due to traveling vortices).
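The simplified voiced-fricative model of Example 3.4 can be sketched in a few lines of pure Python. Everything numeric here is a toy assumption made for illustration: the glottal pulse `g`, the impulse responses `h` and `h_f`, the pitch period, the Gaussian white noise, and the function names are not taken from the slides, and outputs are truncated to the input length for convenience.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def voiced_fricative(g, h, h_f, P, n_periods, seed=0):
    """Return (u, modulated_noise, x_g, x_q, x) for the two-source model."""
    L = P * n_periods
    p = [1.0 if n % P == 0 else 0.0 for n in range(L)]      # impulse train p[n]
    u = convolve(g, p)[:L]                                  # u[n] = g[n] * p[n]
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in range(L)]             # white noise q[n]
    modulated = [qn * un for qn, un in zip(q, u)]           # q[n] u[n]
    x_g = convolve(h, u)[:L]                                # periodic component
    x_q = convolve(h_f, modulated)[:L]                      # noise component
    x = [a + b for a, b in zip(x_g, x_q)]                   # x[n] = x_g[n] + x_q[n]
    return u, modulated, x_g, x_q, x

# Toy (hypothetical) sequences.
g = [0.0, 0.5, 1.0, 0.5]      # glottal flow over one cycle
h = [1.0, 0.6, 0.2]           # vocal tract impulse response
h_f = [1.0, -0.5]             # front-cavity impulse response
u, modulated, x_g, x_q, x = voiced_fricative(g, h, h_f, P=10, n_periods=5)
```

Because the noise is multiplied by u[n], it is present only while the glottis is open in this sketch, so the noise component is itself pitch-synchronous.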

Elements of a Language: Fricatives
Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives
Example 3.5: a time-varying system model for the voiced plosive.
A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by
\[ u[n] = g[n] * p[n] \]
where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond the constriction, with impulse response denoted by $h_f[n, m]$.
h[n, m] and $h_f[n, m]$ represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
The output can then be written using a generalization of the convolution operator:
\[ x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m] \]
We have assumed that the two outputs can be linearly combined.
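The generalized convolution of Example 3.5 can be sketched as follows, under assumptions that are not part of the slides: the responses are causal (m ≥ 0) and truncated at a finite lag `max_lag`, and the particular response functions `h` and `h_f` are hypothetical decaying sequences chosen only so the code runs. Because the burst is the impulse δ[n], its sum collapses to the single term with m = n.

```python
def time_varying_output(u, h, h_f, max_lag):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) delta[n - m], causal, finite lag."""
    x = []
    for n in range(len(u)):
        # Periodic component: time-varying superposition over past inputs.
        periodic = sum(h(n, m) * u[n - m] for m in range(min(n, max_lag) + 1))
        # Burst component: delta[n - m] is nonzero only for m = n.
        burst = h_f(n, n) if n <= max_lag else 0.0
        x.append(periodic + burst)
    return x

P = 8                                                    # assumed pitch period in samples
u = [1.0 if n % P == 0 else 0.0 for n in range(32)]      # idealized glottal input

def h(n, m):
    """Hypothetical vocal tract response whose decay slows as time n advances."""
    a = 0.5 + 0.4 * min(n, 20) / 20
    return a ** m

def h_f(n, m):
    """Hypothetical front-cavity response (time-invariant here for simplicity)."""
    return 0.8 * (-0.5) ** m

x = time_varying_output(u, h, h_f, max_lag=6)

# Sanity check: with a response that does not depend on n, the result
# reduces to ordinary convolution with a fixed impulse response.
h_fixed = [1.0, 0.5, 0.25]
y = time_varying_output(
    u,
    lambda n, m: h_fixed[m] if m < len(h_fixed) else 0.0,
    lambda n, m: 0.0,
    max_lag=6,
)
```

The second call illustrates the design point: the time-varying form is a strict generalization, and ordinary convolution is recovered when h[n, m] is the same for every n.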

Elements of a Language: Transitional Speech Sounds
Diphthongs
  Have a vowel-like nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Are characterized by movement from one vowel target to another.
Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds
Semi-vowels: two categories of vowel-like sounds:
  Glides (w as in "we" and y as in "you"), and
  Liquids (r as in "read" and l as in "let").
Glides, compared to diphthongs, have
  a greater constriction of the oral tract during the transition, and
  a greater speed of oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
Compared to fricatives, affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
  tS as in "chew" (unvoiced)
  J as in "just" (voiced)

Coarticulation
The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe"; "sweep" vs. "seep".
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 49: Speech Processing

April 22 2023 Veton Keumlpuska 49

Source Generation Fricatives ldquoNASArdquo

April 22 2023 Veton Keumlpuska 50

Source Generation There is another class of the source type that is

generated within the vocal tract however it is less understood than noisy and impulsive sources at oral tract constrictions This source arises from the interaction of vortices

with vocal tract boundaries such as the false vocal folds teeth or occlusions in the oral tract

Vortex can be thought off as a tiny rotational airflow in the oral tract

There is evidence that sources due to vortices influence the temporal and spectral and perhaps perceptual characteristics of speech sounds

April 22 2023 Veton Keumlpuska 51

Categorization of Sound By Source Voiced Speech sounds generated with a periodic glottal

source Unvoiced Speech sounds not generated with periodic

glottal source There are variety of unvoiced sounds Fricatives - Sounds that are generated from the friction of the

moving air against an oral tract constriction Example ldquothinrdquo Plosives ndash Created with an impulsive source within the oral

tract Example ldquotoprdquo Whispers ndash Barrier made at the vocal folds by partially closing

the vocal folds but without oscillations Example ldquoherdquo

However the unvoiced sounds do not exclusively relate to the sound source That is the Vocal folds can be vibrating simultaneously with impulsive or noisy sources Thus above subcategories may exists for voiced sounds Example

ldquozebrardquo vs ldquoshebardquo -- Fricatives ldquobinrdquo vs ldquopinrdquo -- Plosives

April 22 2023 Veton Keumlpuska 52

Categorization of Sound By Source

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech Speech waveform consists of a sequence of

different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic

signal of the word ldquotordquo cannot capture this time-varying frequency content

In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding

(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to

avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum

Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1

Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal

wherex[n]= w[n]x[n]

represents the windowed speech segments as function of the window center at time

n

njenxX ][)(

April 22 2023 Veton Keumlpuska 56

Spectrographic Analysis of Speech The spectrogram is graphically displayed as

S() = |X()|2

S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal

For each window position one could plot S() A better and more compact representation of time-frequency

display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page

This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms

Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies

Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the

output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]

x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]

Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as

frequency fundametal theis2 and 2 whereand

)()()(~where

)()(~1)(2

2

Pk

P

GHH

WHP

S

k

kk

k

April 22 2023 Veton Keumlpuska 61

Spectrographic Analysis of Speech Difference of narrowband and wideband

spectrogram is in the length of the (analysis) window w[n]

Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at

least two pitch periods Under the conditions that

The main lobes of shifted window Fourier transforms are non-overlapping and that

Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)

k

kk WHP

S 22

2 )()(~1)(

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech Narrowband Spectrogram (cont)

Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram

Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of SpeechWideband Spectrogram

Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform

(recall the uncertainty principle) Widening of Fourier transform will cause neighboring

harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions

From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)

Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window

][)(~)(2

EHS k

n

nxE 2][][

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather

than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding

through fluctuating energy regions of the speech waveform

Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

April 22 2023 Veton Keumlpuska 66

Figure 315

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   the place of the tongue hump along the oral tract,
   the degree of constriction of the hump, and
   the possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme – a fundamental distinctive unit of a language.
  To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets.
  Syllables contain one or more phonemes.
  Words are formed from one or more syllables.
  Phrases are concatenations of words.

If the first two perspectives (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two perspectives (waveform and spectrogram) are used to study speech sounds, the study is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:

  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language

In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

Rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by:
  adjacent phonemes,
  speaking rate,
  emphasis in speaking, and
  the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception – uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic.
  Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  Articulatory differences between speakers are one cause of allophonic variation.
  The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ between speakers; therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
  Dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.

System:
  The location of the constriction made by the tongue determines which sound is produced: back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram:
  Noise-like.
  Energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. In what follows, * denotes convolution. The periodic glottal flow component can be expressed as

  u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

  xg[n] = h[n] * (g[n] * p[n])

where h[n] is the impulse response of the vocal tract, modeled as a linear time-invariant system driven by the periodic signal u[n].

The noise source component models the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response hf[n]:

  xq[n] = hf[n] * (q[n] u[n])

In the simplified model we assume that the results of the two airflow sources add:

  x[n] = xg[n] + xq[n]
       = h[n] * u[n] + hf[n] * (q[n] u[n])

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  u[n] is modified by the oral cavity.
  xq[n] can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
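The additive model of Example 3.4 can be sketched numerically. The following is a minimal illustration only, not the author's implementation: the glottal pulse `g`, the impulse responses `h` and `hf`, the pitch period, and the use of Gaussian noise for q[n] are all toy assumptions chosen to keep the example short.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, h, hf, pitch_period, n_periods, seed=0):
    """Return (x, xg, xq, u) for x[n] = h[n]*u[n] + hf[n]*(q[n]u[n])."""
    length = pitch_period * n_periods
    # p[n]: impulse train with pitch period P
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]
    # u[n] = g[n] * p[n]: periodic glottal flow
    u = convolve(g, p)[:length]
    # q[n]: white noise (Gaussian here, which is an assumption)
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]
    # The glottal flow modulates the noise sample by sample
    modulated = [qn * un for qn, un in zip(q, u)]
    xg = convolve(h, u)[:length]            # periodic branch
    xq = convolve(hf, modulated)[:length]   # noise branch
    x = [a + b for a, b in zip(xg, xq)]
    return x, xg, xq, u


# Toy values (hypothetical, for illustration only)
g = [0.0, 0.5, 1.0, 0.5]
x, xg, xq, u = voiced_fricative(g, h=[1.0, 0.5], hf=[1.0, -0.9],
                                pitch_period=8, n_periods=3)
```

Because the noise is multiplied by u[n], the noise branch is silent wherever the glottal flow is zero, which is the pitch-synchronous modulation the model describes.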

Elements of a Language: Fricatives

Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  Whisper forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

  u[n] = g[n] * p[n]

where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by hf[n, m].

h[n, m] and hf[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

\[ x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m] \]

We have assumed that the two outputs can be linearly combined.
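The generalized convolution of Example 3.5 can be sketched as follows. This is a minimal illustration under stated assumptions: the two time-varying responses `h_vocal_tract` and `h_front_cavity` are hypothetical single-pole systems whose pole drifts over the first ten samples, and the responses are taken to be causal (zero for m < 0). None of these choices comes from the slides.

```python
def time_varying_response(h, source, n_out):
    """y[n] = sum over m of h(n, m) * source[n - m].

    h(n, m) is the response at time n to a unit sample applied m samples
    earlier. The system is assumed causal, and source[k] = 0 outside the list.
    """
    y = []
    for n in range(n_out):
        acc = 0.0
        for m in range(n + 1):
            k = n - m
            if k < len(source):
                acc += h(n, m) * source[k]
        y.append(acc)
    return y


def h_vocal_tract(n, m):
    """Hypothetical vocal tract: one pole drifting from 0.5 to 0.9."""
    pole = 0.5 + 0.04 * min(n, 10)
    return pole ** m


def h_front_cavity(n, m):
    """Hypothetical front cavity: one pole drifting from -0.3 to -0.5."""
    pole = -0.3 - 0.02 * min(n, 10)
    return pole ** m


P, N = 8, 32
g = [0.0, 0.5, 1.0, 0.5]                      # toy glottal pulse
u = [g[n % P] if n % P < len(g) else 0.0 for n in range(N)]  # u[n] = g[n] * p[n]
burst = [1.0]                                  # delta[n]

x_periodic = time_varying_response(h_vocal_tract, u, N)
x_burst = time_varying_response(h_front_cavity, burst, N)
x = [a + b for a, b in zip(x_periodic, x_burst)]   # linear combination
```

With an impulse as the burst source, the second sum collapses to hf[n, n], and when h does not depend on n the routine reduces to ordinary convolution.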

Elements of a Language: Transitional Speech Sounds

Diphthongs:
  Have a vowel-like nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Are characterized by movement from one vowel target to another.
  Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds.
  Glides (w as in "we" and y as in "you")
  Liquids (r as in "read" and l as in "let")

Compared to diphthongs, glides have:
  a greater constriction of the oral tract during the transition, and
  a greater speed of oral tract movement.

Figure 3.28 – Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference compared with fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
  tS as in "chew" (unvoiced)
  J as in "just" (voiced)

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  Local – the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time. Examples: "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global – the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 – Global Coarticulation


Source Generation

There is another class of source that is generated within the vocal tract; it is less well understood than the noisy and impulsive sources at oral tract constrictions.
  This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
  A vortex can be thought of as a tiny rotational airflow in the oral tract.
  There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source

Voiced: speech sounds generated with a periodic glottal source.

Unvoiced: speech sounds not generated with a periodic glottal source. There are a variety of unvoiced sounds:
  Fricatives – sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
  Plosives – sounds created with an impulsive source within the oral tract. Example: "top".
  Whispers – sounds created with a barrier made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these categories do not relate exclusively to unvoiced sounds. The vocal folds can vibrate simultaneously with impulsive or noisy sources, so the subcategories above also exist for voiced sounds. Examples: "zebra" vs. "sheba" (fricatives); "bin" vs. "pin" (plosives).

Spectrographic Analysis of Speech

A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  This window, w[n, τ], is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  Example: the Hamming window

  \[ w[n, \tau] = 0.54 - 0.46 \cos\!\left[ \frac{2\pi (n - \tau)}{N_w - 1} \right] \quad \text{for } 0 \le n - \tau \le N_w - 1 \]

  and zero otherwise.

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is then

  \[ X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n} \]

where

  \[ x[n, \tau] = w[n, \tau]\, x[n] \]

represents the windowed speech segments as a function of the window position at time τ.
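The Hamming window and the short-time Fourier transform at a single window position can be sketched as below. This is a minimal, unoptimized illustration: the function names, the choice of evaluating X(ω, τ) on a uniform grid of `n_freq` frequencies, and the convention that the window starts at sample τ are assumptions made for the example.

```python
import cmath
import math


def hamming(nw):
    """Hamming window of length nw: 0.54 - 0.46*cos(2*pi*n/(nw-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


def stft_frame(x, tau, nw, n_freq):
    """X(omega_k, tau) at omega_k = 2*pi*k/n_freq for k = 0..n_freq-1.

    The window is assumed to start at sample tau; samples outside x are
    treated as zero.
    """
    w = hamming(nw)
    frame = []
    for k in range(n_freq):
        omega = 2 * math.pi * k / n_freq
        acc = 0j
        for i in range(nw):
            n = tau + i
            if 0 <= n < len(x):
                # windowed segment x[n, tau] = w[n, tau] x[n]
                acc += w[i] * x[n] * cmath.exp(-1j * omega * n)
        frame.append(acc)
    return frame


# A sinusoid with 4 cycles per 32 samples, analysed with a 32-point window
x = [math.cos(2 * math.pi * 4 * n / 32) for n in range(64)]
X = stft_frame(x, tau=16, nw=32, n_freq=32)
magnitudes = [abs(v) for v in X]
```

For this test signal the magnitude is largest at k = 4 (and at its mirror image k = 28), with the main lobe of the window spreading some energy into the neighboring bins.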

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

  \[ S(\omega, \tau) = |X(\omega, \tau)|^2 \]

S(ω, τ) is a 2-D (two-dimensional) representation of the "energy density" of the signal.

For each window position τ one could plot S(ω, τ).
  A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.

This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
  Narrowband – gives good spectral resolution: a good view of the frequency content of sine waves with closely spaced frequencies.
  Wideband – gives good temporal resolution: a good view of the temporal content of impulses closely spaced in time.
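Stacking the squared magnitude of the short-time Fourier transform over successive window positions gives the two-dimensional display described above. The sketch below is a minimal illustration with hypothetical parameter names (`nw` for window length, `hop` for the frame interval, `n_freq` for the size of the frequency grid); it uses a direct sum rather than an FFT to stay dependency-free.

```python
import cmath
import math


def spectrogram(x, nw, hop, n_freq):
    """Return S[frame][k] = |X(omega_k, tau)|^2 with omega_k = 2*pi*k/n_freq.

    Frames start at tau = 0, hop, 2*hop, ...; only the non-negative
    frequencies k = 0..n_freq//2 are kept. A Hamming window is assumed.
    """
    w = [0.54 - 0.46 * math.cos(2 * math.pi * i / (nw - 1)) for i in range(nw)]
    S = []
    for tau in range(0, len(x) - nw + 1, hop):
        row = []
        for k in range(n_freq // 2 + 1):
            omega = 2 * math.pi * k / n_freq
            X = sum(w[i] * x[tau + i] * cmath.exp(-1j * omega * (tau + i))
                    for i in range(nw))
            row.append(abs(X) ** 2)
        S.append(row)
    return S


# A steady sinusoid with 8 cycles per 64 samples
x = [math.cos(2 * math.pi * 8 * n / 64) for n in range(256)]
S = spectrogram(x, nw=64, hop=32, n_freq=64)
```

Each row of `S` is one vertical slice of the display. For this steady tone every row peaks at k = 8, which would appear as a single horizontal line in the time-frequency plane.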

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = \sum_k \delta[n - kP]:

  \[ x[n, \tau] = w[n, \tau]\,\{ (p[n] * g[n]) * h[n] \} \]
  \[ x[n, \tau] = w[n, \tau]\,( p[n] * \tilde{h}[n] ) \]

where the glottal waveform over one cycle and the vocal tract impulse response are combined as \tilde{h}[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

  \[ S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2 \]

where

  \[ \tilde{H}(\omega) = H(\omega)\, G(\omega) \]

and where \omega_k = \frac{2\pi}{P} k, with 2\pi/P the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrogram is in the length of the (analysis) window w[n, τ].

Narrowband Spectrogram
  Uses a "long" window with a duration of typically at least two pitch periods.
  Under the conditions that
    the main lobes of the shifted window Fourier transforms are non-overlapping, and
    the corresponding transform side-lobes are negligible,
  the following approximation follows from the equation on the previous slide (Exercise 3.8):

  \[ S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2 \]
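The effect of window length on harmonic structure can be checked with a toy signal. The sketch below analyses a bare impulse train p[n] (so the combined glottal and vocal tract response is taken to be a unit impulse, which is an assumption made purely for illustration) with a long and a short Hamming window. The function name and parameter values are hypothetical.

```python
import cmath
import math


def hamming(nw):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nw - 1)) for n in range(nw)]


def frame_spectrogram(x, nw, n_freq):
    """S(omega_k, 0) = |X(omega_k, 0)|^2 for a Hamming window starting at n = 0."""
    w = hamming(nw)
    S = []
    for k in range(n_freq):
        omega = 2 * math.pi * k / n_freq
        X = sum(w[n] * x[n] * cmath.exp(-1j * omega * n) for n in range(nw))
        S.append(abs(X) ** 2)
    return S


P = 16                                       # pitch period in samples
p = [1.0 if n % P == 0 else 0.0 for n in range(64)]

# Long window (4 pitch periods): harmonics fall on bins 0, 4, 8, ...
S_long = frame_spectrogram(p, nw=64, n_freq=64)

# Short window (half a pitch period): a single impulse is seen
S_short = frame_spectrogram(p, nw=8, n_freq=64)
```

With the long window the energy at the harmonic bins (multiples of 64/P = 4) is far larger than midway between them, so the harmonic lines are resolved. With the short window the slice is flat across frequency: the harmonic structure is gone and only the envelope (constant in this toy case) remains.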

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  Shortening the window widens its Fourier transform (recall the uncertainty principle).
  Widening the Fourier transform causes neighboring window transforms to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope |\tilde{H}(\omega)| due to the vocal tract and glottal flow contributions.

From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence \tilde{h}[n], where \tilde{h}[n] = g[n] * h[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

  \[ S(\omega, \tau) \approx \kappa\, |\tilde{H}(\omega)|^2\, E[\tau] \]

where \kappa is a constant scale factor and E[\tau] is the energy in the waveform under the sliding window:

  \[ E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2 \]
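The short-time energy under a sliding window can be computed directly from its definition. The sketch below is a minimal illustration: the test signal (a decaying pulse repeated every pitch period), the window length, and the hop size are toy assumptions, and a Hamming window is assumed for w[n, τ].

```python
import math


def short_time_energy(x, nw, hop):
    """E[tau] = sum over n of |w[n, tau] x[n]|^2 for tau = 0, hop, 2*hop, ..."""
    w = [0.54 - 0.46 * math.cos(2 * math.pi * i / (nw - 1)) for i in range(nw)]
    return [sum((w[i] * x[tau + i]) ** 2 for i in range(nw))
            for tau in range(0, len(x) - nw + 1, hop)]


P = 16                                        # pitch period in samples
# Toy "voiced" waveform: an exponentially decaying pulse restarted every period
x = [0.8 ** (n % P) for n in range(64)]

# Window shorter than one pitch period, moved a quarter period at a time
E = short_time_energy(x, nw=8, hop=4)
```

Here `E` rises each time the short window passes over the start of a pitch period and falls in between, repeating every four frames (one pitch period). This fluctuation in energy with window position is what scales the spectral envelope in the wideband approximation.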

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  Shows the formants of the vocal tract in frequency.
  Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.

Figure 3.15

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

April 22 2023 Veton Keumlpuska 87

Elements of a Language Fricatives Spectrogram

Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while

Voiced fricatives often show both noise and harmonics

Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on

quasi-periodic signal

Elements of a Language Whisper Whisper

Forms a class of its own under general category of Consonants

Turbulent flow is produced at the glottis rather than at the vocal tract constriction

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 324 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 323

April 22 2023 Veton Keumlpuska 91

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps. While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$. Here h[n, m] and $h_f[n, m]$ represent the responses at time n to a unit sample applied m samples earlier, at time n - m.

Assuming that the two outputs can be combined linearly, the output can be written using a generalization of the convolution operator:

$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\,u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\,\delta[n - m]$
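The generalized convolution of Example 3.5 can be sketched as a direct double loop. This is only an illustration under stated assumptions: the function names, the "gliding resonator" used for h[n, m] and h_f[n, m], the raised-cosine glottal pulse, and every numeric value are hypothetical and are not taken from the slides.

```python
import math

def time_varying_output(h, u, n_samples, max_lag):
    """x[n] = sum_m h(n, m) * u[n - m], truncated to lags 0..max_lag (causal)."""
    x = []
    for n in range(n_samples):
        acc = 0.0
        for m in range(min(n, max_lag) + 1):
            acc += h(n, m) * u[n - m]
        x.append(acc)
    return x

def make_gliding_resonator(f_start, f_end, glide_len, fs, r=0.97):
    """Assumed h(n, m): a decaying cosine whose frequency depends on the time n."""
    def h(n, m):
        frac = min(n / glide_len, 1.0)
        f = f_start + frac * (f_end - f_start)
        return (r ** m) * math.cos(2.0 * math.pi * f * m / fs)
    return h

def voiced_plosive(n_samples=400, fs=8000, period=80):
    """Return (x, burst_part, h_f) for the two-source time-varying model."""
    # u[n] = g[n] * p[n]; since the pulse (40 samples) is shorter than the
    # period, the convolution reduces to repeating g[n] every `period` samples.
    g = [0.5 * (1.0 - math.cos(2.0 * math.pi * k / 40)) for k in range(40)]
    u = [g[n % period] if n % period < len(g) else 0.0 for n in range(n_samples)]
    burst = [1.0] + [0.0] * (n_samples - 1)               # delta[n]
    h = make_gliding_resonator(300.0, 700.0, 200, fs)     # vocal tract (assumed)
    hf = make_gliding_resonator(2500.0, 2500.0, 200, fs, r=0.9)  # front cavity
    periodic_part = time_varying_output(h, u, n_samples, 150)
    burst_part = time_varying_output(hf, burst, n_samples, 150)
    x = [a + b for a, b in zip(periodic_part, burst_part)]
    return x, burst_part, hf
```

With the impulse burst at n = 0, the second sum collapses to h_f[n, n], which the sketch reproduces. When h(n, m) does not depend on n, the loop reduces to ordinary convolution.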

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
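The idea of moving smoothly between two vowel targets can be sketched as an interpolation of formant frequencies. This is a deliberately crude illustration: real articulatory trajectories are not linear, the function name is hypothetical, and the formant triples in the usage lines are rough, commonly quoted averages for /a/ and /i/ that should be treated as assumptions.

```python
def interpolate_formants(start, end, n_frames):
    """Linearly move each formant (Hz) from the start target to the end target."""
    if n_frames < 2:
        raise ValueError("need at least two frames")
    track = []
    for i in range(n_frames):
        a = i / (n_frames - 1)          # 0.0 at the first target, 1.0 at the second
        track.append(tuple((1.0 - a) * s + a * e for s, e in zip(start, end)))
    return track

# Assumed (F1, F2, F3) targets in Hz, for illustration only.
TARGET_A = (730.0, 1090.0, 2440.0)      # roughly /a/
TARGET_I = (270.0, 2290.0, 3010.0)      # roughly /i/
track = interpolate_formants(TARGET_A, TARGET_I, 5)
```

Each tuple in `track` is one frame of the transition; a glide would use the same idea with fewer frames, reflecting its faster movement.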

Elements of a Language: Transitional Speech Sounds

Semi-Vowels

Two categories of vowel-like sounds:
  • Glides (w as in "we" and y as in "you")
  • Liquids (r as in "read" and l as in "let")

Glides, compared to diphthongs, have:
  • Greater constriction of the oral tract during the transition
  • Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
  • tS as in "chew" (unvoiced)
  • J as in "just" (voiced)

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 51: Speech Processing

Categorization of Sound By Source

  • Voiced: speech sounds generated with a periodic glottal source.
  • Unvoiced: speech sounds not generated with a periodic glottal source.

There are a variety of unvoiced sounds:
  • Fricatives: sounds generated from the friction of moving air against an oral tract constriction. Example: "thin".
  • Plosives: created with an impulsive source within the oral tract. Example: "top".
  • Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these sound sources are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the subcategories above may also exist for voiced sounds. Examples:
  • "zebra" vs. "sheba" (fricatives)
  • "bin" vs. "pin" (plosives)


Categorization of Sound By Source

Spectrographic Analysis of Speech


Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced. This window w[n] is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.

Example: the Hamming window positioned at time $\tau$,

$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right]$ for $0 \le n - \tau \le N_w - 1$, and zero otherwise.

The window typically does not move one sample at a time; rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window position $\tau$.
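The windowing and transform steps can be sketched directly from the definitions. This is a minimal, unoptimized illustration using only the Python standard library; the function names, the uniform frequency grid on [0, pi], and the hop size are assumptions made for the sketch (a practical implementation would use an FFT).

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w (the window formula with tau = 0)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1))
            for n in range(n_w)]

def stft_frame(x, tau, window, n_freqs):
    """X(omega, tau) on n_freqs uniformly spaced frequencies in [0, pi]."""
    out = []
    for k in range(n_freqs):
        omega = math.pi * k / (n_freqs - 1)
        acc = 0j
        for i, w in enumerate(window):
            n = tau + i                       # absolute time index
            acc += w * x[n] * cmath.exp(-1j * omega * n)
        out.append(acc)
    return out

def spectrogram(x, window, hop, n_freqs):
    """Squared STFT magnitude, one row per window position (frame interval = hop)."""
    frames = []
    for tau in range(0, len(x) - len(window) + 1, hop):
        frames.append([abs(v) ** 2 for v in stft_frame(x, tau, window, n_freqs)])
    return frames
```

For example, `spectrogram(x, hamming(64), 32, 33)` slides a 64-sample window in steps of 32 samples and evaluates each frame at 33 frequencies; the window length and hop are the two knobs that trade time resolution against frequency resolution.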

Spectrographic Analysis of Speech

The spectrogram is graphically displayed as

$S(\omega, \tau) = |X(\omega, \tau)|^2$

  • $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position $\tau$ one could plot $S(\omega, \tau)$ as a function of frequency. A better and more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also shows the two kinds of spectrograms:
      Narrowband: gives good spectral resolution, i.e. a good view of the frequency content of sine waves with closely spaced frequencies.
      Wideband: gives good temporal resolution, i.e. a good view of the temporal content of impulses closely spaced in time.


Spectrographic Analysis of Speech


Wide-band Spectrogram


Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \tilde{h}[n])$

where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$

where $\tilde{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n].

Narrowband Spectrogram
  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the following approximation follows from the equation on the previous slide (Exercise 3.8), with $\omega_k = 2\pi k / P$:

$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g. plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms of neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$S(\omega, \tau) \approx \beta\, |\tilde{H}(\omega)|^2\, E[\tau]$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
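The contrast between the two window lengths can be checked on a toy signal. The sketch below is an illustration under stated assumptions, not a reproduction of Figure 3.15: it uses a bare impulse train (no vocal tract or glottal shaping), an assumed 8 kHz sampling rate, and an assumed 5 ms pitch period, so that the 20-ms window (160 samples) spans four periods and the 4-ms window (32 samples) spans less than one.

```python
import cmath
import math

def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1))
            for n in range(n_w)]

def windowed_magnitude(x, tau, window, omega):
    """|X(omega, tau)| for the window starting at sample tau."""
    acc = 0j
    for i, w in enumerate(window):
        n = tau + i
        acc += w * x[n] * cmath.exp(-1j * omega * n)
    return abs(acc)

def harmonic_levels(x, tau, window, period, k=5):
    """Magnitude at the k-th harmonic and midway to the next harmonic."""
    on = windowed_magnitude(x, tau, window, 2.0 * math.pi * k / period)
    off = windowed_magnitude(x, tau, window, 2.0 * math.pi * (k + 0.5) / period)
    return on, off

PERIOD = 40                      # assumed pitch period: 5 ms at 8 kHz (200 Hz)
x = [1.0 if n % PERIOD == 0 else 0.0 for n in range(400)]
narrow = hamming(160)            # 20 ms at the assumed 8 kHz: four pitch periods
wide = hamming(32)               # 4 ms at the assumed 8 kHz: under one period

on_n, off_n = harmonic_levels(x, 0, narrow, PERIOD)   # harmonics resolved
on_w, off_w = harmonic_levels(x, 30, wide, PERIOD)    # harmonics smeared
silent = windowed_magnitude(x, 5, wide, 1.0)          # window between pulses
```

With the long window, the level at a harmonic is far above the level between harmonics (horizontal striations). With the short window, the two levels coincide, and the magnitude instead rises and falls as the window passes over each pulse (vertical striations), dropping to zero when no pulse is under the window.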

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

MATLAB

[Figure: spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s]) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

  • Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.
  • If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
  • If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification of the sounds of English is in terms of:
  • Vowels
  • Consonants
  • Diphthongs
  • Affricates
  • Semi-vowels

This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language

  • In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
  • The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
  • The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
  • Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

  • Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations:
  • the place and degree of constriction of the tongue hump, and
  • the cross-section and length of the vocal tract.
Therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      The spectrum is dominated by the low resonance of the large volume of the nasal cavity.
      The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

  • Source:
      For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
  • System: the location of the constriction by the tongue (at the back, center, or front of the oral tract), as well as at the teeth or lips, determines which sound is produced.
  • Spectrogram: noise-like; energy is concentrated at higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

Voiced fricative: simplified model of the output at the lips.

$x_g[n] = h[n] * (g[n] * p[n])$, where h[n] is the impulse response of the linear time-invariant vocal tract, excited by the periodic signal u[n].

The noise source component is the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\,u[n])$

Example 3.4 (continued)

We assume in this simplified model that the outputs due to the two airflow sources add (* denotes convolution):

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  • u[n] is modified by the oral cavity.
  • x_q[n] can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).

Elements of a Language: Fricatives

Spectrogram
  • Unvoiced fricatives are characterized by a "noisy" spectrum.
  • Voiced fricatives often show both noise and harmonics.

Waveform
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps. While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is

$u[n] = g[n] * p[n]$

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$. Here h[n, m] and $h_f[n, m]$ represent the responses at time n to a unit sample applied m samples earlier, at time n - m.

Assuming that the two outputs can be combined linearly, the output can be written using a generalization of the convolution operator:

$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\,u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\,\delta[n - m]$

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-Vowels

Two categories of vowel-like sounds:
  • Glides (w as in "we" and y as in "you")
  • Liquids (r as in "read" and l as in "let")

Glides, compared to diphthongs, have:
  • Greater constriction of the oral tract during the transition
  • Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
  • tS as in "chew" (unvoiced)
  • J as in "just" (voiced)

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 52: Speech Processing


Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech
  • A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
      • Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
  • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech
  • In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
  • This window w[n, τ] is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  • Example: the Hamming window

$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$

  • The window typically does not move one sample at a time; it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
  • The short-time Fourier transform is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where

$$x[n, \tau] = w[n, \tau]\, x[n]$$

represents the windowed speech segments as a function of the window position τ.
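The windowing and framing described here can be sketched in a few lines of Python. This is a minimal illustration and not code from the slides: the function names `hamming` and `stft`, the hop size, and the number of frequency bins are assumptions chosen for the example. The window start τ is measured in samples, and the transform is evaluated by direct summation on a uniform frequency grid. The phase is referenced to the start of each window rather than to absolute time, which leaves the magnitude unchanged.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (N_w - 1)), 0 <= n <= N_w - 1."""
    if n_w == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def stft(x, n_w, hop, n_freq):
    """Return frames[i][k]: transform of the i-th windowed segment at omega = 2 pi k / n_freq.

    The window start tau advances by `hop` samples per frame (the frame interval).
    """
    w = hamming(n_w)
    frames = []
    for tau in range(0, len(x) - n_w + 1, hop):
        # Windowed segment x[n, tau] = w[n, tau] x[n], indexed from the window start.
        seg = [w[n] * x[tau + n] for n in range(n_w)]
        spectrum = []
        for k in range(n_freq):
            omega = 2.0 * math.pi * k / n_freq
            spectrum.append(sum(seg[n] * cmath.exp(-1j * omega * n) for n in range(n_w)))
        frames.append(spectrum)
    return frames


# Usage: a cosine at omega = 2 pi / 8, analysed with a 16-sample window and a hop of 8.
x = [math.cos(2.0 * math.pi * 0.125 * n) for n in range(64)]
frames = stft(x, n_w=16, hop=8, n_freq=16)
peak_bin = max(range(9), key=lambda k: abs(frames[0][k]))
print(len(frames), peak_bin)
```

A smaller hop gives a higher frame rate at the cost of more transforms; the window length, not the hop, sets the time-frequency resolution.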

Spectrographic Analysis of Speech
  • The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

  • S(ω, τ) is a two-dimensional (2-D) representation of the "energy density" of the signal.
  • For each window position τ one could plot S(ω, τ) as a function of ω. A more compact time-frequency display places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
  • This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
      • Narrowband: gives good spectral resolution, e.g., a good view of the frequency content of sine waves with closely spaced frequencies.
      • Wideband: gives good temporal resolution, e.g., a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
  • Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$

  • Here the glottal waveform over one cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n].
  • From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega)\, G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.
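As a worked step, the harmonic-sum form of the spectrogram follows from the fact that multiplication in time corresponds to convolution in frequency. The sketch below assumes the linear time-invariant, steady-state model stated on this slide, and writes W(ω, τ) for the Fourier transform of w[n, τ]; the sums run over the harmonics that fall within one 2π period.

```latex
% Fourier transform of the impulse train p[n] = \sum_k \delta[n - kP]
P(\omega) = \frac{2\pi}{P} \sum_{k} \delta(\omega - \omega_k),
\qquad \omega_k = \frac{2\pi}{P} k

% Windowing multiplies in time, so the spectra convolve over one period in \theta
X(\omega, \tau)
  = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(\omega - \theta, \tau)\, P(\theta)\, \hat{H}(\theta)\, d\theta
  = \frac{1}{P} \sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau)

% The squared magnitude gives the spectrogram
S(\omega, \tau) = |X(\omega, \tau)|^2
  = \frac{1}{P^2} \left| \sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2
```

Each harmonic therefore contributes a copy of the window transform, centered at ω_k and weighted by the combined glottal and vocal tract response at that frequency.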

Spectrographic Analysis of Speech
  • The difference between the narrowband and wideband spectrograms lies in the length of the analysis window w[n, τ].
  • Narrowband spectrogram
      • Uses a "long" window with a duration of typically at least two pitch periods.
      • Under the conditions that
          • the main lobes of the shifted window Fourier transforms are non-overlapping, and
          • the corresponding transform side lobes are negligible,
        the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).
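A small numerical check can make the idea of resolved harmonics concrete. The sketch below is an illustration under assumed values, not the slides' own experiment: the pitch period of 40 samples, the decaying-cosine stand-in for ĥ[n], and the function names are all hypothetical. It evaluates the spectrogram of a synthetic voiced signal at one window position, once on a harmonic and once midway between two harmonics, using a Hamming window that spans four pitch periods.

```python
import cmath
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def voiced_signal(length, period, decay=0.9, resonance=2.0 * math.pi / 8.0):
    """Impulse train with pitch period `period` convolved with a decaying resonance.

    decay**n * cos(resonance * n) is a hypothetical stand-in for g[n] * h[n].
    """
    x = [0.0] * length
    for start in range(0, length, period):
        for n in range(start, length):
            x[n] += decay ** (n - start) * math.cos(resonance * (n - start))
    return x


def spectrogram_slice(x, tau, n_w, omega):
    """S(omega, tau) = |X(omega, tau)|^2 for a Hamming window starting at sample tau."""
    w = hamming(n_w)
    value = sum(w[n] * x[tau + n] * cmath.exp(-1j * omega * n) for n in range(n_w))
    return abs(value) ** 2


P = 40                          # assumed pitch period in samples
x = voiced_signal(600, P)
long_window = 4 * P             # narrowband: window spans four pitch periods
on_harmonic = spectrogram_slice(x, 200, long_window, 2.0 * math.pi * 5 / P)
between = spectrogram_slice(x, 200, long_window, 2.0 * math.pi * 5.5 / P)
print(on_harmonic / between)    # much greater than 1 when the harmonics are resolved
```

With a window this long, the main lobes centered on neighboring harmonics barely overlap, so the energy midway between harmonics is far below the energy on a harmonic.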

Spectrographic Analysis of Speech
Wideband Spectrogram
  • The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The wider window transforms centered on neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
  • From the temporal perspective, since the window is shorter than a pitch period, it essentially "sees" pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  • For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where β is a constant scale factor and E[τ] is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
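The role of the short-time energy E[τ] can be illustrated numerically. The following sketch uses assumed values throughout (a pitch period of 40 samples, a decaying cosine as a stand-in for ĥ[n], and hypothetical function names). It slides a Hamming window across one pitch period and compares how much the energy under the window fluctuates for a short window (shorter than the pitch period) and a long window (four pitch periods).

```python
import math


def hamming(n_w):
    """Hamming window of length n_w."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_w - 1)) for n in range(n_w)]


def voiced_signal(length, period, decay=0.9, resonance=2.0 * math.pi / 8.0):
    """Impulse train with pitch period `period` convolved with a decaying resonance.

    decay**n * cos(resonance * n) is a hypothetical stand-in for g[n] * h[n].
    """
    x = [0.0] * length
    for start in range(0, length, period):
        for n in range(start, length):
            x[n] += decay ** (n - start) * math.cos(resonance * (n - start))
    return x


def frame_energy(x, tau, n_w):
    """E[tau]: energy of the Hamming-windowed segment starting at sample tau."""
    w = hamming(n_w)
    return sum((w[n] * x[tau + n]) ** 2 for n in range(n_w))


P = 40                                                      # assumed pitch period
x = voiced_signal(600, P)
short = [frame_energy(x, 200 + d, 32) for d in range(P)]    # window shorter than P
long_ = [frame_energy(x, 200 + d, 160) for d in range(P)]   # window spans 4 periods
print(max(short) / min(short))   # large: energy rises and falls within each period
print(max(long_) / min(long_))   # close to 1: the long window averages over periods
```

The strong variation of the short-window energy within each pitch period is what appears as vertical striations in a wideband spectrogram.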

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
  • Shows the formants of the vocal tract in frequency.
  • Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  • The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
  • Figure 3.15 on the next slide compares narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s]) above its spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

Categorization of Speech Sounds
  • A sound source can be created with either the vocal folds or a constriction in the vocal tract.
  • Speech sounds can also be classified from the following perspectives:
      1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
      2. The shape of the vocal tract (place and manner of articulation):
          • the place of the tongue hump along the oral tract, and
          • the degree of constriction of the hump.
          • The shape is also determined by a possible connection to the nasal passage by way of the velum.
      3. The time-domain waveform, which gives the pressure change with time at the lips output.
      4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
  • Phoneme: a fundamental distinctive unit of a language.
      • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
      • Syllables contain one or more phonemes.
      • Words are formed from one or more syllables.
      • Phrases are concatenations of words.
  • Studying speech sounds through the first two descriptors (source and vocal tract shape) is referred to as articulatory phonetics.
  • Studying speech sounds through the last two descriptors (waveform and spectrogram) is referred to as acoustic phonetics.

Elements of a Language
  • One broad classification for English is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
  • This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
  • Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
  • Articulatory features (corresponding to the first two descriptors) include:
      • Vocal fold state: vibrating or open
      • Tongue position and height: front, central, or back along the palate
      • Constriction: partial or complete
      • Velum state: nasal sound or not a nasal sound

Elements of a Language
  • In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
      • 11 in Polynesian
      • 141 in the "click" language of Khoisan
  • The rules of a language define which phones can be strung together, and how, to form words.
      • In Italian, consonants are not normally allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
  • The articulatory properties are influenced by:
      • adjacent phonemes,
      • speaking rate,
      • emphasis in speaking, and
      • the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
      • Example: "butter", "but", and "to", where the t in each word is somewhat different.
  • Motor theory of perception: uses articulatory features from the speech waveform, along with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels
  • Source: quasi-periodic.
      • Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).
  • In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
      • Articulatory differences among speakers are one cause of allophonic variations.
      • The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ across speakers; therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System:
      • The velum is lowered and the air flows mainly through the nasal cavity.
      • Because the oral tract is constricted, the sound is radiated at the nostrils.
      • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
      • Dominated by the low resonance of the large volume of the nasal cavity.
      • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
      • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
      • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
  • There are two broad classes of fricatives: voiced and unvoiced.
  • Source:
      • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
      • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
      • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
      • The constriction is narrower than with vowels.
  • System: the location of the constriction formed by the tongue or lips determines which sound is produced:
      • the back, center, or front of the oral tract, as well as
      • the teeth or lips.
  • Spectrogram: noise-like, with energy concentrated at higher frequencies.

Example 3.4
  • A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

      • g[n] is the glottal flow over one cycle
      • p[n] is an impulse train with pitch period P
  • Simplified model of the voiced fricative output at the lips due to the periodic source:

$$x_g[n] = h[n] * (g[n] * p[n])$$

      • h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n]
  • The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

Example 3.4 (cont.)
  • In the simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

  • See Exercise 3.10 for special characteristics of x[n].
  • Issues that have been ignored:
      • u[n] is modified by the oral cavity.
      • x_q[n] can be influenced by the back cavity.
      • Sources of non-linear effects (distributed sources due to traveling vortices).
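The simplified voiced-fricative model of Example 3.4 can be sketched directly in Python. This is a minimal illustration under stated assumptions: the function names, the Gaussian white-noise generator, the fixed seed, and all numerical values for g[n], h[n], and h_f[n] are hypothetical, and every output is truncated to a common length.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def impulse_train(length, period):
    """p[n] = 1 at multiples of the pitch period, 0 elsewhere."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, h_f, period, length, seed=0):
    """Return (x, x_g, x_q), each truncated to `length` samples."""
    rng = random.Random(seed)
    u = convolve(g, impulse_train(length, period))[:length]   # u = g * p
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]          # white noise q[n]
    x_g = convolve(h, u)[:length]                             # periodic part: h * u
    modulated = [q_n * u_n for q_n, u_n in zip(q, u)]         # q[n] u[n]
    x_q = convolve(h_f, modulated)[:length]                   # noise part: h_f * (q u)
    x = [a + b for a, b in zip(x_g, x_q)]                     # the two outputs add
    return x, x_g, x_q


# Usage with illustrative (made-up) responses.
x, x_g, x_q = voiced_fricative(
    g=[0.0, 0.5, 1.0, 0.5], h=[1.0, 0.6, 0.2], h_f=[1.0, -0.5],
    period=20, length=100,
)
print(len(x))
```

Because the noise is multiplied by u[n] before filtering, the noise component is present only where the glottal flow is nonzero, which is the modulation the example describes.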

Elements of a Language: Fricatives
  • Spectrogram:
      • Unvoiced fricatives are characterized by a "noisy" spectrum.
      • Voiced fricatives often show both noise and harmonics.
  • Waveform:
      • An unvoiced fricative contains only noise.
      • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
  • Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
  • As with fricatives, plosives can be voiced or unvoiced.
  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  • Sequence of events:
      1. Complete closure of the oral tract and buildup of air pressure.
      2. Release of air pressure and generation of turbulence over a very short time duration.
      3. Generation of aspiration due to turbulence at the open vocal folds.
      4. Onset of the following vowel about 40-50 ms after the burst.
  • With voiced plosives, the vocal folds vibrate for the duration of all four steps.
      • While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
      • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
      • There is a much shorter delay between the burst and the voicing of the vowel onset.
  • Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
  • A voiced plosive is generated with a burst source, and a periodic source can also be present throughout the burst and into the following vowel.
  • Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n].
  • The glottal flow velocity model for the periodic source component is

$$u[n] = g[n] * p[n]$$

      • g[n] is the glottal flow over one cycle
      • p[n] is an impulse train with pitch period P
  • Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel.
      • This implies that the vocal tract output cannot be obtained with the convolution operator.
      • The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
  • In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m].
      • h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
  • The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$

  • We have assumed that the two outputs can be linearly combined.
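The generalized convolution of Example 3.5 can be sketched as follows. This is an illustration under explicit assumptions rather than the slides' own code: the time-varying responses are passed in as Python callables `h(n, m)` and `h_f(n, m)`, all signals are taken to be causal (zero for n < 0, so m runs from 0 to n), and the `drifting_resonance` response with its numerical values is purely hypothetical.

```python
import math


def time_varying_output(u, h, h_f):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) delta[n - m].

    h(n, m) and h_f(n, m) give the response at time n to a unit sample applied
    m samples earlier. With causal signals, m runs from 0 to n, and the burst
    term reduces to h_f(n, n) because delta[n - m] is nonzero only at m = n.
    """
    x = []
    for n in range(len(u)):
        periodic_part = sum(h(n, m) * u[n - m] for m in range(n + 1))
        burst_part = h_f(n, n)
        x.append(periodic_part + burst_part)
    return x


def drifting_resonance(n, m):
    """Hypothetical time-varying response: a decaying cosine whose frequency
    drifts with the observation time n (illustrative values only)."""
    omega = 0.2 + 0.002 * n
    return (0.9 ** m) * math.cos(omega * m)


# Usage: a periodic source with pitch period 25, plus a burst at n = 0.
u = [1.0 if n % 25 == 0 else 0.0 for n in range(100)]
x = time_varying_output(u, drifting_resonance, lambda n, m: 0.5 * 0.8 ** m)
print(len(x))
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is a convenient sanity check for any implementation of the time-varying case.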

Elements of a Language: Transitional Speech Sounds
  • Diphthongs
      • Vowel-like in nature, with vibrating vocal folds.
      • Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
      • Characterized by movement from one vowel target to another.
      • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds
  • Semi-vowels: two categories of vowel-like sounds
      • Glides (/w/ as in "we" and /y/ as in "you"), and
      • Liquids (/r/ as in "read" and /l/ as in "let").
  • Glides, compared to diphthongs, have
      • greater constriction of the oral tract during the transition, and
      • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
  • Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • Compared to fricatives, affricates have
      • a fricative portion preceded by a complete constriction of the oral cavity,
      • formed at the same place as for the plosive.
  • Examples:
      • /tS/ as in "chew" (unvoiced)
      • /J/ as in "just" (voiced)

Coarticulation
  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  • Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
  • Coarticulation can occur on different temporal levels:
      • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g., "horse" vs. "horseshoe", "sweep" vs. "seep".
      • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
  • The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
      • Intonation (change in pitch)
      • Amplitude/energy (loudness)
      • Timing (articulation rate or rhythm)
  • These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 53: Speech Processing

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 54

Spectrographic Analysis of Speech Speech waveform consists of a sequence of

different events This time-variation corresponds to highly fluctuating spectral characteristics over time Example of a word ldquotordquo A single Fourier transform of the entire acoustic

signal of the word ldquotordquo cannot capture this time-varying frequency content

In contrast short-time Fourier transform (SFFT) that consists of a separate Fourier transform of pieces of the waveform under a sliding window can capture this temporal variability

April 22 2023 Veton Keumlpuska 55

Spectrographic Analysis of Speech In examples 31 and 32 presented earlier a sliding

(analysis) window concept was introduced This window w[n] is typically tapered at its end (Figure 314) to

avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum

Example - Hamming windoww[n]=054-046cos[2(n-)(Nw-1)] for 0lenleNw-1

Window typically does not necessarily move one sample at a time but rather moves at some frame interval (determines frame rate) consistent with temporal structure one wants to reveal

wherex[n]= w[n]x[n]

represents the windowed speech segments as function of the window center at time

n

njenxX ][)(

April 22 2023 Veton Keumlpuska 56

Spectrographic Analysis of Speech The spectrogram is graphically displayed as

S() = |X()|2

S() ndash is a 2-D (two dimensional) representation of ldquoenergy densityrdquo of the signal

For each window position one could plot S() A better and more compact representation of time-frequency

display of the spectrogram places spectral magnitude measurements vertically in three-dimensional mesh or two-dimensionally with intensity coming out of the page

This display is illustrated (caricature) in Figure 314 This figure also illustrates two kinds of spectrograms

Narrowband ndash it gives good spectral resolution a good view of the frequency content of sine-waves with closely spaced frequencies

Wideband - which gives a good temporal resolution a good view of the temporal context of impulses closely spaced in time

April 22 2023 Veton Keumlpuska 57

Spectrographic Analysis of Speech

April 22 2023 Veton Keumlpuska 58

Wide-band Spectrogram

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the

output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]

x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]

Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as

frequency fundametal theis2 and 2 whereand

)()()(~where

)()(~1)(2

2

Pk

P

GHH

WHP

S

k

kk

k

April 22 2023 Veton Keumlpuska 61

Spectrographic Analysis of Speech Difference of narrowband and wideband

spectrogram is in the length of the (analysis) window w[n]

Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at

least two pitch periods Under the conditions that

The main lobes of shifted window Fourier transforms are non-overlapping and that

Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)

k

kk WHP

S 22

2 )()(~1)(

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech Narrowband Spectrogram (cont)

Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram

Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of SpeechWideband Spectrogram

Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform

(recall the uncertainty principle) Widening of Fourier transform will cause neighboring

harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions

From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)

Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window

][)(~)(2

EHS k

n

nxE 2][][

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather

than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding

through fluctuating energy regions of the speech waveform

Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

April 22 2023 Veton Keumlpuska 66

Figure 315

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

Elements of a Language

• Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
• Articulatory features (corresponding to the first two category descriptors) include:
  - Vocal fold state: vibrating or open
  - Tongue position and height: front, central, or back along the palate
  - Constriction: partial or complete
  - Velum state: nasal sound or not a nasal sound

Elements of a Language

• In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  - 11 in Polynesian
  - 141 in the "click" language of Khoisan
• The rules of a language define which phones can be strung together, and how, to form words.
  - In Italian, consonants are generally not allowed at the end of words.
• A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
• The articulatory properties are influenced by:
  - adjacent phonemes,
  - speaking rate,
  - emphasis in speaking,
  - the time-varying nature of the articulators.
• The variants of sounds, or phones, that convey the same phoneme are called the allophones of that phoneme.
  - Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
• Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

• Source: quasi-periodic.
  - Pitch is not important for categorizing a sound in English. In tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
• System: each vowel phoneme corresponds to a different vocal tract configuration.
• Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
• Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
• In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  - Articulatory differences among speakers are one cause of allophonic variation:
    - the place and degree of constriction of the tongue hump,
    - the cross-section and length of the vocal tract.
  - Consequently, the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

• Source: quasi-periodic airflow puffs from the vibrating vocal folds.
• System:
  - The velum is lowered and the air flows mainly through the nasal cavity.
  - Because the oral tract is constricted, the sound is radiated at the nostrils.
  - Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
• Spectrogram:
  - It is dominated by the low resonance of the large volume of the nasal cavity.
  - The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  - These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  - Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

• There are two broad classes of fricatives: voiced and unvoiced.
• Source:
  - For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  - For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  - Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  - The constriction is narrower than with vowels.
• System: the location of the constriction made by the tongue or lips determines which sound is produced:
  - back, center, or front of the oral tract, as well as
  - the teeth or lips.
• Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4

• A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as
  $$u[n] = g[n] * p[n]$$
  where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.
• Simplified model of the periodic component of the output at the lips:
  $$x_g[n] = h[n] * (g[n] * p[n])$$
  where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.
• The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates the noise $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:
  $$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4

• In the simplified model, the results of the two airflow sources are assumed to add:
  $$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$
• See Exercise 3.10 for special characteristics of $x[n]$.
• Issues that have been ignored:
  - $u[n]$ is modified by the oral cavity,
  - $x_q[n]$ can be influenced by the back cavity,
  - sources of nonlinear effects (distributed sources due to traveling vortices).
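The additive model of Example 3.4 can be sketched numerically. The snippet below is a minimal illustration, not the author's implementation: the glottal pulse `g`, the responses `h` and `h_f`, the pitch period, and the Gaussian noise generator are all hypothetical toy choices made only so the code runs with the Python standard library.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def voiced_fricative(g, h, h_f, period, length, rng):
    """x[n] = h[n]*u[n] + h_f[n]*(q[n] u[n]), truncated to `length` samples."""
    u = convolve(g, impulse_train(length, period))[:length]  # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]         # white noise q[n]
    modulated = [qn * un for qn, un in zip(q, u)]            # product q[n] u[n]
    x_g = convolve(h, u)[:length]                            # periodic component
    x_q = convolve(h_f, modulated)[:length]                  # noise component
    x = [a + b for a, b in zip(x_g, x_q)]
    return u, x_g, x_q, x


# Hypothetical toy values (assumptions, not taken from the slides).
g = [0.0, 0.5, 1.0, 0.5]   # glottal flow over one cycle, shorter than the period
h = [1.0, 0.6, 0.2]        # vocal tract impulse response
h_f = [1.0, -0.5]          # front-cavity impulse response
u, x_g, x_q, x = voiced_fricative(g, h, h_f, period=8, length=32,
                                  rng=random.Random(0))
```

Because the noise is multiplied by `u` before filtering, the noise component appears in bursts that follow the glottal cycle, which is the modulation effect the example describes.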

Elements of a Language: Fricatives

• Spectrogram:
  - Unvoiced fricatives are characterized by a "noisy" spectrum, while
  - voiced fricatives often show both noise and harmonics.
• Waveform:
  - An unvoiced fricative contains only noise.
  - A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

• Whisper forms a class of its own under the general category of consonants.
• Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

• Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
• As with fricatives, plosives can be voiced or unvoiced.
• System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
• Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
• With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  - While the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  - After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  - There is a much shorter delay between the burst and the voicing of the vowel onset.
• Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

• A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
• Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$.
• The glottal flow velocity model for the periodic source component is given by
  $$u[n] = g[n] * p[n]$$
  where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.
• Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during the transition from the burst to the following steady vowel.
  - This implies that the vocal tract output cannot be obtained with the ordinary convolution operator.
  - The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
• In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.
  - $h[n, m]$ and $h_f[n, m]$ represent the responses at time $n$ to a unit sample applied $m$ samples earlier, at time $n - m$.
• The output can then be written using a generalization of the convolution operator:
  $$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$
• We have assumed that the two outputs can be combined linearly.
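The generalized convolution of Example 3.5 can be evaluated directly as a double loop. The sketch below is illustrative only: the two time-varying responses `h` and `h_f` are hypothetical decaying exponentials whose decay rate drifts with $n$, and the excitation is a bare impulse train rather than a realistic glottal flow.

```python
def time_varying_output(h, u, num_samples, max_lag):
    """x[n] = sum_m h(n, m) * u[n - m], taking u[k] = 0 outside 0..len(u)-1."""
    x = []
    for n in range(num_samples):
        total = 0.0
        for m in range(max_lag + 1):
            if 0 <= n - m < len(u):
                total += h(n, m) * u[n - m]
        x.append(total)
    return x


def plosive_output(h, h_f, u, num_samples, max_lag):
    """Periodic component plus burst component for a burst delta[n] at n = 0."""
    delta = [1.0] + [0.0] * (num_samples - 1)
    periodic = time_varying_output(h, u, num_samples, max_lag)
    burst = time_varying_output(h_f, delta, num_samples, max_lag)
    return [a + b for a, b in zip(periodic, burst)]


# Hypothetical time-varying responses (assumptions for illustration only).
def h(n, m):
    return (0.5 + 0.01 * n) ** m    # vocal tract response at time n, lag m


def h_f(n, m):
    return (0.3 + 0.005 * n) ** m   # front-cavity response at time n, lag m


u = [1.0 if n % 8 == 0 else 0.0 for n in range(24)]   # toy periodic excitation
x = plosive_output(h, h_f, u, num_samples=24, max_lag=23)
```

Note that with an impulse at $n = 0$ the burst term collapses to $h_f[n, n]$, and that when `h` does not depend on `n` the first sum reduces to ordinary convolution.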

Elements of a Language: Transitional Speech Sounds

• Diphthongs:
  - Vowel-like in nature, with vibrating vocal folds.
  - Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  - Characterized by movement from one vowel target to another.
  - Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

• Semi-vowels: two categories of vowel-like sounds:
  - Glides (/w/ as in "we" and /y/ as in "you")
  - Liquids (/r/ as in "read" and /l/ as in "let")
• Glides, compared to diphthongs, have:
  - greater constriction of the oral tract during the transition, and
  - greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

• Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
• The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
• Examples:
  - /tS/ as in "chew" (unvoiced)
  - /J/ as in "just" (voiced)

Coarticulation

• Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  - Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  - Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
• Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
• Coarticulation can occur on different temporal levels:
  - Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  - Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

• The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  - intonation (change in pitch),
  - amplitude/energy (loudness),
  - timing (articulation rate or rhythm).
• These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 54: Speech Processing

Spectrographic Analysis of Speech

• The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time.
  - Example: the word "to". A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content.
• In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.

Spectrographic Analysis of Speech

• In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
• This window is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
  - Example: the Hamming window
    $$w[n, \tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$
• The window typically does not move one sample at a time. Rather, it moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
• The short-time Fourier transform is
  $$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$
  where
  $$x[n, \tau] = w[n, \tau]\, x[n]$$
  represents the windowed speech segments as a function of the window position at time $\tau$.
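The windowing step can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names, the 9-sample window, the constant test signal, and the frame interval of 4 samples are assumptions chosen for readability, and the window is written for $\tau = 0$ and then placed at each frame start.

```python
import math


def hamming(num_samples):
    """Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/(Nw-1)) for 0 <= n <= Nw-1."""
    if num_samples < 2:
        raise ValueError("need at least two samples")
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def windowed_segment(x, start, window):
    """x[n, tau] = w[n, tau] * x[n] for the window placed at sample `start`."""
    return [window[i] * x[start + i] for i in range(len(window))]


def frames(x, window, frame_interval):
    """Slide the window over x in steps of `frame_interval` samples."""
    last_start = len(x) - len(window)
    return [windowed_segment(x, s, window)
            for s in range(0, last_start + 1, frame_interval)]


w = hamming(9)
signal = [1.0] * 25                      # toy signal, assumed for illustration
segments = frames(signal, w, frame_interval=4)
```

The taper is visible in the end values of `w` (0.08 at both ends, 1.0 at the center), which is what suppresses the discontinuities at the segment edges.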

Spectrographic Analysis of Speech

• The spectrogram is graphically displayed as
  $$S(\omega, \tau) = |X(\omega, \tau)|^2$$
• $S(\omega, \tau)$ is a two-dimensional (2-D) representation of the "energy density" of the signal.
  - For each window position $\tau$, one could plot $S(\omega, \tau)$.
  - A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
• This display is illustrated (as a caricature) in Figure 3.14, which also illustrates two kinds of spectrograms:
  - Narrowband: gives good spectral resolution, i.e., a good view of the frequency content of sine waves with closely spaced frequencies.
  - Wideband: gives good temporal resolution, i.e., a good view of the temporal content of impulses closely spaced in time.
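A spectrogram in the sense above, the squared magnitude of the Fourier transform of each windowed segment, can be sketched as follows. The code is an assumption-laden toy: it uses a direct $O(N^2)$ DFT instead of an FFT, samples $\omega$ only at the DFT bin frequencies, and analyzes a synthetic cosine rather than speech.

```python
import cmath
import math


def hamming(num_samples):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def dft(segment):
    """Direct O(N^2) discrete Fourier transform of one windowed segment."""
    n_points = len(segment)
    return [sum(segment[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]


def spectrogram(x, window_length, frame_interval):
    """Return S[frame][k] = |X(omega_k, tau)|^2 for each window position tau."""
    window = hamming(window_length)
    result = []
    for start in range(0, len(x) - window_length + 1, frame_interval):
        segment = [window[i] * x[start + i] for i in range(window_length)]
        spectrum = dft(segment)
        # Keep only the non-negative frequencies (bins 0 .. N/2).
        result.append([abs(c) ** 2 for c in spectrum[:window_length // 2 + 1]])
    return result


# Toy input: a cosine that falls exactly on DFT bin 4 of a 32-point window.
x = [math.cos(2.0 * math.pi * 4 * n / 32) for n in range(128)]
S = spectrogram(x, window_length=32, frame_interval=16)
peak_bins = [frame.index(max(frame)) for frame in S]
```

Each row of `S` is one vertical slice of the time-frequency display. For this stationary input every slice peaks at the same bin, whereas speech would show the peaks moving from frame to frame.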

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

• Note that for voiced speech the waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:
  $$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\} = w[n, \tau]\,(p[n] * \hat{h}[n])$$
  where the glottal waveform over one cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$.
• From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as
  $$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$
  where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech

• The difference between the narrowband and wideband spectrograms is the length of the (analysis) window $w[n]$.
• Narrowband spectrogram:
  - Uses a "long" window with a duration of typically at least two pitch periods.
  - Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side lobes are negligible, the equation on the previous slide gives the following approximation (Exercise 3.8):
    $$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech

Narrowband Spectrogram (cont.)

• Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
• The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech

Wideband Spectrogram

• The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  - Shortening the window widens its Fourier transform (recall the uncertainty principle).
  - The widened window transforms centered at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
• From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)

• For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9)
  $$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$
  where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:
  $$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$

Spectrographic Analysis of Speech

Wideband Spectrogram (cont.)

• Shows the formants of the vocal tract in frequency.
• Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
  - The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.
• Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
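The 20-ms and 4-ms window choices can be checked against the rule of thumb stated on the earlier slides (narrowband: at least two pitch periods under the window; wideband: less than one). The sketch below assumes a 16 kHz sampling rate and a 125 Hz pitch; both values are hypothetical, as is the "intermediate" label for windows that fall between the two cases.

```python
def window_samples(duration_s, sample_rate_hz):
    """Window length in samples for a given duration in seconds."""
    return int(round(duration_s * sample_rate_hz))


def pitch_periods_covered(duration_s, pitch_hz):
    """Number of pitch periods that fit under the window."""
    return duration_s * pitch_hz


def spectrogram_kind(duration_s, pitch_hz):
    """Classify a window length by the rule of thumb from the slides."""
    periods = pitch_periods_covered(duration_s, pitch_hz)
    if periods >= 2.0:
        return "narrowband"
    if periods < 1.0:
        return "wideband"
    return "intermediate"


SAMPLE_RATE_HZ = 16000   # assumed sampling rate
PITCH_HZ = 125.0         # assumed pitch (8-ms pitch period)

narrow = (window_samples(0.020, SAMPLE_RATE_HZ), spectrogram_kind(0.020, PITCH_HZ))
wide = (window_samples(0.004, SAMPLE_RATE_HZ), spectrogram_kind(0.004, PITCH_HZ))
```

Under these assumptions the 20-ms window spans 2.5 pitch periods and the 4-ms window spans half of one, so the same fixed window durations could change category for a speaker with a much higher or lower pitch.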

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram, titled "SPECTROGRAM"; x-axis Time [s] from 0 to 3, y-axis Frequency [Hz] from 0 to 8000]

MATLAB

[Figure: waveform (y-axis Normalized Magnitude, x-axis Time [s] from 0 to 3) above a spectrogram (y-axis Frequency [Hz] from 0 to 8000) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]
April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

April 22 2023 Veton Keumlpuska 87

Elements of a Language Fricatives Spectrogram

Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while

Voiced fricatives often show both noise and harmonics

Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on

quasi-periodic signal

Elements of a Language Whisper Whisper

Forms a class of its own under general category of Consonants

Turbulent flow is produced at the glottis rather than at the vocal tract constriction

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 324 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 323

April 22 2023 Veton Keumlpuska 91

Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however

brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced

System Constriction can occur at

Front Center or Back of the oral tract (Figure 324)

Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst

With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no

aspiration There is much shorter delay between the burst and the voicing of the vowel onset

Figure 326 compares voicedunvoiced plosive pair

April 22 2023 Veton Keumlpuska 92

Elements of a Language Plosives Waveform

April 22 2023 Veton Keumlpuska 93

Elements of a Language Plosives Spectrogram

April 22 2023 Veton Keumlpuska 94

Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive

Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept

introduced in Chapter 2

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]

h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m

The output then can be written using generalization of the convolution operator

We have assumed that two outputs can be linearly combined

mnmnhmnumnhnxm

fm

April 22 2023 Veton Keumlpuska 95

Elements of a Language Transitional Speech Sounds Diphthongs

Vowel like nature with vibrating vocal folds

Do not have a steady vocal tract configuration They are produced by

varying in time the vocal tract smoothly between two vowel configurations

Characterized by movement from one vowel target to another

hide Y out W boy O new JU

Elements of a Language: Transitional Speech Sounds
Semi-Vowels
Two categories of vowel-like sounds:
• Glides (w as in “we” and y as in “you”), and
• Liquids (r as in “read” and l as in “let”).

Glides, compared to diphthongs, have:
• Greater constriction of the oral tract during the transition, and
• Greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
• tS as in “chew” - unvoiced
• J as in “just” - voiced

Coarticulation
Vocal fold/vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
• Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
• Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
• Local - articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: “horse” vs. “horseshoe”, “sweep” vs. “seep”.
• Global - articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
• Intonation (change in pitch)
• Amplitude/Energy (loudness)
• Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of “Shop”
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives “Drop”
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives “NASA”
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation

Spectrographic Analysis of Speech
In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced.
This window, $w[n, \tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum.
Example - Hamming window:

$$w[n, \tau] = 0.54 - 0.46 \cos\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.
The Fourier transform of the windowed speech segment is

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau]\, e^{-j\omega n}$$

where $x[n, \tau] = w[n, \tau]\, x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.
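A sliding Hamming window and its windowed Fourier transform can be sketched as follows. This is a minimal illustration, not the slides' own code: the sampling rate, the 1 kHz test tone, the window length, the hop size, and the function names `hamming` and `stft` are assumptions chosen for the example.

```python
import numpy as np

def hamming(Nw):
    """Hamming window w[n] = 0.54 - 0.46 cos(2 pi n / (Nw - 1)), 0 <= n <= Nw - 1."""
    n = np.arange(Nw)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (Nw - 1))

def stft(x, Nw, hop, nfft):
    """Windowed DFT of x at window start positions tau = 0, hop, 2*hop, ..."""
    w = hamming(Nw)
    starts = np.arange(0, len(x) - Nw + 1, hop)
    frames = np.stack([w * x[s:s + Nw] for s in starts])
    # Each row is referenced to its own window start, so it differs from
    # X(omega, tau) only by a linear phase factor; the magnitudes agree.
    return starts, np.fft.rfft(frames, n=nfft, axis=1)

fs = 8000                                # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                   # one second of samples
x = np.cos(2 * np.pi * 1000 * t)         # 1 kHz test tone (assumed input)
starts, X = stft(x, Nw=200, hop=80, nfft=512)
peak_bins = np.abs(X).argmax(axis=1)     # strongest frequency bin in each frame
peak_hz = peak_bins * fs / 512           # the same peaks expressed in Hz
```

The hop size plays the role of the frame interval: a hop of 80 samples at 8 kHz corresponds to a frame rate of 100 frames per second.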

Spectrographic Analysis of Speech
The spectrogram is graphically displayed as

$$S(\omega, \tau) = |X(\omega, \tau)|^2$$

$S(\omega, \tau)$ is a 2-D (two-dimensional) representation of the “energy density” of the signal.
For each window position $\tau$, one could plot $S(\omega, \tau)$.
A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page.
This display is illustrated (as a caricature) in Figure 3.14. The figure also illustrates two kinds of spectrograms:
• Narrowband - gives good spectral resolution: a good view of the frequency content of sine waves with closely spaced frequencies.
• Wideband - gives good temporal resolution: a good view of the temporal content of impulses closely spaced in time.

Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech
Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n, \tau] = w[n, \tau]\,\{(p[n] * g[n]) * h[n]\}$$
$$x[n, \tau] = w[n, \tau]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\hat{h}[n] = g[n] * h[n]$. From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, and where $\omega_k = \frac{2\pi}{P} k$ and $\frac{2\pi}{P}$ is the fundamental frequency.

Spectrographic Analysis of Speech
The difference between the narrowband and wideband spectrograms lies in the length of the (analysis) window $w[n, \tau]$.

Narrowband Spectrogram
Uses a “long” window with a duration of typically at least two pitch periods.
Under the conditions that
• the main lobes of the shifted window Fourier transforms are non-overlapping, and
• the corresponding transform side lobes are negligible,
the following approximation follows from the equation on the previous slide (Exercise 3.8):

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k, \tau)|^2$$

Spectrographic Analysis of Speech
Narrowband Spectrogram (cont.)
• Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.
• The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).

Spectrographic Analysis of Speech
Wideband Spectrogram
• The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
• Shortening the window widens its Fourier transform (recall the uncertainty principle).
• The widened window transforms at neighboring harmonics overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
• From the temporal perspective, since the window length is less than a pitch period, the window “sees” essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, \tau) \approx \beta\, |\hat{H}(\omega)|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and where $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} |x[n, \tau]|^2$$
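The role of the sliding-window energy can be illustrated with a short sketch. It is a simplification under stated assumptions: a rectangular window replaces the tapered one, the waveform is a synthetic damped resonance repeated every pitch period, and the pitch period `P` and the name `short_time_energy` are hypothetical.

```python
import numpy as np

def short_time_energy(x, Nw, hop=1):
    """E[tau] = sum of x[n]^2 over a rectangular window of length Nw starting at tau."""
    starts = np.arange(0, len(x) - Nw + 1, hop)
    return np.array([np.sum(x[s:s + Nw] ** 2) for s in starts])

P = 80                                   # pitch period in samples (assumed: 100 Hz at 8 kHz)
k = np.arange(P)
cycle = 0.95 ** k * np.cos(2 * np.pi * 0.1 * k)   # one damped resonance per glottal cycle
x = np.tile(cycle, 10)                   # steady-state voiced-like waveform with period P

E_short = short_time_energy(x, Nw=P // 4)   # window shorter than one pitch period
E_long = short_time_energy(x, Nw=2 * P)     # window spanning two pitch periods
```

With the short window, `E_short` rises and falls once per pitch period, which is the energy fluctuation that scales the wideband spectrogram in time. With a window spanning whole pitch periods, `E_long` stays flat.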

Spectrographic Analysis of Speech
Wideband Spectrogram (cont.)
• Shows the formants of the vocal tract in frequency.
• Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
• The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
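The narrowband/wideband contrast can be reproduced on a synthetic signal. The sketch below is illustrative only: the 8 kHz sampling rate, the 200 Hz fundamental, the damped 1 kHz resonance standing in for a voiced sound, and the contrast and ripple measures are assumptions made for this example. The 20-ms and 4-ms Hamming window lengths follow the text.

```python
import numpy as np

def spectrogram(x, Nw, hop, nfft):
    """S(omega, tau) = |X(omega, tau)|^2 with a Hamming window of Nw samples."""
    w = np.hamming(Nw)
    starts = np.arange(0, len(x) - Nw + 1, hop)
    frames = np.stack([w * x[s:s + Nw] for s in starts])
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2   # rows: time, columns: frequency

fs = 8000                      # sampling rate in Hz (assumed)
P = 40                         # pitch period in samples, i.e. a 200 Hz fundamental (assumed)
k = np.arange(P)
cycle = 0.9 ** k * np.cos(2 * np.pi * 0.125 * k)   # damped 1 kHz resonance, one per cycle
x = np.tile(cycle, 50)         # steady-state voiced-like waveform, 0.25 s long

nfft = 800                     # 10 Hz per bin, so harmonics fall on bins 20, 40, ...
S_nb = spectrogram(x, Nw=160, hop=4, nfft=nfft)   # 20 ms window: four pitch periods
S_wb = spectrogram(x, Nw=32, hop=4, nfft=nfft)    # 4 ms window: under one pitch period

harmonic_bins = np.arange(20, 400, 20)       # multiples of 200 Hz
between_bins = harmonic_bins - 10            # midway between adjacent harmonics
nb_harmonic_contrast = S_nb[:, harmonic_bins].mean() / S_nb[:, between_bins].mean()
wb_harmonic_contrast = S_wb[:, harmonic_bins].mean() / S_wb[:, between_bins].mean()

frame_energy_nb = S_nb.sum(axis=1)           # energy per frame over all frequencies
frame_energy_wb = S_wb.sum(axis=1)
nb_time_ripple = frame_energy_nb.max() / frame_energy_nb.min()
wb_time_ripple = frame_energy_wb.max() / frame_energy_wb.min()
```

A large `nb_harmonic_contrast` corresponds to resolved horizontal harmonic lines. A large `wb_time_ripple` corresponds to vertical striations at the pitch period.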

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB
[Figure: waveform (Normalized Magnitude, -1.5 to 2) above its spectrogram (Frequency [Hz], 0 to 8000), both against Time [s], 0 to 3, for the utterance “She had your dark suit in greasy wash water all year”. Word and phone labels: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr); the label h marks the start and end of the utterance.]

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB
[Figure: waveform and spectrogram of the same utterance, “She had your dark suit in greasy wash water all year”, with the same axes and the same word and phone labels as above.]

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Classification of speech sounds can also be done from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   • the place of the tongue hump along the oral tract, and
   • the degree of constriction of the hump.
   The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme - a fundamental distinctive unit of a language.
• To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
• Different languages contain different phoneme sets.
• Syllables contain one or more phonemes.
• Words are formed from one or more syllables.
• Phrases are concatenations of words.

If the first two factors (the nature of the source and the shape of the vocal tract) are used to study speech sounds, this is referred to as articulatory phonetics.
If the last two descriptors (the time-domain waveform and the time-varying spectral characteristics) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of:
• Vowels
• Consonants
• Diphthongs
• Affricates
• Semi-vowels

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
• Vocal fold state: vibrating or open
• Tongue position and height: front, central, or back along the palate
• Constriction: partial or complete
• Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
• 11 in Polynesian
• 141 in the “click” language of Khoisan

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).

The articulatory properties are influenced by:
• adjacent phonemes,
• speaking rate,
• emphasis in speaking, and
• the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: “butter”, “but”, and “to”, where the t in each word is somewhat different.

Motor theory of perception - uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels
Vowels
• Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in Mandarin Chinese some sounds are interpreted based on the pitch (tonal languages).
• System: each vowel phoneme corresponds to a different vocal tract configuration.
• Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
• Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, on the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
• Articulatory differences among speakers are one cause of allophonic variations.
• The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, vary with the speaker; therefore the vocal tract formants will vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals
Nasals
• Source: quasi-periodic airflow puffs from the vibrating vocal folds.
• System:
   • The velum is lowered and the air flows mainly through the nasal cavity.
   • Because the oral tract is constricted, the sound is radiated at the nostrils.
   • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
• Spectrogram:
   • Is dominated by the low resonance of the large volume of the nasal cavity.
   • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
   • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
   • The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives
There are two broad classes of fricatives: voiced and unvoiced.
• Source:
   • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
   • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
   • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.
• System: the location of the constriction by the tongue determines which sound is produced: at the back, center, or front of the oral tract, as well as at the teeth or lips.
• Spectrogram: noise-like; energy is concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Voiced fricative: simplified model of the output at the lips due to the periodic source,

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$.

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

Example 3.4
We assume in the simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
• $u[n]$ is modified by the oral cavity.
• $x_q[n]$ can be influenced by the back cavity.
• Sources of non-linear effects (distributed sources due to traveling vortices).
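The simplified voiced-fricative model of Example 3.4 can be sketched as a small synthesis routine. Everything numeric here is an assumption for illustration: the Hann-shaped glottal pulse, the two damped-cosine impulse responses standing in for $h[n]$ and $h_f[n]$, the pitch period, and the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 800, 80                       # number of samples and pitch period (assumed values)

# Periodic glottal flow u[n] = g[n] * p[n]; a Hann pulse stands in for g[n].
g = np.hanning(30)
p = np.zeros(N)
p[::P] = 1.0
u = np.convolve(p, g)[:N]

# Hypothetical impulse responses: full vocal tract h[n] and front cavity hf[n].
k = np.arange(60)
h = 0.95 ** k * np.cos(2 * np.pi * 0.06 * k)
hf = 0.70 ** k * np.cos(2 * np.pi * 0.35 * k)

q = rng.standard_normal(N)           # white-noise turbulence source q[n]

x_g = np.convolve(h, u)[:N]          # x_g[n] = h[n] * u[n]
x_q = np.convolve(hf, q * u)[:N]     # x_q[n] = hf[n] * (q[n] u[n])
x = x_g + x_q                        # the two contributions are assumed to add
```

Because the noise is multiplied by $u[n]$ before filtering, the noisy component is gated by the glottal cycle: it vanishes wherever the glottal flow is zero, while the periodic component repeats every $P$ samples.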

Elements of a Language: Fricatives
• Spectrogram:
   • Unvoiced fricatives are characterized by a “noisy” spectrum, while
   • voiced fricatives often show both noise and harmonics.
• Waveform:
   • An unvoiced fricative contains only noise.
   • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
Whisper
• Forms a class of its own under the general category of consonants.
• Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
• System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
• Sequence of events:
   1. Complete closure of the oral tract and buildup of air pressure.
   2. Release of air pressure and generation of turbulence over a very short time duration.
   3. Generation of aspiration due to turbulence at the open vocal folds.
   4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
• During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.
• After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
• There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives
Waveform

Elements of a Language: Plosives
Spectrogram


April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds

• A sound source can be created with either the vocal folds or a constriction in the vocal tract.
• Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
     • the place of the tongue hump along the oral tract,
     • the degree of constriction of the hump, and
     • a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

• Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
• Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.
• Studying speech sounds through the first two descriptors (source and vocal tract shape) is referred to as articulatory phonetics.
• Studying speech sounds through the last two descriptors (waveform and spectrogram) is referred to as acoustic phonetics.

Elements of a Language

• One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
• This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17


Elements of a Language

• Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
• Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language

• In English, the combinations of features yield 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the "click" language of Khoisan
• The rules of a language define which phones can be strung together, and how, to form words.
  • In Italian, consonants are not allowed at the end of words.
• A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
• The articulatory properties are influenced by:
  • adjacent phonemes,
  • speaking rate,
  • emphasis in speaking, and
  • the time-varying nature of the articulators.
• The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
  • Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
• Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds of a language.

Elements of a Language: Vowels

• Source: quasi-periodic.
  • Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
• System: each vowel phoneme corresponds to a different vocal tract configuration.
• Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
• Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).
• In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
  • Articulatory differences among speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker.
  • Consequently, the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

• Source: quasi-periodic airflow puffs from the vibrating vocal folds.
• System:
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
• Spectrogram:
  • Dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • The anti-resonances of the oral tract tend to lie beyond the low resonance of the nasal tract.
  • Consequently, nasals have very low energy in the high-frequency range.
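To get a feel for where the oral-cavity anti-resonances fall, one common idealization (an assumption here, not something stated on the slide) treats the closed oral cavity as a uniform lossless tube closed at its far end. Such a side branch has zeros at odd multiples of c / (4L). The side-branch lengths below are hypothetical round numbers, chosen only to show the trend that a constriction farther back (a shorter branch) pushes the anti-resonance higher.

```python
SPEED_OF_SOUND = 350.0  # m/s, approximate value for warm moist air (assumed)


def side_branch_antiresonances(length_m, count=2):
    """First `count` zero frequencies (Hz) of a uniform tube closed at the far end.

    Quarter-wavelength rule: f_k = (2k + 1) * c / (4 * L), k = 0, 1, ...
    """
    return [(2 * k + 1) * SPEED_OF_SOUND / (4.0 * length_m)
            for k in range(count)]


# Hypothetical side-branch lengths (metres) for constrictions at the lips,
# the alveolar ridge, and the velum.
oral_cavity_lengths_m = {"m": 0.08, "n": 0.05, "ng": 0.03}

for nasal, length in oral_cavity_lengths_m.items():
    zeros = side_branch_antiresonances(length)
    print(nasal, [round(f) for f in zeros])
```

Under these assumptions even the longest side branch places its first anti-resonance near 1 kHz, above the low nasal resonance described above.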

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

• There are two broad classes of fricatives: voiced and unvoiced.
• Source:
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  • The constriction is narrower than with vowels.
• System: the location of the constriction made by the tongue or lips determines which sound is produced:
  • the back, center, or front of the oral tract, as well as
  • the teeth or lips.
• Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

In a simplified model, the periodic component of the voiced fricative output at the lips is

x_g[n] = h[n] * (g[n] * p[n])

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is the turbulent airflow velocity source at the constriction, denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates the noise q[n], and the product in turn excites the front oral cavity, which has impulse response h_f[n]:

x_q[n] = h_f[n] * (q[n] u[n])

Example 3.4 (cont.)

In the simplified model, the outputs due to the two airflow sources are assumed to add:

x[n] = x_g[n] + x_q[n]
     = h[n] * u[n] + h_f[n] * (q[n] u[n])

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
• u[n] is modified by the oral cavity.
• x_q[n] can be influenced by the back cavity.
• Sources of nonlinear effects (distributed sources due to traveling vortices).
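The simplified voiced-fricative model can be simulated directly. The sketch below is only an illustration of the two-source structure: the glottal pulse shape, both impulse responses, the pitch period, and the function names are made-up stand-ins, not values from the slides.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def voiced_fricative(num_samples, period, g, h, h_f, seed=0):
    """x[n] = h[n]*u[n] + h_f[n]*(q[n] u[n]), with u[n] = g[n]*p[n]."""
    rng = random.Random(seed)
    p = [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]
    u = convolve(g, p)[:num_samples]                       # periodic glottal flow
    q = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]  # white noise source
    modulated = [qn * un for qn, un in zip(q, u)]          # q[n] u[n]
    x_g = convolve(h, u)[:num_samples]                     # periodic component
    x_q = convolve(h_f, modulated)[:num_samples]           # noise component
    x = [a + b for a, b in zip(x_g, x_q)]
    return x, x_g, x_q, u


# Illustrative components: a raised-cosine glottal pulse and two short decaying
# resonances standing in for the vocal tract and the front cavity.
g = [0.5 * (1 - math.cos(2 * math.pi * n / 20)) for n in range(20)]
h = [math.exp(-0.05 * n) * math.cos(0.3 * n) for n in range(60)]
h_f = [math.exp(-0.3 * n) * math.cos(2.5 * n) for n in range(20)]

x, x_g, x_q, u = voiced_fricative(400, period=80, g=g, h=h, h_f=h_f)
print("samples:", len(x), "peak |x_q|:", max(abs(v) for v in x_q))
```

Because the noise is multiplied by u[n], the noise component in this sketch appears in bursts synchronized with the glottal pulses and vanishes while the modeled glottal flow is zero.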

Elements of a Language: Fricatives

• Spectrogram:
  • Unvoiced fricatives are characterized by a "noisy" spectrum.
  • Voiced fricatives often show both noise and harmonics.
• Waveform:
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

• Whisper forms a class of its own under the general category of consonants.
• Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

• Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
• As with fricatives, plosives can be voiced or unvoiced.
• System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
• Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
• With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • The delay between the burst and the voicing onset of the vowel is much shorter.
• Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

• Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to the following steady vowel.
• This implies that the vocal tract output cannot be obtained with the ordinary convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
• In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m].
• h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
• The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$$

• We have assumed that the two outputs can be linearly combined.
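The generalized convolution of Example 3.5 can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the time-varying responses are passed as callables h(n, m), responses are truncated after a hypothetical `max_lag` samples, and the gliding-resonance responses are invented stand-ins for a real vocal tract.

```python
import math


def time_varying_output(u, h, h_f, max_lag):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m h_f(n, m) delta[n-m].

    h(n, m) and h_f(n, m) give the response at time n to a unit sample applied
    m samples earlier. Because the burst is delta[n], the second sum collapses
    to the single term h_f(n, n). Responses are truncated at max_lag samples.
    """
    x = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m] for m in range(min(n, max_lag) + 1))
        burst = h_f(n, n) if n <= max_lag else 0.0
        x.append(periodic + burst)
    return x


def make_gliding_resonator(f_start, f_end, glide_len, decay, fs=8000):
    """Hypothetical response whose resonance glides linearly with output time n."""
    def h(n, m):
        frac = min(n / glide_len, 1.0)
        f = f_start + frac * (f_end - f_start)
        return math.exp(-decay * m) * math.cos(2 * math.pi * f * m / fs)
    return h


P = 80                                                   # pitch period (assumed)
g = [0.5 * (1 - math.cos(2 * math.pi * k / 20)) for k in range(20)]
u = [g[n % P] if n % P < len(g) else 0.0 for n in range(400)]  # u = g * p

h = make_gliding_resonator(300.0, 700.0, glide_len=200, decay=0.03)
h_f = make_gliding_resonator(2500.0, 2000.0, glide_len=200, decay=0.1)

x = time_varying_output(u, h, h_f, max_lag=100)
print("first samples:", [round(v, 3) for v in x[:4]])
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which is a convenient sanity check for an implementation like this.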

Elements of a Language: Transitional Speech Sounds

Diphthongs
• Vowel-like in nature, with vibrating vocal folds.
• Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
• Characterized by movement from one vowel target to another.
• Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
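The movement from one vowel target to another can be pictured as a trajectory of formant frequencies. The sketch below is a deliberately crude illustration: it assumes rough textbook-style formant values for an /a/-like and an /i/-like vowel and a straight-line path between them, whereas real diphthong trajectories are smoother and speaker dependent. The table and function names are hypothetical.

```python
# Rough formant targets (F1, F2, F3) in Hz; illustrative values only.
VOWEL_TARGETS = {
    "a": (730.0, 1090.0, 2440.0),
    "i": (270.0, 2290.0, 3010.0),
}


def diphthong_trajectory(start, end, num_frames):
    """Linearly interpolate formant targets from one vowel to another."""
    f_start, f_end = VOWEL_TARGETS[start], VOWEL_TARGETS[end]
    frames = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)  # 0 at the first target, 1 at the second
        frames.append(tuple((1 - alpha) * a + alpha * b
                            for a, b in zip(f_start, f_end)))
    return frames


# A diphthong like the one in "hide" moves from an /a/-like toward an /i/-like target.
for frame in diphthong_trajectory("a", "i", num_frames=5):
    print([round(f) for f in frame])
```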

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
• Glides (/w/ as in "we" and /y/ as in "you")
• Liquids (/r/ as in "read" and /l/ as in "let")

Glides, compared to diphthongs, have
• greater constriction of the oral tract during the transition, and
• greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
• Compared to fricatives, affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
• Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).

Coarticulation

• The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
• Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
• Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
• Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
• Coarticulation can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe"; "sweep" vs. "seep").
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
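The idea that a target is often never reached can be illustrated with a toy model. The sketch below assumes a single abstract articulator position that closes a fixed fraction of the remaining distance to the current target in each frame (a first-order lag). This is a hypothetical simplification that captures only the influence of past positions, not the anticipation of upcoming phonemes.

```python
def articulator_track(targets, frames_per_phone, rate):
    """First-order lag: each frame covers a fraction `rate` of the distance
    remaining to the current phone's target position."""
    position = targets[0]
    track = []
    for target in targets:
        for _ in range(frames_per_phone):
            position += rate * (target - position)
            track.append(position)
    return track


targets = [0.0, 1.0, 0.0]  # abstract target positions for three phones

slow = articulator_track(targets, frames_per_phone=20, rate=0.2)
fast = articulator_track(targets, frames_per_phone=5, rate=0.2)

print(f"peak position, slow speech: {max(slow):.3f}")
print(f"peak position, fast speech: {max(fast):.3f}")
```

At the faster speaking rate the articulator turns back toward the next target before reaching the middle one, so the peak falls short of 1.0.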

Prosody: The Melody of Speech

• The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • intonation (change in pitch),
  • amplitude/energy (loudness), and
  • timing (articulation rate or rhythm).
• These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Spectrographic Analysis of Speech

Wide-band Spectrogram

Narrow-band Spectrogram

Spectrographic Analysis of Speech

Note that, for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n - kP]. The windowed waveform is

x[n] = w[n] {(p[n] * g[n]) * h[n]}
     = w[n] (p[n] * ĥ[n])

where the glottal waveform over one cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] * h[n], and * denotes convolution. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega, \tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $W(\omega, \tau)$ is the Fourier transform of the analysis window positioned at time τ, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency.
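The step that folds the glottal pulse and the vocal tract response into a single sequence ĥ[n] = g[n] * h[n] relies on the associativity of convolution. The short check below uses made-up short sequences for g[n] and h[n] and a hypothetical pitch period of 8 samples; it also lists the harmonic frequencies ω_k = 2πk/P for that period.

```python
import math


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


P = 8                                                  # pitch period (assumed)
p = [1.0 if n % P == 0 else 0.0 for n in range(40)]    # impulse train p[n]
g = [0.2, 0.6, 1.0, 0.4]                               # toy glottal pulse g[n]
h = [1.0, 0.5, 0.25]                                   # toy vocal tract response h[n]

h_hat = convolve(g, h)                 # combined response g[n] * h[n]
left = convolve(convolve(p, g), h)     # (p * g) * h
right = convolve(p, h_hat)             # p * (g * h)

max_diff = max(abs(a - b) for a, b in zip(left, right))
harmonics = [2 * math.pi * k / P for k in range(P // 2 + 1)]  # omega_k in rad/sample

print("max difference between the two orderings:", max_diff)
print("harmonic frequencies (rad/sample):", [round(w, 3) for w in harmonics])
```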

Spectrographic Analysis of Speech

• The difference between the narrowband and wideband spectrograms is the length of the analysis window w[n].
• The narrowband spectrogram uses a "long" window, with a duration of typically at least two pitch periods.
• Under the conditions that
  • the main lobes of the shifted window Fourier transforms do not overlap, and
  • the corresponding transform side lobes are negligible,
  the equation on the previous slide leads to the following approximation (Exercise 3.8), with ω_k = 2πk/P:

$$S(\omega, \tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2 \, |W(\omega - \omega_k, \tau)|^2$$



Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a

target state or shape often the target is never reached Our speech anatomy cannot move to a desired position

instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and

graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level

Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo

Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes

April 22 2023 Veton Keumlpuska 100

Prosody The Melody of Speech Prosody of a language is defined by the

rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)

These rules are followed to convey different Meaning Stress and Emotion

April 22 2023 Veton Keumlpuska 101

Figure 329 - Prosody

April 22 2023 Veton Keumlpuska 102

Figure 330 ndash Global Coarticulation


April 22 2023 Veton Këpuska 58

Wide-band Spectrogram


Narrow-band Spectrogram

Spectrographic Analysis of Speech

For voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train p[n] = Σ_k δ[n − kP]. With analysis window w[n] and ∗ denoting convolution:

$$x[n] = w[n]\,\{(p[n] * g[n]) * h[n]\} = w[n]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over one cycle and the vocal tract impulse response are combined as ĥ[n] = g[n] ∗ h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega) = \frac{1}{P^2}\left|\sum_{k} \hat{H}(\omega_k)\, W(\omega - \omega_k)\right|^2$$

where Ĥ(ω) = H(ω)G(ω), ω_k = 2πk/P, and 2π/P is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the analysis window w[n].

Narrowband Spectrogram
• Uses a “long” window, typically with a duration of at least two pitch periods.
• Provided that the main lobes of the shifted window Fourier transforms do not overlap and that the corresponding side lobes are negligible, the equation on the previous slide reduces to the following approximation (Exercise 3.8):

$$S(\omega) \approx \frac{1}{P^2}\sum_{k} \left|\hat{H}(\omega_k)\right|^2 \left|W(\omega - \omega_k)\right|^2$$
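The step behind this approximation can be sketched as follows. The shorthand a_k(ω) = Ĥ(ω_k) W(ω − ω_k) is introduced here only for illustration, and the cross terms are dropped under the stated assumption that shifted window transforms do not overlap appreciably:

```latex
\begin{aligned}
S(\omega) &= \frac{1}{P^2}\Big|\sum_k a_k(\omega)\Big|^2
  = \frac{1}{P^2}\Big[\sum_k |a_k(\omega)|^2 + \sum_{k \neq l} a_k(\omega)\,a_l^{*}(\omega)\Big] \\
a_k(\omega)\,a_l^{*}(\omega) &= \hat{H}(\omega_k)\,\hat{H}^{*}(\omega_l)\,W(\omega-\omega_k)\,W^{*}(\omega-\omega_l) \approx 0 \quad (k \neq l) \\
\Rightarrow\quad S(\omega) &\approx \frac{1}{P^2}\sum_k |\hat{H}(\omega_k)|^2\,|W(\omega-\omega_k)|^2
\end{aligned}
```

Each cross term is a product of two window transforms centered on different harmonics, so it is small wherever at most one of the two main lobes is significant.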

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

• Harmonic lines are “resolved”: they appear as horizontal striations in the time-frequency plane of the spectrogram.
• The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., a plosive closely followed by a voiced sound is poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

• The wideband spectrogram uses a short window with a duration of less than one pitch period (see Figure 3.14).
• Shortening the window widens its Fourier transform (recall the uncertainty principle).
• The widened window transforms centered at neighboring harmonics overlap and add, which smears the harmonic line structure and roughly traces out the spectral envelope |Ĥ(ω)| due to the vocal tract and glottal flow contributions.
• From a temporal perspective, since the window is shorter than a pitch period, it essentially “sees” pieces of the periodically occurring sequence ĥ[n].
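The contrast between the two window lengths can be reproduced with a toy signal. The sketch below is not taken from the slides: it assumes a bare impulse train (so ĥ[n] is a unit sample), an 8 kHz sampling rate, a pitch period of 80 samples, and a direct DFT written with the standard library only. It computes one spectrogram slice with a window of four pitch periods and one with a window of half a pitch period.

```python
import cmath
import math

def hamming(length):
    """Hamming window of the given length (length >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def spectrogram_slice(x, start, win_len, n_fft):
    """Squared DFT magnitude of one windowed frame of x, bins 0..n_fft//2."""
    w = hamming(win_len)
    frame = [x[start + n] * w[n] for n in range(win_len)]
    power = []
    for k in range(n_fft // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, v in enumerate(frame))
        power.append(abs(acc) ** 2)
    return power

# Hypothetical test signal: impulse train with pitch period P = 80 samples
# (100 Hz at an assumed 8 kHz sampling rate).
P = 80
x = [1.0 if n % P == 0 else 0.0 for n in range(8 * P)]

N_FFT = 320  # bin spacing 25 Hz, so harmonics fall on bins 0, 4, 8, ...
# Long window: 320 samples = 40 ms = four pitch periods.
narrow = spectrogram_slice(x, start=0, win_len=4 * P, n_fft=N_FFT)
# Short window: 40 samples = 5 ms = half a pitch period, centered on one pulse.
wide = spectrogram_slice(x, start=60, win_len=P // 2, n_fft=N_FFT)

# Narrowband: strong peak on a harmonic (bin 4), deep valley between harmonics (bin 2).
print("narrowband bin 4 / bin 2:", narrow[4] / narrow[2])
# Wideband: the frame holds a single pulse, so no harmonic structure survives.
print("wideband   bin 4 / bin 2:", wide[4] / wide[2])
```

With the long window the harmonic bin dominates the bin halfway between harmonics, while the short-window slice is flat across both bins, which here is simply the flat envelope of the assumed unit-sample ĥ[n].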

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, the wideband spectrogram with the window at time n can therefore be expressed roughly as (see Exercise 3.9)

$$S(\omega) \approx \beta\,\left|\hat{H}(\omega)\right|^2 E[n]$$

where β is a constant scale factor and E[n] is the energy in the waveform under the sliding window:

$$E[n] = \sum_{m} \left|x[m]\right|^2$$

with the sum taken over the samples m that fall under the window at time n.
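A small numerical sketch shows why E[n] rises and falls once per pitch period when the window is shorter than the period. Everything specific here is an assumption made for illustration: a rectangular window, a pitch period of 80 samples, and a decaying exponential standing in for ĥ[n].

```python
def short_time_energy(x, start, win_len):
    """Energy of x under a rectangular window of win_len samples beginning at start."""
    return sum(v * v for v in x[start:start + win_len])

# Hypothetical steady-state voiced signal: each pitch pulse excites a decaying
# response h_hat[n] = 0.9**n (an assumed stand-in for g[n] * h[n]), truncated
# to one period because its tail is negligible by then.
P = 80
h_hat = [0.9 ** n for n in range(P)]
x = [h_hat[n % P] for n in range(6 * P)]

WIN = 32  # short window: less than one pitch period
energy = [short_time_energy(x, n, WIN) for n in range(len(x) - WIN)]

peak = energy[2 * P]         # window starts on a pitch pulse
valley = energy[2 * P + 40]  # window sits in the decayed part of the period
print("energy at pulse:", peak, " energy mid-period:", valley)
```

The energy sequence repeats every P samples, so a spectrogram scaled by E[n] brightens and fades once per pitch period.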

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

• Shows the formants of the vocal tract in frequency.
• Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
• The vertical striations arise because the short window slides through regions of fluctuating energy in the speech waveform.

Figure 3.15 on the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
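The two window lengths can be related to frequency resolution with a rule of thumb: an N-point Hamming window has a null-to-null main lobe of roughly 8π/N radians per sample, which is 4/T Hz for a window lasting T seconds. The sketch below applies that rule to the 20-ms and 4-ms windows; the 100 Hz pitch is a hypothetical value chosen only to give a harmonic spacing to compare against.

```python
def hamming_mainlobe_width_hz(window_duration_s):
    """Approximate null-to-null main-lobe width of a Hamming window.

    An N-point Hamming window has a main lobe about 8*pi/N rad/sample wide,
    which is 4/T Hz for a window lasting T seconds.
    """
    return 4.0 / window_duration_s

F0_HZ = 100.0  # hypothetical pitch; harmonics are then spaced 100 Hz apart

for label, duration in (("narrowband", 0.020), ("wideband", 0.004)):
    width = hamming_mainlobe_width_hz(duration)
    print(f"{label}: {duration * 1e3:.0f} ms window, main lobe ~{width:.0f} Hz, "
          f"{width / F0_HZ:.0f}x the harmonic spacing")
```

Under these assumptions the 20-ms main lobe is about 200 Hz wide, so its nulls land near the neighboring harmonics and individual peaks remain visible, while the 4-ms main lobe is about 1000 Hz wide and blends roughly ten harmonics together.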

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB
[Figure: waveform (normalized magnitude vs. Time [s]) above a spectrogram (Frequency [Hz], 0 to 8000, vs. Time [s], 0 to 3) of the utterance “She had your dark suit in greasy wash water all year”, annotated with word and phone labels.]

MATLAB
[Figure: spectrogram with the same axes as the first MATLAB slide.]

MATLAB
[Figure: waveform and labeled spectrogram of the same utterance, with the same layout as the second MATLAB slide.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
   • the place of the tongue hump along the oral tract,
   • the degree of constriction of the hump, and
   • a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

• Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone for a particular instantiation of a phoneme.
• Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.
• Studying speech sounds through the first two descriptors (source and vocal tract shape) is referred to as articulatory phonetics.
• Studying speech sounds through the last two descriptors (waveform and spectrogram) is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of:
• Vowels
• Consonants
• Diphthongs
• Affricates
• Semi-vowels

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. The articulatory features (corresponding to the first two category descriptors) include:
• Vocal fold state: vibrating or open
• Tongue position and height: front, central, or back along the palate
• Constriction: partial or complete
• Velum state: nasal sound or not a nasal sound

Elements of a Language

• In English, the combinations of features yield 40 phonemes. Other languages can have fewer or more:
  • 11 in Polynesian
  • 141 in the “click” language of Khoisan
• The rules of a language define which phones can be strung together, and how, to form words.
  • In Italian, consonants are normally not allowed at the end of words.
• A phoneme is not strictly defined by a precise adjustment of the articulators (consider dialects and accents).
• The articulatory properties are influenced by:
  • adjacent phonemes,
  • speaking rate,
  • emphasis in speaking, and
  • the time-varying nature of the articulators.
• The variants of sounds, or phones, that convey the same phoneme are called the allophones of that phoneme.
  • Example: “butter”, “but”, and “to”, where the t in each word is somewhat different.
• Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

• Source: quasi-periodic.
  • Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
• System: each vowel phoneme corresponds to a different vocal tract configuration.
• Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
• Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
• Articulatory differences among speakers are one cause of allophonic variation:
  • the place and degree of constriction of the tongue hump, and
  • the cross-section and length of the vocal tract.
• Consequently, the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

• Source: quasi-periodic airflow puffs from the vibrating vocal folds.
• System:
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
• Spectrogram:
  • Dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and are thus anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

• Source:
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  • The constriction is narrower than with vowels.
• System: the location of the constriction made by the tongue or lips determines which sound is produced:
  • at the back, center, or front of the oral tract, or
  • at the teeth or lips.
• Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Simplified model of the periodic component of the output at the lips:

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].

The noise source is the turbulent airflow velocity at the constriction, denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates the noise q[n], and the product excites the front oral cavity, which has impulse response h_f[n]:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

Example 3.4 (cont.)

The simplified model assumes that the outputs of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
• u[n] is modified by the oral cavity.
• x_q[n] can be influenced by the back cavity.
• Sources of non-linear effects (distributed sources due to traveling vortices).
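The simplified model of Example 3.4 can be simulated directly. The sketch below is only an illustration under stated assumptions: the glottal pulse, the two impulse responses, the pitch period, and the Gaussian noise generator are placeholder choices rather than values from the slides, and the function name `voiced_fricative` is hypothetical.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def voiced_fricative(g, h, h_f, pitch_period, n_periods, seed=0):
    """x[n] = h[n]*u[n] + h_f[n]*(q[n] u[n]), with u[n] = g[n]*p[n].

    Assumes len(h) == len(h_f) so the two outputs have equal length.
    """
    length = pitch_period * n_periods
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]
    u = convolve(g, p)[:length]                        # periodic glottal flow
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]   # white-noise turbulence source
    modulated = [qn * un for qn, un in zip(q, u)]      # glottal flow modulates the noise
    x_g = convolve(h, u)                               # periodic component at the lips
    x_q = convolve(h_f, modulated)                     # noise component at the lips
    x = [a + b for a, b in zip(x_g, x_q)]
    return u, x_g, x_q, x

# All numbers below are illustrative placeholders, not measured responses.
g = [0.2, 0.6, 1.0, 0.6, 0.2]            # one glottal pulse
h = [0.8 ** n for n in range(8)]         # vocal tract impulse response
h_f = [(-0.5) ** n for n in range(8)]    # front-cavity impulse response
u, x_g, x_q, x = voiced_fricative(g, h, h_f, pitch_period=20, n_periods=4)
```

Because the noise is multiplied by u[n], the noise component x_q[n] appears only while the glottis is open and shortly afterwards, which is the pitch-synchronous noise burst the model is meant to capture.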

Elements of a Language: Fricatives

• Spectrogram:
  • Unvoiced fricatives are characterized by a “noisy” spectrum.
  • Voiced fricatives often show both noise and harmonics.
• Waveform:
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

• Whisper forms a class of its own under the general category of consonants.
• Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

• System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
• Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of the air pressure and generation of turbulence over a very short duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.
• Voiced plosives: the vocal folds vibrate for the duration of all four steps.
  • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a “voice bar”.
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • The delay between the burst and the onset of voicing of the vowel is much shorter.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives Waveform

Elements of a Language: Plosives Spectrogram

Elements of a Language: Plosives

Example 3.5: a time-varying system model for the voiced plosive

• A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
• Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n].
• The glottal flow velocity model for the periodic source component is

$$u[n] = g[n] * p[n]$$

  where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.
• Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary convolution operator; it must be computed with the time-varying impulse response introduced in Chapter 2.
• In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction, with impulse response h_f[n, m].
• h[n, m] and h_f[n, m] are the responses at time n to a unit sample applied m samples earlier, at time n − m.

Assuming that the two outputs combine linearly, the output can be written with a generalization of the convolution operator:

$$x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$$
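The generalized convolution of Example 3.5 can be written as a short routine. This is a minimal sketch under assumptions made only for illustration: the impulse responses are causal and truncated to a finite memory, and the functions `h` and `h_f`, the pulse shape, and all constants are hypothetical placeholders rather than models from the slides.

```python
def time_varying_output(u, h, memory):
    """y[n] = sum_{m=0}^{memory-1} h(n, m) * u[n - m].

    h(n, m) is the response at time n to a unit sample applied at time n - m.
    """
    y = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(memory, n + 1)):
            acc += h(n, m) * u[n - m]
        y.append(acc)
    return y

N = 60        # number of output samples (illustrative)
MEMORY = 16   # assumed length of each impulse response

def h(n, m):
    """Hypothetical vocal tract: decay factor drifts from 0.5 to 0.9 as the vowel sets in."""
    a = 0.5 + 0.4 * min(n, N - 1) / (N - 1)
    return a ** m

def h_f(n, m):
    """Hypothetical front cavity excited by the burst (time-invariant here for simplicity)."""
    return (-0.6) ** m

P = 20
g = [0.3, 1.0, 0.3]                                     # placeholder glottal pulse
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]
# u[n] = g[n] * p[n]: a response that ignores n gives ordinary convolution.
u = time_varying_output(p, lambda n, m: g[m], len(g))
burst = [1.0] + [0.0] * (N - 1)                         # delta[n]: burst at n = 0

x = [a + b for a, b in zip(time_varying_output(u, h, MEMORY),
                           time_varying_output(burst, h_f, MEMORY))]
```

When h(n, m) does not depend on n, the routine reduces to ordinary convolution, which is how u[n] is formed above; for the burst input δ[n] the second sum collapses to h_f[n, n].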

Elements of a Language: Transitional Speech Sounds

Diphthongs
• Vowel-like in nature, with vibrating vocal folds.
• Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
• Characterized by movement from one vowel target to another.
• Examples: Y as in “hide”, W as in “out”, O as in “boy”, JU as in “new”.

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
• Glides (w as in “we” and y as in “you”)
• Liquids (r as in “read” and l as in “let”)

Compared to diphthongs, glides have
• greater constriction of the oral tract during the transition, and
• greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
• Unlike fricatives, affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
• Examples: tS as in “chew” (unvoiced) and J as in “just” (voiced).

Coarticulation

• Vocal fold and vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
• Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
• Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
• Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time (“horse” vs. “horseshoe”, “sweep” vs. “seep”).
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
• Intonation (change in pitch)
• Amplitude/energy (loudness)
• Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 31
  • Introduction (2)
  • Example of ldquoShoprdquo
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 31
  • Example 31 (2)
  • Figure 37
  • Example 31 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 310
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 32
  • Example 32 (2)
  • Example 32 (3)
  • Example 32 (4)
  • Spectral Shaping (4)
  • Example 33
  • Figure 312
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives ldquoDroprdquo
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives ldquoNASArdquo
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 315
  • Figure 316
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 317
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 318
  • Figure 319
  • Elements of a Language Nasals
  • Figure 320
  • Figure 321
  • Elements of a Language Fricatives
  • Example 34
  • Example 34 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 324 - Fricatives
  • Figure 323
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 328 ndash Liquids amp Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 329 - Prosody
  • Figure 330 ndash Global Coarticulation
Page 59: Speech Processing

April 22 2023 Veton Keumlpuska 59

Narrow-band Spectrogram

April 22 2023 Veton Keumlpuska 60

Spectrographic Analysis of Speech Note that for voiced speech the speech waveform was approximated as the

output of a linear time-invariant system with impulse response h[n] and with a glottal flow input given by the convolution of the glottal flow over one cycle g[n] with the impulse train p[n] = [n-kP]

x[n]= w[n](p[n]g[n])h[n]x[n]= w[n]p[n]ĥ[n]

Where glottal waveform over a cycle and vocal tract impulse response was combined as ĥ[n] = g[n]h[n] From the result of example 32 the spectrogram of x[n] can be therefore expressed as

frequency fundametal theis2 and 2 whereand

)()()(~where

)()(~1)(2

2

Pk

P

GHH

WHP

S

k

kk

k

April 22 2023 Veton Keumlpuska 61

Spectrographic Analysis of Speech Difference of narrowband and wideband

spectrogram is in the length of the (analysis) window w[n]

Narrowband Spectrogram Uses ldquolongrdquo window with a duration of typically at

least two pitch periods Under the conditions that

The main lobes of shifted window Fourier transforms are non-overlapping and that

Corresponding transform side-lobes are negligible from the equation in pervious slide the following approximation holds (exercise 38)

k

kk WHP

S 22

2 )()(~1)(

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech Narrowband Spectrogram (cont)

Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram

Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of SpeechWideband Spectrogram

Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform

(recall the uncertainty principle) Widening of Fourier transform will cause neighboring

harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions

From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)

Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window

][)(~)(2

EHS k

n

nxE 2][][

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather

than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding

through fluctuating energy regions of the speech waveform

Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

April 22 2023 Veton Keumlpuska 66

Figure 315

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
  • the place of the tongue hump along the oral tract, and
  • the degree of constriction of the hump.
  The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

  • Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey it, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two of the four perspectives listed above (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two (waveform and spectrogram) are used, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:

  • Vocal fold state: vibrating or open.
  • Tongue position and height: front, central, or back along the palate.
  • Constriction: partial or complete.
  • Velum state: nasal sound or not a nasal sound.

Elements of a Language

  • In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
  • The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
  • The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
  • Motor theory of perception: uses articulatory features, derived from the speech waveform and its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

  • Source: quasi-periodic.
  • Pitch: not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences between speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

  • Source: quasi-periodic airflow puffs from the vibrating vocal folds.
  • System:
    • The velum is lowered and the air flows mainly through the nasal cavity.
    • Because the oral tract is constricted, the sound is radiated at the nostrils.
    • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
  • Spectrogram:
    • Is dominated by the low resonance of the large volume of the nasal cavity.
    • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
    • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

  • Source:
    • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
    • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
    • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
    • The constriction is narrower than with vowels.
  • System: the location of the constriction made by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
  • Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

In a simplified model, the contribution of the periodic source to the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is the turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

We assume in the simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:

  • u[n] is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
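The simplified voiced-fricative model of Example 3.4 can be sketched in a few lines. Everything numeric here is an assumption made for illustration: the pitch period, the four-sample glottal pulse `g`, and the decaying impulse responses `h` and `h_f` are hypothetical and not taken from the slides; only the structure x[n] = h[n] * u[n] + h_f[n] * (q[n] u[n]) follows the example.

```python
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

random.seed(0)                      # reproducible noise
P = 40                              # assumed pitch period in samples
N = 5 * P                           # five pitch periods

p = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # impulse train p[n]
g = [0.0, 0.5, 1.0, 0.5]                             # hypothetical glottal pulse g[n]
u = convolve(g, p)[:N]                               # u[n] = g[n] * p[n]

h = [0.9 ** n for n in range(20)]        # hypothetical vocal tract response h[n]
h_f = [(-0.5) ** n for n in range(10)]   # hypothetical front-cavity response h_f[n]

q = [random.gauss(0.0, 1.0) for _ in range(N)]       # white noise q[n]
modulated_noise = [q[n] * u[n] for n in range(N)]    # q[n] u[n]

x_g = convolve(h, u)[:N]                   # periodic component
x_q = convolve(h_f, modulated_noise)[:N]   # noise component
x = [x_g[n] + x_q[n] for n in range(N)]    # the two components add

print(x[:8])
```

Because the noise is multiplied by u[n], the noise source is active only while the glottal flow is nonzero, so the noise bursts are synchronized with the pitch periods.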

Elements of a Language: Fricatives

  • Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
  • Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:

1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps:

  • During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the onset of voicing of the vowel.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (waveform)

Elements of a Language: Plosives (spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$. Here h[n, m] and $h_f[n, m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator, where we have assumed that the two outputs can be linearly combined:

$$x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$$
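A direct evaluation of the time-varying superposition sum in Example 3.5 can be sketched as follows. The responses `h` and `h_f` below are hypothetical (a one-pole decay whose rate drifts over the first 50 samples, and a fixed decay for the front cavity), and the causal restriction 0 <= m <= n is an assumption of this sketch; only the form of the sum comes from the example.

```python
def time_varying_output(h, h_f, u, num_samples):
    """Evaluate x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) delta[n - m].

    h(n, m) and h_f(n, m) give the response at time n to a unit sample
    applied m samples earlier. Both are assumed causal (zero for m < 0),
    and u[n] is assumed zero for n < 0.
    """
    x = []
    for n in range(num_samples):
        periodic = sum(h(n, m) * u[n - m] for m in range(n + 1))
        # delta[n - m] is nonzero only at m == n, so the burst term is h_f(n, n).
        burst = h_f(n, n)
        x.append(periodic + burst)
    return x

def h(n, m):
    """Hypothetical vocal tract: decay rate moves from 0.5 to 0.9 over 50 samples."""
    decay = 0.5 + 0.4 * min(n, 50) / 50.0
    return decay ** m

def h_f(n, m):
    """Hypothetical front cavity, kept time-invariant here for simplicity."""
    return 0.6 ** m

# Periodic source u[n] = g[n] * p[n] with assumed P = 20 and g = [1.0, 0.5].
num_samples = 60
u = [0.0] * num_samples
for k in range(0, num_samples, 20):
    u[k] += 1.0
    u[k + 1] += 0.5

x = time_varying_output(h, h_f, u, num_samples)
print(x[:5])
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which gives a convenient sanity check for the routine.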

Elements of a Language: Transitional Speech Sounds

Diphthongs

  • Have a vowel-like nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
  • Are characterized by movement from one vowel target to another.
  • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
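The movement between two vowel targets can be pictured as a formant trajectory. The sketch below is purely illustrative: linear interpolation is the simplest possible choice and is an assumption here, not a claim about real articulation, and the (F1, F2) target values are rough textbook-style figures for an open back vowel and a close front vowel rather than numbers from these slides. `formant_track` is a helper name introduced for this example.

```python
def formant_track(start, end, num_frames):
    """Linearly interpolate each formant frequency from start to end."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    track = []
    for i in range(num_frames):
        t = i / (num_frames - 1)          # 0.0 at the first frame, 1.0 at the last
        track.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return track

# Assumed (F1, F2) targets in Hz for a glide like the one in "hide".
start_target = (730.0, 1090.0)   # open back vowel, illustrative values
end_target = (270.0, 2290.0)     # close front vowel, illustrative values

track = formant_track(start_target, end_target, 5)
for f1, f2 in track:
    print(f1, f2)
```

In this picture a glide would follow a similar trajectory but traverse it faster, which matches the distinction drawn between glides and diphthongs.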

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds.

  • Glides (w as in "we" and y as in "you").
  • Liquids (r as in "read" and l as in "let").

Compared to diphthongs, glides have:

  • greater constriction of the oral tract during the transition, and
  • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:

  • /tS/ as in "chew" (unvoiced)
  • /J/ as in "just" (voiced)

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:

  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  • Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:

  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Spectrographic Analysis of Speech

Note that for voiced speech the speech waveform was approximated as the output of a linear time-invariant system with impulse response h[n], driven by a glottal flow input given by the convolution of the glottal flow over one cycle, g[n], with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$. With analysis window w[n], the windowed waveform is

$$x[n] = w[n]\,\{(p[n] * g[n]) * h[n]\} = w[n]\,(p[n] * \hat{h}[n])$$

where the glottal waveform over a cycle and the vocal tract impulse response have been combined as ĥ[n] = g[n] * h[n]. From the result of Example 3.2, the spectrogram of x[n] can therefore be expressed as

$$S(\omega) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \hat{H}(\omega_k)\, W(\omega - \omega_k) \right|^2$$

where $\hat{H}(\omega) = H(\omega)\,G(\omega)$, $\omega_k = \frac{2\pi}{P} k$, and $2\pi/P$ is the fundamental frequency.

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms is the length of the (analysis) window w[n].

Narrowband Spectrogram

  • Uses a "long" window with a duration of typically at least two pitch periods.
  • Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the equation in the previous slide gives the following approximation (Exercise 3.8), with $\omega_k = \frac{2\pi}{P} k$:

$$S(\omega) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} |\hat{H}(\omega_k)|^2\, |W(\omega - \omega_k)|^2$$

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

  • Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.
  • The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).
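The resolving of harmonic lines by a long window can be checked on an idealized source. The sketch below is an illustration under stated assumptions: the signal is a pure impulse train (no glottal pulse or vocal tract shaping), the sampling rate and pitch period are hypothetical, and `windowed_dft_magnitude` and `harmonic_contrast` are helper names introduced here. It compares the windowed Fourier transform magnitude at a harmonic (500 Hz) with the magnitude midway between two harmonics (550 Hz).

```python
import cmath
import math

def hamming(length):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]

def windowed_dft_magnitude(signal, start, window, freq_hz, fs):
    """Magnitude of the Fourier transform of one windowed frame at freq_hz."""
    omega = 2 * math.pi * freq_hz / fs
    total = sum(signal[start + i] * window[i] * cmath.exp(-1j * omega * i)
                for i in range(len(window)))
    return abs(total)

fs = 8000            # assumed sampling rate in Hz
pitch_period = 80    # assumed pitch period: F0 = 100 Hz
signal = [1.0 if n % pitch_period == 0 else 0.0 for n in range(800)]

def harmonic_contrast(start, length):
    """Ratio of the magnitude on a harmonic to the magnitude between harmonics."""
    window = hamming(length)
    on_harmonic = windowed_dft_magnitude(signal, start, window, 500.0, fs)
    between = windowed_dft_magnitude(signal, start, window, 550.0, fs)
    return on_harmonic / between

narrowband = harmonic_contrast(start=0, length=320)   # 40 ms: four pitch periods
wideband = harmonic_contrast(start=60, length=40)     # 5 ms: under one pitch period

print("long-window contrast:", narrowband)
print("short-window contrast:", wideband)
```

The long window gives a large contrast, so the harmonic shows up as a distinct horizontal line; the short window, which here sees a single pulse, gives a contrast of one, so no harmonic structure is visible.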

Spectrographic Analysis of Speech: Wideband Spectrogram

  • The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
  • Shortening the window widens its Fourier transform (recall the uncertainty principle).
  • The widened window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
  • From the temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence ĥ[n].

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound, we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, n) \approx \beta\, |\hat{H}(\omega)|^2\, E[n]$$

where $\beta$ is a constant scale factor and E[n] is the energy in the waveform under the sliding window at time n, that is, the sum of $|x[m]|^2$ over the samples m covered by the window.

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather

than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding

through fluctuating energy regions of the speech waveform

Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

April 22 2023 Veton Keumlpuska 66

Figure 315

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

April 22 2023 Veton Keumlpuska 87

Elements of a Language Fricatives Spectrogram

Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while

Voiced fricatives often show both noise and harmonics

Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on

quasi-periodic signal

Elements of a Language Whisper Whisper

Forms a class of its own under general category of Consonants

Turbulent flow is produced at the glottis rather than at the vocal tract constriction

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 324 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 323

April 22 2023 Veton Keumlpuska 91

Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however

brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced

System Constriction can occur at

Front Center or Back of the oral tract (Figure 324)

Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst

With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no

aspiration There is much shorter delay between the burst and the voicing of the vowel onset

Figure 326 compares voicedunvoiced plosive pair

April 22 2023 Veton Keumlpuska 92

Elements of a Language Plosives Waveform

April 22 2023 Veton Keumlpuska 93

Elements of a Language Plosives Spectrogram

April 22 2023 Veton Keumlpuska 94

Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive

Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept

introduced in Chapter 2

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]

h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m

The output then can be written using generalization of the convolution operator

We have assumed that two outputs can be linearly combined

mnmnhmnumnhnxm

fm

April 22 2023 Veton Keumlpuska 95

Elements of a Language Transitional Speech Sounds Diphthongs

Vowel like nature with vibrating vocal folds

Do not have a steady vocal tract configuration They are produced by

varying in time the vocal tract smoothly between two vowel configurations

Characterized by movement from one vowel target to another

hide Y out W boy O new JU

April 22 2023 Veton Keumlpuska 96

Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like

sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)

Glides Greater constriction of oral tract during the

transition and Greater speed of the oral tract movement

compared to diphthongs

April 22 2023 Veton Keumlpuska 97

Figure 328 ndash Liquids amp Glides

April 22 2023 Veton Keumlpuska 98

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have:
A fricative portion preceded by a complete constriction of the oral cavity,
Formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch),
Amplitude/energy (loudness),
Timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.
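As a rough illustration of the intonation component of prosody, the following Python sketch estimates one pitch value per frame using normalized cross-correlation. The function names, the 8 kHz sampling rate, the 70 to 400 Hz search range, and the synthetic test tones are all assumptions made for this example; none of them come from the slides, and a practical pitch tracker would need additional safeguards (voicing decisions, octave-error handling).

```python
import math

def normalized_correlation(frame, lag):
    """Correlation between the frame and a copy of itself shifted by `lag` samples."""
    a = frame[:len(frame) - lag]
    b = frame[lag:]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0

def estimate_pitch(frame, fs, f_lo=70.0, f_hi=400.0):
    """Return the pitch in Hz whose period best matches the frame."""
    lag_min = int(fs / f_hi)
    lag_max = int(fs / f_lo)
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: normalized_correlation(frame, lag))
    return fs / best_lag

def voiced_tone(period, n_samples):
    """Synthetic periodic signal with three harmonics (hypothetical test input)."""
    return [math.sin(2 * math.pi * n / period)
            + 0.5 * math.sin(4 * math.pi * n / period)
            + 0.3 * math.sin(6 * math.pi * n / period)
            for n in range(n_samples)]

fs = 8000                          # assumed sampling rate in Hz
frames = [voiced_tone(80, 320),    # period of 80 samples: 100 Hz pitch
          voiced_tone(64, 320)]    # period of 64 samples: 125 Hz pitch
contour = [estimate_pitch(f, fs) for f in frames]
print(contour)                     # pitch rises from the first frame to the second
```

Tracking such a contour over a whole utterance, together with frame energy and segment durations, gives the three prosodic cues listed above.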

Figure 3.29: Prosody

Figure 3.30: Global Coarticulation

Page 61: Speech Processing

Spectrographic Analysis of Speech

The difference between the narrowband and wideband spectrograms lies in the length of the analysis window $w[n]$.

Narrowband Spectrogram
Uses a "long" window with a duration of typically at least two pitch periods.
Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform side-lobes are negligible, the equation in the previous slide leads to the following approximation (Exercise 3.8):

$$S(\omega) \approx \frac{1}{P^2} \sum_{k} |\tilde{H}(\omega_k)|^2\, |W(\omega - \omega_k)|^2$$

where $\omega_k = 2\pi k / P$ are the harmonic frequencies.

Spectrographic Analysis of Speech: Narrowband Spectrogram (cont.)

Harmonic lines are "resolved": they appear as horizontal striations in the time-frequency plane of the spectrogram.

The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely followed by a voiced sound are poorly represented).

Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
Shortening the window widens its Fourier transform (recall the uncertainty principle).
The wider window transforms centered at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window essentially "sees" pieces of the periodically occurring sequence $\tilde{h}[n]$.

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$$S(\omega, n) \approx \beta\, |\tilde{H}(\omega)|^2\, E[n]$$

where $\beta$ is a constant scale factor and $E[n]$ is the energy in the waveform under the sliding window positioned at time $n$:

$$E[n] = \sum_{m} x^2[m]$$

with the sum taken over the samples $m$ covered by the window.
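To see why the energy term matters, the short Python sketch below computes a sliding-window energy track for a toy "voiced" waveform. The waveform (a decaying 800 Hz resonance restarted every pitch period), the 8 kHz sampling rate, and the names `sliding_energy` and `fluctuation` are assumptions chosen for illustration only.

```python
import math

def sliding_energy(x, win_len):
    """E[n]: sum of x[m]**2 over the win_len samples starting at n."""
    return [sum(v * v for v in x[n:n + win_len])
            for n in range(len(x) - win_len + 1)]

def fluctuation(energy):
    """Ratio of the largest to the smallest value of an energy track."""
    return max(energy) / min(energy)

fs = 8000       # assumed sampling rate in Hz
period = 80     # assumed pitch period in samples (10 ms, i.e. 100 Hz pitch)

# Toy voiced waveform: a decaying 800 Hz resonance re-excited every pitch period.
x = [math.exp(-(n % period) / 15.0)
     * math.cos(2 * math.pi * 800 * (n % period) / fs)
     for n in range(8 * period)]

short = sliding_energy(x, 32)    # 4 ms window: shorter than one pitch period
long_ = sliding_energy(x, 160)   # 20 ms window: exactly two pitch periods

print(fluctuation(short))   # large: the energy rises and falls within each period
print(fluctuation(long_))   # about 1: the energy is essentially constant
```

With the short window the energy swings strongly once per pitch period, which is the origin of the vertical striations in a wideband spectrogram; with the long window the energy stays nearly constant.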

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

Shows the formants of the vocal tract in frequency.
Also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
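The effect of the two window lengths can be checked numerically on a single frame. The sketch below is a minimal, standard-library-only illustration, not the method used to produce the figures: the 8 kHz sampling rate, the 200 Hz pitch, the 1000 Hz resonance, and the helper names are all assumptions. It compares the spectrum magnitude at a harmonic (1000 Hz) with the magnitude midway between two harmonics (900 Hz).

```python
import cmath
import math

def hamming(n_len):
    """Symmetric Hamming window of n_len samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

def magnitude_at(frame, freq_hz, fs):
    """Magnitude of the Fourier transform of a Hamming-windowed frame at one frequency."""
    w = hamming(len(frame))
    acc = sum(wn * xn * cmath.exp(-2j * math.pi * freq_hz * n / fs)
              for n, (wn, xn) in enumerate(zip(w, frame)))
    return abs(acc)

def harmonic_contrast(frame, fs):
    """Spectrum level at the 1000 Hz harmonic relative to the 900 Hz midpoint."""
    return magnitude_at(frame, 1000.0, fs) / magnitude_at(frame, 900.0, fs)

fs = 8000       # assumed sampling rate in Hz
period = 40     # assumed pitch period: 5 ms, i.e. harmonics every 200 Hz

# Toy voiced signal: a decaying 1000 Hz resonance re-excited every pitch period.
x = [math.exp(-(n % period) / 15.0)
     * math.cos(2 * math.pi * 1000 * (n % period) / fs)
     for n in range(10 * period)]

narrow_ratio = harmonic_contrast(x[:160], fs)   # 20 ms window: four pitch periods
wide_ratio = harmonic_contrast(x[:32], fs)      # 4 ms window: under one pitch period

print(narrow_ratio)   # much greater than 1: harmonic lines are resolved
print(wide_ratio)     # close to 1: harmonics are smeared into the envelope
```

Repeating this measurement over many frame positions and frequencies would give the two spectrogram types; here a single frame is enough to show resolved versus smeared harmonics.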

Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram; axes Time [s] (0 to 3) and Frequency [Hz] (0 to 8000)]

MATLAB
[Figure: normalized-magnitude waveform and spectrogram (Time [s] 0 to 3, Frequency [Hz] 0 to 8000) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

MATLAB
[Figure: spectrogram; axes Time [s] (0 to 3) and Frequency [Hz] (0 to 8000)]

MATLAB
[Figure: normalized-magnitude waveform and spectrogram (Time [s] 0 to 3, Frequency [Hz] 0 to 8000) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.

If the first two of the four perspectives listed on the previous slide are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of:
Vowels, Consonants, Diphthongs, Affricates, and Semi-vowels.

This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
Vocal fold state: vibrating or open.
Tongue position and height: front, central, or back along the palate.
Constriction: partial or complete.
Velum state: nasal sound or not a nasal sound.

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

Rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic.
Pitch: not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese some sounds are interpreted based on the pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
Articulatory differences between speakers are one cause of allophonic variations:
The place and degree of constriction of the tongue hump, and
The cross-section and length of the vocal tract,
so the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System:
The velum is lowered and the air flows mainly through the nasal cavity.
Because the oral tract is constricted, the sound is radiated at the nostrils.
Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
Is dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
Noise is generated by turbulent airflow at some point of constriction along the oral tract.
The constriction is narrower than with vowels.

System:
The location of the constriction determines which sound is produced: the back, center, or front of the oral tract for the tongue, as well as the teeth or lips.

Spectrogram:
Noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic source and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Simplified model of the voiced-fricative output at the lips due to the periodic source:

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$.

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates the noise function $q[n]$, and the product excites the front oral cavity, which has impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\, u[n])$$

In the simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
$u[n]$ is modified by the oral cavity.
$x_q[n]$ can be influenced by the back cavity.
Sources of non-linear effects (distributed sources due to traveling vortices).
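The simplified voiced-fricative model of Example 3.4 can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the raised-cosine glottal pulse, the 80-sample pitch period, the two impulse responses `h` and `hf`, the Gaussian white noise, and the function names. Only the structure (periodic source through `h`, glottal-flow-modulated noise through `hf`, then a sum) follows the example.

```python
import math
import random

def convolve(a, b):
    """Direct-form linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def voiced_fricative(g, period, n_pulses, h, hf, seed=0):
    """Return (u, xg, xq, x) for the additive voiced-fricative model."""
    # p[n]: impulse train with pitch period P
    p = [1.0 if n % period == 0 else 0.0 for n in range(period * n_pulses)]
    u = convolve(g, p)[:len(p)]                      # u[n] = g[n] * p[n]
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in u]             # white-noise source q[n]
    xg = convolve(h, u)[:len(u)]                     # x_g[n] = h[n] * u[n]
    modulated = [qn * un for qn, un in zip(q, u)]    # q[n] u[n]
    xq = convolve(hf, modulated)[:len(u)]            # x_q[n] = h_f[n] * (q[n] u[n])
    x = [a + b for a, b in zip(xg, xq)]              # x[n] = x_g[n] + x_q[n]
    return u, xg, xq, x

period = 80
# Hypothetical glottal pulse: open for 20 samples, closed for the rest of the period.
g = [0.5 * (1 - math.cos(2 * math.pi * k / 20)) for k in range(20)]
# Hypothetical vocal tract and front-cavity impulse responses.
h = [math.exp(-k / 10.0) * math.cos(2 * math.pi * 500 * k / 8000) for k in range(40)]
hf = [1.0, -0.9]

u, xg, xq, x = voiced_fricative(g, period, 4, h, hf)
print(len(x), max(abs(v) for v in xq))
```

Because the noise is multiplied by $u[n]$, the noise component vanishes while the glottis is closed, so in this sketch the frication noise is pitch-synchronously modulated.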

Elements of a Language: Fricatives

Spectrogram:
Unvoiced fricatives are characterized by a "noisy" spectrum, while
Voiced fricatives often show both noise and harmonics.

Waveform:
An unvoiced fricative contains only noise.
A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper:
Forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24: Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.

System:
The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all 4 steps.
During the period when the oral tract is closed we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives
Waveform

Elements of a Language: Plosives
Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent the time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:

$$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$
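The generalized convolution of Example 3.5 can be sketched directly in Python. This is a minimal illustration under stated assumptions: the time-varying responses are passed as functions `h(n, m)` and `hf(n, m)`, the systems are taken to be causal, the glottal source is simplified to a bare impulse train, and the particular drifting one-pole responses are invented for the demonstration.

```python
def time_varying_output(u, h):
    """x[n] = sum over m of h(n, m) * u[n - m].

    h(n, m) is the response at time n to a unit sample applied m samples earlier.
    Causality is assumed, so only m = 0..n contributes.
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):
            acc += h(n, m) * u[n - m]
        x.append(acc)
    return x

def plosive_output(u, burst, h, hf):
    """Sum of the vocal-tract response to u[n] and the front-cavity response to the burst."""
    xg = time_varying_output(u, h)
    xb = time_varying_output(burst, hf)
    return [a + b for a, b in zip(xg, xb)]

N = 200
burst = [1.0] + [0.0] * (N - 1)                        # idealized burst: delta[n]
u = [1.0 if n % 50 == 0 else 0.0 for n in range(N)]    # simplified periodic source

def h(n, m):
    # Hypothetical vocal tract: a one-pole response whose pole drifts from 0.80 to 0.95.
    a = 0.80 + 0.15 * n / (N - 1)
    return a ** m

def hf(n, m):
    # Hypothetical front cavity: a fixed, quickly decaying response.
    return 0.5 ** m

x = plosive_output(u, burst, h, hf)
print(x[:5])
```

When `h(n, m)` does not depend on `n`, the inner sum reduces to ordinary convolution, which is a convenient sanity check on the implementation.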


Page 62: Speech Processing

April 22 2023 Veton Keumlpuska 62

Spectrographic Analysis of Speech Narrowband Spectrogram (cont)

Harmonic lines are ldquoresolvedrdquo ndash horizontal striations in the time-frequency plane of the spectrogram

Long window which covers several pitch periods smears closely spaced temporal events and thus gives poor time resolutions (eg plosives that are closely spaced to a succeeding voiced sound are poorly represented)

April 22 2023 Veton Keumlpuska 63

Spectrographic Analysis of SpeechWideband Spectrogram

Wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 314) Shortening the window widens the Fourier transform

(recall the uncertainty principle) Widening of Fourier transform will cause neighboring

harmonics to overlap and add of neighboring window transforms thus smearing the harmonic line structure roughly tracing out the spectral envelope |Ĥ()| due to vocal tract and glottal flow contributions

From temporal perspective since the window length is less than a pitch period the window ldquoseesrdquo essentially pieces of the periodically occurring sequence ĥ[n]

April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)

Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window

][)(~)(2

EHS k

n

nxE 2][][

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather

than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding

through fluctuating energy regions of the speech waveform

Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

April 22 2023 Veton Keumlpuska 66

Figure 315

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language: Vowels

Source: quasi-periodic. Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.

System: Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences among speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with speaker.
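To get a feel for why vocal tract length alone shifts the formants, one common textbook idealization (an assumption here, not something stated on the slide) treats the tract as a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). The function name, the two tract lengths, and the speed of sound of 350 m/s below are illustrative choices only.

```python
def uniform_tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of an idealized uniform tube closed at one end, open at the other.

    F_k = (2k - 1) * c / (4 * L). This is a crude stand-in for a real vocal tract.
    """
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, n_formants + 1)]


# Two hypothetical speakers that differ only in vocal tract length.
for label, length_m in [("longer tract, 17.5 cm", 0.175),
                        ("shorter tract, 14 cm", 0.14)]:
    formants = uniform_tube_formants(length_m)
    print(label, [round(f) for f in formants])
```

Under these assumptions the shorter tube has uniformly higher resonances, which is consistent with the observation that formants vary with speaker even for the same vowel.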

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System: The velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram: The spectrogram is dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract; consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source: For unvoiced fricatives the vocal folds are relaxed and not vibrating. For voiced fricatives the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract; the constriction is narrower than with vowels.

System: The location of the constriction, made by the tongue or lips, determines which sound is produced: at the back, center, or front of the oral tract, as well as at the teeth or lips.

Spectrogram: Noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$x_g[n] = h[n] * (g[n] * p[n])$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component models the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n] u[n])$

Example 3.4 (cont.)

We assume in this simplified model that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n] u[n])$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored: $u[n]$ is modified by the oral cavity; $x_q[n]$ can be influenced by the back cavity; and there are sources of nonlinear effects (distributed sources due to traveling vortices).
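The additive model of Example 3.4, x[n] = h[n] * u[n] + h_f[n] * (q[n] u[n]), can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the method of the lecture or of any library: the pulse shape g, the short responses h and h_f, the Gaussian noise, and all function and parameter names are hypothetical toy choices.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, h, h_f, pitch_period, n_periods, noise_std=0.1, seed=0):
    """Return (x, u) with x = h*u + h_f*(q.u) and u = g*p, truncated to n samples."""
    n = pitch_period * n_periods
    p = [1.0 if k % pitch_period == 0 else 0.0 for k in range(n)]  # impulse train
    u = convolve(g, p)[:n]                                         # periodic glottal flow
    rng = random.Random(seed)
    q = [rng.gauss(0.0, noise_std) for _ in range(n)]              # white noise source
    modulated = [qk * uk for qk, uk in zip(q, u)]                  # glottal flow gates the noise
    x_g = convolve(h, u)[:n]                                       # periodic component
    x_q = convolve(h_f, modulated)[:n]                             # noise component
    return [a + b for a, b in zip(x_g, x_q)], u


# Toy, hypothetical shapes chosen only for illustration.
g = [0.0, 0.5, 1.0, 0.5]   # one glottal pulse
h = [1.0, 0.6, 0.36]       # short decaying "vocal tract" response
h_f = [1.0, -0.5]          # short "front cavity" response
x, u = voiced_fricative(g, h, h_f, pitch_period=8, n_periods=4)
```

Because the noise is multiplied by u[n] before filtering, the noise component in this sketch appears only while the glottis is open, so the noise bursts are synchronized with the pitch period.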

Elements of a Language: Fricatives

Spectrogram: Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: An unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:

1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of vocal fold vibrations through the walls of the throat; this activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (waveform)

Elements of a Language: Plosives (spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

Assuming that the two outputs can be linearly combined, the output can then be written using a generalization of the convolution operator:

$x[n] = \sum_{m} h[n, m] \, u[n - m] + \sum_{m} h_f[n, m] \, \delta[n - m]$
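The generalized convolution of Example 3.5, y[n] = sum over m of h[n, m] x[n - m], can be sketched by letting the filter taps depend on the output time n. The code below is an illustrative sketch only: the function names, the exponentially decaying taps, and the way the decay rate drifts over time are hypothetical choices, not the model parameters of the lecture.

```python
def time_varying_filter(taps_at, x):
    """y[n] = sum_m h[n, m] * x[n - m], where taps_at(n) returns h[n, 0], h[n, 1], ..."""
    y = []
    for n in range(len(x)):
        taps = taps_at(n)                 # impulse response in effect at output time n
        max_lag = min(len(taps) - 1, n)   # do not reach before the start of x
        y.append(sum(taps[m] * x[n - m] for m in range(max_lag + 1)))
    return y


def decaying_taps(a, length=6):
    """Hypothetical short response a**m for lags m = 0 .. length - 1."""
    return [a ** m for m in range(length)]


N, P = 40, 10
g = [0.0, 0.5, 1.0, 0.5]                                      # toy glottal pulse
u = [g[n % P] if n % P < len(g) else 0.0 for n in range(N)]   # u = g * p for len(g) <= P
delta = [1.0] + [0.0] * (N - 1)                               # idealized burst at n = 0


def h(n):
    """Vocal tract response drifting toward a "steady vowel" (assumed trajectory)."""
    return decaying_taps(0.3 + 0.5 * n / (N - 1))


def h_f(n):
    """Front cavity response beyond the constriction (assumed trajectory)."""
    return decaying_taps(0.5 - 0.3 * n / (N - 1), length=4)


x = [a + b for a, b in zip(time_varying_filter(h, u),
                           time_varying_filter(h_f, delta))]
```

When taps_at returns the same taps for every n, the function reduces to ordinary convolution, which is a convenient sanity check for the time-varying form.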

Elements of a Language: Transitional Speech Sounds

Diphthongs have a vowel-like nature, with vibrating vocal folds. They do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations, and they are characterized by movement from one vowel target to another.

Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds, glides (/w/ as in "we" and /y/ as in "you") and liquids (/r/ as in "read" and /l/ as in "let").

Glides: compared to diphthongs, glides have greater constriction of the oral tract during the transition and greater speed of oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: /tS/ as in "chew" (unvoiced) and /J/ as in "just" (voiced).

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, for example "horse" vs. "horseshoe" and "sweep" vs. "seep".

Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme: intonation (change in pitch), amplitude/energy (loudness), and timing (articulation rate or rhythm).

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Spectrographic Analysis of Speech: Wideband Spectrogram

The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14). Shortening the window widens its Fourier transform (recall the uncertainty principle). This widening causes the window transforms centered at neighboring harmonics to overlap and add, thus smearing the harmonic line structure and roughly tracing out the spectral envelope $|\hat{H}(\omega)|$ due to the vocal tract and glottal flow contributions.

From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\hat{h}[n]$.

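Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

For a steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 3.9)

$S(\omega, \tau) \approx \beta \, |\hat{H}(\omega)|^{2} \, E[\tau]$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window positioned at time $\tau$:

$E[\tau] = \sum_{n} |x[n, \tau]|^{2}$

with $x[n, \tau]$ denoting the windowed segment of the waveform at window position $\tau$.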
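The energy under the sliding window can be computed directly to see how it fluctuates from one window position to the next. The sketch below is illustrative only: the 8 kHz sampling rate, the 10-ms pitch period, the toy decaying-cosine pulse, and the function names are assumptions, and the window lengths simply mirror the 4-ms and 20-ms Hamming windows mentioned in the lecture.

```python
import math


def hamming(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (length - 1))
            for k in range(length)]


def sliding_energy(x, window):
    """Energy of the windowed segment for every window start position tau."""
    L = len(window)
    return [sum((window[k] * x[tau + k]) ** 2 for k in range(L))
            for tau in range(len(x) - L + 1)]


def fluctuation(e):
    """Relative swing of the energy contour, between 0 (flat) and 1."""
    return (max(e) - min(e)) / max(e)


# Toy "voiced" waveform: a decaying pulse repeated every P samples.
P = 80  # 10-ms pitch period at an assumed 8 kHz sampling rate
x = [math.exp(-0.08 * (n % P)) * math.cos(2.0 * math.pi * 0.1 * (n % P))
     for n in range(10 * P)]

short = sliding_energy(x, hamming(32))    # 4 ms: shorter than one pitch period
long_ = sliding_energy(x, hamming(160))   # 20 ms: two pitch periods

print(round(fluctuation(short), 3), round(fluctuation(long_), 3))
```

With the short window the energy contour rises and falls strongly within every pitch period, while the longer window averages over two periods and fluctuates less; this is the time-domain picture behind the pitch-synchronous striations of a wideband spectrogram.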

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

The wideband spectrogram shows the formants of the vocal tract in frequency. It also gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram. The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
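As a rough worked calculation for the two window lengths, the sketch below uses the standard approximation that a Hamming window of N samples has a null-to-null main-lobe width of about 8*pi/N radians per sample, that is, 4*fs/N Hz. The 8 kHz sampling rate and the function name are assumptions made for illustration; the width in Hz depends only on the window duration.

```python
def hamming_mainlobe_width_hz(duration_s, sample_rate_hz):
    """Return (window length in samples, approximate null-to-null main-lobe width in Hz)."""
    n_samples = round(duration_s * sample_rate_hz)
    return n_samples, 4.0 * sample_rate_hz / n_samples


fs = 8000  # assumed sampling rate in Hz
for name, duration in [("narrowband, 20 ms", 0.020), ("wideband, 4 ms", 0.004)]:
    n_samples, width_hz = hamming_mainlobe_width_hz(duration, fs)
    print(name, n_samples, "samples,", width_hz, "Hz main lobe")
```

Under this approximation the 20-ms window has a main lobe of about 200 Hz and the 4-ms window about 1000 Hz. For a pitch of, say, 100 Hz (an illustrative value), the wide lobe spans many harmonics and so smears them into the spectral envelope, while the narrow lobe keeps neighboring harmonics largely separate.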

Figure 3.15

Figure 3.16

MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above a spectrogram (Frequency [Hz], 0 to 8000) of the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels: h#; she (sh iy); had (hv ae dcl jh); your (ax-h); dark (dcl d aa r kcl k); suit (s ux tcl); in (ax-h n); greasy (gcl g r iy s ix); wash (w aa sh); water (w ao dx axr); all (ao l); year (y ih axr); h#.]

MATLAB

[Figure: spectrogram, same axes as the first MATLAB figure.]

MATLAB

[Figure: waveform and labeled spectrogram of the same utterance, same layout as the second MATLAB figure.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Classification of speech sounds can also be done from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by the possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets. Syllables contain one or more phonemes. Words are formed from one or more syllables. Phrases are concatenations of words.

If the first two of the perspectives above (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17



April 22 2023 Veton Keumlpuska 64

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

For the steady-state voiced sound we can therefore express the wideband spectrogram roughly as (see Exercise 39)

Where is a constant scale factor and where E[n] is the energy in the waveform under the sliding window

][)(~)(2

EHS k

n

nxE 2][][

Spectrographic Analysis of Speech: Wideband Spectrogram (cont.)

The wideband spectrogram shows the formants of the vocal tract in frequency.
It also gives vertical striations in time every pitch period, rather than the harmonic horizontal striations of the narrowband spectrogram.
The vertical striations arise because the short window slides through fluctuating energy regions of the speech waveform.

Figure 3.15 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms.
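
The trade-off between the two window lengths can be put into numbers with a short sketch. The 16 kHz sampling rate is an assumption (the plots in the following slides extend to 8000 Hz), the function names are hypothetical, and the main-lobe width uses the usual approximation of 8*pi/N radians for an N-point Hamming window.

```python
def window_samples(duration_s, fs):
    """Number of samples in a window of the given duration."""
    return round(duration_s * fs)

def mainlobe_width_hz(duration_s):
    """Approximate null-to-null main-lobe width of a Hamming window.

    An N-point Hamming window has a main lobe about 8*pi/N radians wide,
    i.e. 4*fs/N Hz, which equals 4 divided by the window duration.
    """
    return 4.0 / duration_s

FS = 16000  # assumed sampling rate in Hz

for name, duration in (("narrowband", 0.020), ("wideband", 0.004)):
    print(name, window_samples(duration, FS), "samples,",
          round(mainlobe_width_hz(duration)), "Hz main lobe")
```

Under these assumptions the 20-ms window has a main lobe of roughly 200 Hz, narrow enough to keep pitch harmonics spaced 100 to 200 Hz apart roughly distinguishable, while the 4-ms window has a main lobe of roughly 1000 Hz, which smears the harmonics together but follows energy changes within a pitch period.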

Figure 3.15

Figure 3.16

MATLAB: spectrogram, frequency [Hz] (0 to 8000) versus time [s] (0 to 3).

MATLAB: waveform (normalized magnitude) and spectrogram (frequency [Hz], 0 to 8000) versus time [s] (0 to 3) for the sentence "She had your dark suit in greasy wash water all year", with word-level and phone-level labels.

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract (place and manner of articulation):
  The place of the tongue hump along the oral tract, and
  the degree of constriction of the hump.
  The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.

When the first two perspectives (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.
When the last two (the waveform and the spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.
This classification is illustrated in Figure 3.17 in the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:
  Vocal fold state: vibrating or open
  Tongue position and height: front, central, or back along the palate
  Constriction: partial or complete
  Velum state: nasal sound or not a nasal sound

Elements of a Language

In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number:
  11 in Polynesian
  141 in the "click" language of Khoisan
Rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are generally not allowed at the end of words.
A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
The articulatory properties are influenced by:
  Adjacent phonemes
  Speaking rate
  Emphasis in speaking
  The time-varying nature of the articulators
The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.
Motor theory of perception: uses articulatory features derived from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source: quasi-periodic airflow. Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
System: each vowel phoneme corresponds to a different vocal tract configuration.
Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers.
Articulatory differences between speakers are one cause of allophonic variation:
  The place and degree of constriction of the tongue hump, and
  the cross-section and length of the vocal tract.
The vocal tract formants therefore vary with the speaker.
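
How strongly vocal tract length alone can shift the formants may be sketched with the uniform lossless tube idealization (closed at the glottis, open at the lips), whose resonances fall at odd multiples of c/(4L). This model, the speed of sound, and the two tract lengths below are assumptions for illustration, not values from the slides; a real vowel configuration is not a uniform tube.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def tube_formants(length_m, count=3):
    """Resonances of a uniform tube closed at one end and open at the other.

    F_k = (2k - 1) * c / (4 * L) for k = 1, 2, ...
    """
    return [(2 * k - 1) * SPEED_OF_SOUND / (4.0 * length_m)
            for k in range(1, count + 1)]

# Two hypothetical speakers with different vocal tract lengths.
for label, length in (("longer tract, 17.5 cm", 0.175),
                      ("shorter tract, 14.0 cm", 0.140)):
    print(label, [round(f) for f in tube_formants(length)])
```

In this idealization every resonance scales inversely with length, so the shorter tract has all of its formants shifted upward by the same ratio.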

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.
System:
  The velum is lowered and the air flows mainly through the nasal cavity.
  Because the oral tract is constricted, the sound is radiated at the nostrils.
  Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
Spectrogram:
  It is dominated by the low resonance of the large volume of the nasal cavity.
  The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  The anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.
Source:
  For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.
System: the location of the constriction by the tongue determines which sound is produced: at the back, center, or front of the oral tract, as well as at the teeth or lips.
Spectrogram: noise-like; energy is concentrated at higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] \ast p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

A simplified model of the periodic component of the voiced fricative at the lips output is

$$x_g[n] = h[n] \ast (g[n] \ast p[n])$$

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component, the turbulent airflow velocity source at the constriction, is denoted by q[n] (assumed white noise). The glottal flow u[n] modulates this noise function q[n], which in turn excites the front oral cavity with impulse response h_f[n]:

$$x_q[n] = h_f[n] \ast (q[n]\,u[n])$$

We assume in this simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] \ast u[n] + h_f[n] \ast (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  u[n] is modified by the oral cavity.
  x_q[n] can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
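
The additive voiced-fricative model of Example 3.4 can be sketched numerically as follows. Everything concrete here is an assumption made for illustration: the sampling rate, the raised-cosine glottal pulse, the damped-cosine impulse responses standing in for h[n] and h_f[n], and the helper names. Only the structure (periodic branch plus glottally modulated noise branch) follows the example.

```python
import math
import random

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def damped_cosine(freq_hz, decay, length, fs):
    """Toy impulse response: an exponentially damped cosine."""
    return [(decay ** n) * math.cos(2 * math.pi * freq_hz * n / fs)
            for n in range(length)]

FS = 8000   # assumed sampling rate in Hz
P = 80      # pitch period in samples (100 Hz), assumed
N = 400     # number of output samples

# g[n]: one glottal cycle (raised cosine, 40 samples); p[n]: impulse train.
g = [0.5 * (1 - math.cos(2 * math.pi * k / 40)) for k in range(40)]
p = [1.0 if n % P == 0 else 0.0 for n in range(N)]
u = convolve(g, p)[:N]                       # u[n] = g[n] * p[n]

h = damped_cosine(500.0, 0.97, 100, FS)      # stand-in for the vocal tract
h_f = damped_cosine(3000.0, 0.90, 50, FS)    # stand-in for the front cavity

rng = random.Random(0)
q = [rng.gauss(0.0, 1.0) for _ in range(N)]  # white noise source
modulated = [qn * un for qn, un in zip(q, u)]  # q[n] u[n]

x_g = convolve(h, u)[:N]                     # periodic branch
x_q = convolve(h_f, modulated)[:N]           # noise branch
x = [a + b for a, b in zip(x_g, x_q)]        # x[n] = x_g[n] + x_q[n]
```

Because the noise is multiplied by u[n], it is present only while the glottis is open in this sketch, which is why the noise in a voiced fricative appears in bursts at the pitch rate rather than continuously.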

Elements of a Language: Fricatives

Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while
  voiced fricatives often show both noise and harmonics.
Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow.
As with fricatives, plosives can be voiced or unvoiced.
System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.
With voiced plosives, the vocal folds vibrate for the duration of all four steps:
  During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.
Figure 3.26 compares a voiced/unvoiced plosive pair.
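
The difference in delay between burst and vowel onset suggests a simple timing rule, sketched below. The 25 ms threshold is a hypothetical value picked to sit between "much shorter" and the 40-50 ms quoted above, and the function names and sample indices are illustrative only; detecting the burst and the voicing onset in a real waveform is a separate problem not addressed here.

```python
VOT_THRESHOLD_MS = 25.0  # hypothetical decision threshold

def voice_onset_time_ms(burst_sample, voicing_sample, fs):
    """Delay from the burst to the start of voicing, in milliseconds."""
    return 1000.0 * (voicing_sample - burst_sample) / fs

def label_plosive(vot_ms, threshold_ms=VOT_THRESHOLD_MS):
    """Label a plosive from its burst-to-voicing delay alone."""
    return "voiced" if vot_ms < threshold_ms else "unvoiced"

FS = 16000  # assumed sampling rate in Hz

# Hypothetical (burst, voicing onset) sample indices for two plosives.
for burst, onset in ((1000, 1720), (1000, 1160)):
    vot = voice_onset_time_ms(burst, onset, FS)
    print(vot, "ms ->", label_plosive(vot))
```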

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have present a periodic source throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] \ast p[n]$$

where g[n] is the glottal flow over one cycle and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n, m].

h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\,u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\,\delta[n - m]$$

We have assumed that the two outputs can be linearly combined.
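
The generalized (time-varying) convolution of Example 3.5 can be sketched directly from its definition. The two impulse responses below are toy single-mode decays invented for illustration, the periodic source is reduced to a bare impulse train, and all names are hypothetical; only the summation structure follows the example.

```python
def time_varying_filter(h, source):
    """Return y[n] = sum over m of h(n, m) * source[n - m].

    h(n, m) is the response at time n to a unit sample applied m samples
    earlier. The source is taken to be zero before n = 0.
    """
    return [sum(h(n, m) * source[n - m] for m in range(n + 1))
            for n in range(len(source))]

def h_tract(n, m):
    """Toy vocal tract: one decaying mode whose decay changes with time n."""
    a = 0.5 + 0.4 * min(n, 50) / 50.0
    return a ** m

def h_front(n, m):
    """Toy front cavity with a different time-varying decay."""
    b = 0.8 - 0.3 * min(n, 40) / 40.0
    return b ** m

N = 120   # number of samples
P = 20    # pitch period in samples, assumed

u = [1.0 if n % P == 0 else 0.0 for n in range(N)]   # stand-in periodic source
delta = [1.0 if n == 0 else 0.0 for n in range(N)]   # idealized burst at n = 0

x_periodic = time_varying_filter(h_tract, u)
x_burst = time_varying_filter(h_front, delta)
x = [a + b for a, b in zip(x_periodic, x_burst)]     # linear combination
```

When h(n, m) does not depend on n, the function reduces to ordinary convolution, which is a convenient sanity check on the indexing convention.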

Elements of a Language: Transitional Speech Sounds

Diphthongs
  They are vowel-like in nature, with vibrating vocal folds.
  They do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  They are characterized by movement from one vowel target to another.
  Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
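
The movement from one vowel target to another can be pictured as a formant track that glides between two (F1, F2) targets. In the sketch below the target frequencies are illustrative round numbers for an "a"-like and an "i"-like vowel, and the straight-line interpolation is a deliberate simplification; neither is taken from the slides.

```python
def glide(start, end, steps):
    """Linearly interpolate from one formant target to another."""
    return [tuple(s + (e - s) * i / (steps - 1) for s, e in zip(start, end))
            for i in range(steps)]

# Illustrative (F1, F2) targets in Hz; assumed values, not measurements.
A_LIKE = (730.0, 1090.0)
I_LIKE = (270.0, 2290.0)

track = glide(A_LIKE, I_LIKE, 5)
for f1, f2 in track:
    print(round(f1), round(f2))
```

F1 falls while F2 rises along the track, so a spectrogram of such a sound would show formants that move continuously instead of the steady bands of a single vowel.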

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
  Glides (w as in "we" and y as in "you"), and
  Liquids (r as in "read" and l as in "let").
Glides, compared to diphthongs, have
  greater constriction of the oral tract during the transition, and
  greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
  tS as in "chew" (unvoiced)
  J as in "just" (voiced)

Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
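
The idea that an articulator seeks a target but may never reach it can be illustrated with a toy first-order tracker. This is only a sketch under invented assumptions (a single scalar "position", a fixed approach rate, hypothetical function name); it captures the influence of past positions on the present, not the anticipation of future targets.

```python
def track_targets(targets, durations, rate=0.2, start=0.0):
    """Move a position toward each target by a fixed fraction per step."""
    position = start
    path = []
    for target, steps in zip(targets, durations):
        for _ in range(steps):
            position += rate * (target - position)
            path.append(position)
    return path

# The same two targets, spoken slowly and then quickly.
slow = track_targets([1.0, 0.0], [40, 40])
fast = track_targets([1.0, 0.0], [5, 5])

print("slow speech peak:", round(max(slow), 4))
print("fast speech peak:", round(max(fast), 4))
```

With long segments the position essentially reaches the first target; with short segments it turns back toward the second target well before arriving, a simple picture of target undershoot at higher speaking rates.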

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)
These rules are followed to convey differences in
  meaning,
  stress, and
  emotion.
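
Two of the prosodic quantities listed above, pitch and energy, can be measured frame by frame. The sketch below uses a plain autocorrelation peak search as the pitch estimator, which is one common approach and not necessarily the one used for the figures in these slides; the sampling rate, search range, and test tones are assumptions.

```python
import math

def frame_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)

def autocorr_pitch_hz(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch estimate: the lag in range with the largest autocorrelation."""
    lo = int(fs / fmax)
    hi = int(fs / fmin)
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, hi + 1):
        val = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag

FS = 8000  # assumed sampling rate in Hz

def tone(f0, amplitude, length=480):
    """Synthetic stand-in for a voiced frame: a pure tone at f0."""
    return [amplitude * math.sin(2 * math.pi * f0 * k / FS)
            for k in range(length)]

# A toy "intonation contour": pitch and loudness change between frames.
for f0, amp in ((100.0, 1.0), (200.0, 2.0)):
    frame = tone(f0, amp)
    print(autocorr_pitch_hz(frame, FS), "Hz, energy", round(frame_energy(frame), 1))
```

Tracking these two values across successive frames, together with segment durations, gives contours of the kind usually plotted to illustrate intonation, loudness, and timing.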

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 65: Speech Processing

April 22 2023 Veton Keumlpuska 65

Spectrographic Analysis of SpeechWideband Spectrogram (cont)

Shows the formants of the vocal tract in frequency also Gives vertical striations in time every pitch period rather

than the harmonic horizontal striations as in narrowband spectrogram Vertical striations arise because the short window is sliding

through fluctuating energy regions of the speech waveform

Figure 315 in the next slide compares the narrowband (20-ms Hamming window) and wideband (4-ms Hamming window) spectrograms

April 22 2023 Veton Keumlpuska 66

Figure 315

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

April 22 2023 Veton Keumlpuska 87

Elements of a Language Fricatives Spectrogram

Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while

Voiced fricatives often show both noise and harmonics

Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on

quasi-periodic signal

Elements of a Language Whisper Whisper

Forms a class of its own under general category of Consonants

Turbulent flow is produced at the glottis rather than at the vocal tract constriction

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 324 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 323

April 22 2023 Veton Keumlpuska 91

Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however

brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced

System Constriction can occur at

Front Center or Back of the oral tract (Figure 324)

Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst

With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no

aspiration There is much shorter delay between the burst and the voicing of the vowel onset

Figure 326 compares voicedunvoiced plosive pair

April 22 2023 Veton Keumlpuska 92

Elements of a Language Plosives Waveform

April 22 2023 Veton Keumlpuska 93

Elements of a Language Plosives Spectrogram

April 22 2023 Veton Keumlpuska 94

Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive

Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept

introduced in Chapter 2

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]

h[nm] and hf[nm] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier at time n-m

The output then can be written using generalization of the convolution operator

We have assumed that two outputs can be linearly combined

mnmnhmnumnhnxm

fm

April 22 2023 Veton Keumlpuska 95

Elements of a Language Transitional Speech Sounds Diphthongs

Vowel like nature with vibrating vocal folds

Do not have a steady vocal tract configuration They are produced by

varying in time the vocal tract smoothly between two vowel configurations

Characterized by movement from one vowel target to another

hide Y out W boy O new JU

April 22 2023 Veton Keumlpuska 96

Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like

sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)

Glides Greater constriction of oral tract during the

transition and Greater speed of the oral tract movement

compared to diphthongs

April 22 2023 Veton Keumlpuska 97

Figure 328 ndash Liquids amp Glides

April 22 2023 Veton Keumlpuska 98

Elements of a Language Transitional Speech Sounds Affricates are the counterpart of

diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that

the affricates have A fricative portion preceded by a complete

constriction of the oral cavity Formed at the same place as for the plosive

Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced

Coarticulation
Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.
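The first two prosodic cues listed above (pitch and energy) can be tracked frame by frame. The following is a minimal sketch, not a method taken from the slides: it assumes NumPy, a 40 ms frame with a 10 ms hop, and a plain autocorrelation peak search for pitch. The function name `frame_energy_and_pitch` and all parameter values are illustrative choices, and no voiced/unvoiced decision is made, so pitch values for silent or noisy frames are meaningless.

```python
import numpy as np

def frame_energy_and_pitch(x, fs, frame_ms=40.0, hop_ms=10.0, fmin=60.0, fmax=400.0):
    """Return per-frame short-time energy and an autocorrelation pitch estimate (Hz)."""
    n_win = int(fs * frame_ms / 1000.0)
    n_hop = int(fs * hop_ms / 1000.0)
    lag_min = int(fs / fmax)          # shortest pitch period considered (samples)
    lag_max = int(fs / fmin)          # longest pitch period considered (samples)
    energies, pitches = [], []
    for start in range(0, len(x) - n_win + 1, n_hop):
        frame = x[start:start + n_win]
        energies.append(float(np.sum(frame ** 2)))       # loudness cue
        frame = frame - frame.mean()
        # One-sided autocorrelation: index 0 corresponds to lag 0
        ac = np.correlate(frame, frame, mode="full")[n_win - 1:]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
        pitches.append(fs / lag)                          # intonation cue
    return np.array(energies), np.array(pitches)

# Synthetic check: a steady 200 Hz tone sampled at 8 kHz
fs = 8000
t = np.arange(int(0.5 * fs)) / fs
tone = np.sin(2 * np.pi * 200.0 * t)
energy, pitch = frame_energy_and_pitch(tone, fs)
```

Plotting `pitch` and `energy` against frame time gives a rough intonation and loudness contour; timing (the third cue) would additionally require segment boundaries, which this sketch does not estimate.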

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation
Figure 3.15

Figure 3.16

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]

MATLAB
[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB
[Figure: waveform (Normalized Magnitude vs. Time [s], 0 to 3) above its spectrogram (Frequency [Hz], 0 to 8000) for the utterance "She had your dark suit in greasy wash water all year", annotated with word and phone labels.]
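The MATLAB code behind the spectrogram slides is not part of this transcript. As a rough stand-in, the sketch below computes a short-time Fourier magnitude in Python with NumPy and contrasts a short (wide-band) and a long (narrow-band) analysis window. The function name `stft_spectrogram`, the window lengths (4 ms and 32 ms), the Hamming window, and the synthetic impulse-train test signal are all assumptions made for illustration.

```python
import numpy as np

def stft_spectrogram(x, fs, win_ms, hop_ms=1.0):
    """Return (times, freqs, S_db) from a Hamming-windowed short-time Fourier transform."""
    n_win = int(round(fs * win_ms / 1000.0))
    n_hop = int(round(fs * hop_ms / 1000.0))
    n_fft = 1 << int(np.ceil(np.log2(n_win)))        # next power of two >= n_win
    window = np.hamming(n_win)
    n_frames = 1 + (len(x) - n_win) // n_hop
    frames = np.stack([x[i * n_hop: i * n_hop + n_win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    s_db = 20.0 * np.log10(spec + 1e-12)             # small offset avoids log(0)
    times = (np.arange(n_frames) * n_hop + n_win / 2.0) / fs
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return times, freqs, s_db.T                      # rows: frequency, columns: time

fs = 16000
period = 128                                         # 125 Hz pitch at 16 kHz
x = np.zeros(fs // 2)
x[::period] = 1.0                                    # impulse train as a crude voiced source

t_wide, f_wide, S_wide = stft_spectrogram(x, fs, win_ms=4.0)      # wide-band
t_narrow, f_narrow, S_narrow = stft_spectrogram(x, fs, win_ms=32.0)  # narrow-band
```

With the short window each frame sees at most one pulse, so the wide-band result is flat in frequency and shows individual pitch pulses in time. With the long window each frame spans several pitch periods, so the narrow-band result resolves the harmonics of 125 Hz instead.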

Categorization of Speech Sounds
A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.

2. The shape of the vocal tract (place and manner of articulation):
  • the place of the tongue hump along the oral tract,
  • the degree of constriction of the hump, and
  • the possible connection to the nasal passage by way of the velum.

3. The time-domain waveform, which gives the pressure change with time at the lips output.

4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language
Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two descriptors (source and vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

Elements of a Language
One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language
Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language
In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

Rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are generally not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by:
  • adjacent phonemes,
  • speaking rate,
  • emphasis in speaking, and
  • the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.

Motor theory of perception: uses articulatory features, together with the acoustic temporal and spectral features of the speech waveform, to study the sounds in a language.

Elements of a Language Vowels
  • Source: quasi-periodic.
  • Pitch: not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 on the slide after next).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences between speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, all differ. The vocal tract formants therefore vary with the speaker.
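To see why vocal tract length alone shifts the formants, one common idealization (not stated on this slide) treats the vocal tract as a lossless uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / (4L). The sketch below assumes a sound speed of 350 m/s and two illustrative tract lengths; the function name `uniform_tube_formants` is hypothetical.

```python
def uniform_tube_formants(length_m, n_formants=3, c=350.0):
    """Resonances (Hz) of a lossless uniform tube closed at one end, open at the other."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n_formants + 1)]

# Assumed lengths for illustration only
longer_tract = uniform_tube_formants(0.175)   # approximately 500, 1500, 2500 Hz
shorter_tract = uniform_tube_formants(0.14)   # every resonance moves up in frequency

for f_long, f_short in zip(longer_tract, shorter_tract):
    print(f"{f_long:7.1f} Hz  ->  {f_short:7.1f} Hz")
```

Real vowels depart from the uniform tube because of the tongue constriction, so this only illustrates the length effect, not the formant pattern of any particular vowel.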

Figure 3.18

Figure 3.19

Elements of a Language Nasals
Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System:
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
  • It is dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language Fricatives
There are two broad classes of fricatives: voiced and unvoiced.

Source:
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: the location of the constriction made by the tongue or lips determines which sound is produced. It can be at the back, center, or front of the oral tract, or at the teeth or lips.

Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4
A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model, the periodic component of the output at the lips is

$x_g[n] = h[n] * (g[n] * p[n])$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component is the turbulent airflow velocity at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\,u[n])$

Example 3.4 (continued)
In the simplified model we assume that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
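The simplified voiced-fricative model of Example 3.4 can be simulated directly. The sketch below uses NumPy and makes several assumptions that are not in the slides: a Hann-shaped glottal pulse for g[n], damped cosines as stand-ins for the impulse responses h[n] and h_f[n], a 16 kHz sampling rate, and a fixed random seed. The helper `damped_cosine` is a hypothetical name.

```python
import numpy as np

def damped_cosine(f_hz, bw_hz, fs, n):
    """Decaying cosine used here as a simple stand-in for a single resonance."""
    k = np.arange(n)
    return np.exp(-np.pi * bw_hz * k / fs) * np.cos(2 * np.pi * f_hz * k / fs)

fs, P, N = 16000, 128, 4096           # sample rate, pitch period (samples), signal length
rng = np.random.default_rng(0)

g = np.hanning(64)                    # assumed glottal flow over one cycle
p = np.zeros(N)
p[::P] = 1.0                          # impulse train with pitch period P
u = np.convolve(g, p)[:N]             # u[n] = g[n] * p[n]

h = damped_cosine(500.0, 100.0, fs, 512)      # assumed whole-tract response h[n]
h_f = damped_cosine(4000.0, 600.0, fs, 512)   # assumed front-cavity response h_f[n]
q = rng.standard_normal(N)            # white noise at the constriction

x_g = np.convolve(h, u)[:N]           # periodic component  h * u
x_q = np.convolve(h_f, q * u)[:N]     # noise component     h_f * (q u)
x = x_g + x_q                         # the two sources are assumed to add
```

Because the noise is multiplied by u[n] before filtering, the noise excitation is gated by the glottal flow: it is exactly zero wherever u[n] is zero, so noise bursts occur once per pitch period.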

Elements of a Language Fricatives
Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language Whisper
Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language Plosives
Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives:
  • The vocal folds vibrate for the duration of all four steps.
  • During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language Plosives
Waveform

Elements of a Language Plosives
Spectrogram

Elements of a Language Plosives
Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator. It must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n,m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response $h_f[n,m]$.

$h[n,m]$ and $h_f[n,m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n-m$.

The output can then be written using a generalization of the convolution operator:

$x[n] = \sum_{m} h[n,m]\,u[n-m] + \sum_{m} h_f[n,m]\,\delta[n-m]$

We have assumed that the two outputs can be linearly combined.
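The generalized convolution of Example 3.5 can be evaluated numerically by storing the time-varying impulse response as a two-dimensional array indexed by output time n and delay m. The following NumPy sketch is an illustration only: the gliding-resonance form chosen for h[n,m], the fixed-resonance h_f[n,m], the Hann-shaped glottal pulse, and all numeric values are assumptions, and `time_varying_filter` is a hypothetical helper name.

```python
import numpy as np

def time_varying_filter(h, u):
    """Compute y[n] = sum_m h[n, m] * u[n - m] for a causal input u.

    h[n, m] is the response at time n to a unit sample applied m samples
    earlier, stored as an array of shape (N, M).
    """
    N, M = h.shape
    y = np.zeros(N)
    for n in range(N):
        m = np.arange(min(M, n + 1))       # keep n - m >= 0
        y[n] = np.dot(h[n, m], u[n - m])
    return y

fs, N, M, P = 8000, 800, 200, 64
n_idx = np.arange(N)[:, None]              # output time, one row per n
m_idx = np.arange(M)[None, :]              # delay, one column per m

# Assumed responses: a tract resonance gliding from 1500 Hz to 700 Hz,
# and a fixed 3000 Hz front-cavity resonance.
f_tract = 1500.0 - 800.0 * n_idx / (N - 1)
h = np.exp(-m_idx / 40.0) * np.cos(2 * np.pi * f_tract * m_idx / fs)
h_f = np.exp(-m_idx / 15.0) * np.cos(2 * np.pi * 3000.0 * m_idx / fs) * np.ones((N, 1))

g = np.hanning(32)                         # assumed glottal pulse
p = np.zeros(N)
p[::P] = 1.0                               # impulse train with pitch period P
u = np.convolve(g, p)[:N]                  # u[n] = g[n] * p[n]
delta = np.zeros(N)
delta[0] = 1.0                             # burst idealized as an impulse at n = 0

burst_part = time_varying_filter(h_f, delta)
x = time_varying_filter(h, u) + burst_part
```

Two sanity checks follow from the definition: if every row of h is identical, the result reduces to ordinary convolution, and the burst term reduces to h_f[n, n], the front-cavity response at time n to the impulse applied at time 0.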

Elements of a Language Transitional Speech Sounds
Diphthongs
  • Vowel-like nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.

Examples: "hide" /Y/, "out" /W/, "boy" /O/, "new" /JU/.


Page 67: Speech Processing

April 22 2023 Veton Keumlpuska 67

Figure 316

MATLAB

April 22 2023 Veton Keumlpuska 68

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

April 22 2023 Veton Keumlpuska 87

Elements of a Language Fricatives Spectrogram

Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while

Voiced fricatives often show both noise and harmonics

Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on

quasi-periodic signal

Elements of a Language Whisper Whisper

Forms a class of its own under general category of Consonants

Turbulent flow is produced at the glottis rather than at the vocal tract constriction

April 22 2023 Veton Keumlpuska 88

April 22 2023 Veton Keumlpuska 89

Figure 324 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 323

April 22 2023 Veton Keumlpuska 91

Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however

brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced

System Constriction can occur at

Front Center or Back of the oral tract (Figure 324)

Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst

With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no

aspiration There is much shorter delay between the burst and the voicing of the vowel onset

Figure 326 compares voicedunvoiced plosive pair

Elements of a Language: Plosives
Waveform

Elements of a Language: Plosives
Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary convolution operator. It must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n, m].

h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

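The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n-m]$$

where δ[n] is the unit impulse that models the burst, and we have assumed that the two outputs can be linearly combined.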
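As a rough numerical illustration of the time-varying model in Example 3.5, the following Python sketch evaluates the two superposition sums directly. Everything specific in it is an assumption made for illustration and does not come from the slides: the function names, the toy glottal pulse, the one-pole response whose pole glides over time (standing in for h[n, m] and h_f[n, m]), and all numeric values.

```python
def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples, starting at n = 0."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def convolve(a, b, length):
    """Ordinary (time-invariant) convolution, truncated to `length` samples."""
    out = [0.0] * length
    for n in range(length):
        for k in range(len(a)):
            if 0 <= n - k < len(b):
                out[n] += a[k] * b[n - k]
    return out


def time_varying_output(h, source, length, memory):
    """y[n] = sum over m in [0, memory) of h(n, m) * source[n - m].

    h(n, m) is the response at time n to a unit sample applied m samples earlier.
    """
    out = [0.0] * length
    for n in range(length):
        for m in range(memory):
            if 0 <= n - m < len(source):
                out[n] += h(n, m) * source[n - m]
    return out


def make_response(pole_start, pole_end, length):
    """Hypothetical time-varying one-pole response h(n, m) = a(n) ** m,
    where the pole a(n) glides linearly from pole_start to pole_end."""
    def h(n, m):
        a = pole_start + (pole_end - pole_start) * n / (length - 1)
        return a ** m
    return h


N, P, MEMORY = 64, 16, 32
g = [0.5, 1.0, 0.5]                       # toy glottal pulse g[n] (assumed)
u = convolve(g, impulse_train(N, P), N)   # u[n] = g[n] * p[n]
burst = [1.0] + [0.0] * (N - 1)           # burst idealized as delta[n]

h = make_response(0.5, 0.9, N)            # vocal tract h[n, m] (assumed)
h_f = make_response(0.3, 0.3, N)          # front cavity h_f[n, m] (assumed, constant pole)

x_voiced = time_varying_output(h, u, N, MEMORY)
x_burst = time_varying_output(h_f, burst, N, MEMORY)
x = [a + b for a, b in zip(x_voiced, x_burst)]  # the two outputs are added linearly
```

When the pole does not move (pole_start equal to pole_end), h(n, m) no longer depends on n and the sum reduces to ordinary convolution, which is a convenient sanity check for the sketch.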

Elements of a Language: Transitional Speech Sounds

Diphthongs
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-Vowels
Two categories of vowel-like sounds:
Glides (w as in "we" and y as in "you"), and
Liquids (r as in "read" and l as in "let").

Glides, compared to diphthongs, have:
Greater constriction of the oral tract during the transition, and
Greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples: tS as in "chew" (unvoiced) and J as in "just" (voiced).

Coarticulation

Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").

Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/energy (loudness)
Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


MATLAB

[Figure: spectrogram. Axes: Time [s], 0 to 3; Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude, -1.5 to 2) above a spectrogram (Frequency [Hz], 0 to 8000), both over Time [s], 0 to 3, for the utterance "She had your dark suit in greasy wash water all year". Phone labels, bracketed by "h" at both ends: she (sh iy), had (hv ae dcl jh), your (ax-h), dark (dcl d aa r kcl k), suit (s ux tcl), in (ax-h n), greasy (gcl g r iy s ix), wash (w aa sh), water (w ao dx axr), all (ao l), year (y ih axr).]

MATLAB

[Figure: spectrogram, same axes as the first spectrogram above.]

MATLAB

[Figure: annotated waveform and spectrogram of the same utterance, with the same axes and phone labels as above.]

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Classification of speech sounds can also be done from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
2. The shape of the vocal tract, that is, the place and manner of articulation: the place of the tongue hump along the oral tract and the degree of constriction at the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips.
4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

Phoneme: a fundamental distinctive unit of a language.
To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.

Different languages contain different phoneme sets.
Syllables contain one or more phonemes.
Words are formed from one or more syllables.
Phrases are concatenations of words.

If the first two perspectives in the classification of speech sounds (the source and the vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two (the waveform and the spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
Articulatory features (corresponding to the first two category descriptors) include:

Vocal fold state: vibrating or open

Tongue position and height: front, central, or back along the palate

Constriction: partial or complete

Velum state: nasal sound or not a nasal sound
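The articulatory feature categories listed above can be captured in a small data structure. The sketch below is only an illustration: the class and field names are invented here, and the feature values assigned to p, b, and m are a coarse textbook-style approximation (all three treated as front, complete constrictions) rather than values given in the slides.

```python
from dataclasses import dataclass, fields
from enum import Enum


class VocalFolds(Enum):
    VIBRATING = "vibrating"
    OPEN = "open"


class Place(Enum):
    FRONT = "front"
    CENTRAL = "central"
    BACK = "back"


class Constriction(Enum):
    PARTIAL = "partial"
    COMPLETE = "complete"


@dataclass(frozen=True)
class ArticulatoryFeatures:
    vocal_folds: VocalFolds
    place: Place
    constriction: Constriction
    nasal: bool  # True when the velum is lowered


def differing_features(a, b):
    """Return the names of the features on which two phonemes differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Assumed, simplified feature assignments for three lip-closure consonants.
PHONEMES = {
    "p": ArticulatoryFeatures(VocalFolds.OPEN, Place.FRONT, Constriction.COMPLETE, False),
    "b": ArticulatoryFeatures(VocalFolds.VIBRATING, Place.FRONT, Constriction.COMPLETE, False),
    "m": ArticulatoryFeatures(VocalFolds.VIBRATING, Place.FRONT, Constriction.COMPLETE, True),
}

print(differing_features(PHONEMES["p"], PHONEMES["b"]))  # ['vocal_folds']
print(differing_features(PHONEMES["b"], PHONEMES["m"]))  # ['nasal']
```

Under these assumed assignments, p and b differ only in vocal fold state, and b and m differ only in velum state, which is the sense in which a phoneme arises from a combination of features.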

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Vowels

Source: quasi-periodic.
Pitch is not important for categorizing a sound in English. In tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.

System: each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19, two slides ahead).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Nasals

Source: quasi-periodic airflow puffs from the vibrating vocal folds.

System:
The velum is lowered and the air flows mainly through the nasal cavity.
Because the oral tract is constricted, the sound is radiated at the nostrils.
Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
Is dominated by the low resonance of the large volume of the nasal cavity.
The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
Noise is generated by turbulent airflow at some point of constriction along the oral tract.
The constriction is narrower than with vowels.

System:
The location of the constriction by the tongue or lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram: noise-like, with energy concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

Simplified model of the voiced fricative output at the lips due to the periodic component:

x_g[n] = h[n] * (g[n] * p[n])

where h[n] is the impulse response of a linear time-invariant vocal tract excited by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response h_f[n]:

x_q[n] = h_f[n] * (q[n] u[n])

Example 3.4

In the simplified model we assume that the results of the two airflow sources add:

x[n] = x_g[n] + x_q[n]
     = h[n] * u[n] + h_f[n] * (q[n] u[n])

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
u[n] is modified by the oral cavity.
x_q[n] can be influenced by the back cavity.
Sources of non-linear effects (distributed sources due to traveling vortices).
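The simplified voiced-fricative model of Example 3.4 can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the method used to produce any figure in the deck: the toy glottal pulse, the decaying exponential used for h[n], the two-tap high-frequency-emphasis filter used for h_f[n], the seeded Gaussian noise for q[n], and all lengths are invented for the example.

```python
import random


def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples, starting at n = 0."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def convolve(a, b, length):
    """Linear convolution of a and b, truncated to `length` samples."""
    out = [0.0] * length
    for n in range(length):
        for k in range(len(a)):
            if 0 <= n - k < len(b):
                out[n] += a[k] * b[n - k]
    return out


N, P = 80, 20
rng = random.Random(0)                        # seeded so the sketch is repeatable

g = [0.25, 0.75, 1.0, 0.5]                    # toy glottal pulse g[n] (assumed)
u = convolve(g, impulse_train(N, P), N)       # periodic glottal flow u[n] = g[n] * p[n]
q = [rng.gauss(0.0, 1.0) for _ in range(N)]   # white noise source q[n]

h = [0.9 ** n for n in range(30)]             # vocal tract impulse response (assumed)
h_f = [1.0, -0.8]                             # front cavity impulse response (assumed)

x_g = convolve(h, u, N)                            # x_g[n] = h[n] * u[n]
modulated = [qn * un for qn, un in zip(q, u)]      # noise modulated by the glottal flow
x_q = convolve(h_f, modulated, N)                  # x_q[n] = h_f[n] * (q[n] u[n])
x = [a + b for a, b in zip(x_g, x_q)]              # the two contributions add
```

Because the noise is multiplied by u[n], the noise term is nonzero only while the glottal flow is nonzero, so in this sketch the noise appears in bursts at the pitch rate.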

Elements of a Language: Fricatives

Spectrogram:
Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform:
An unvoiced fricative contains only noise.
A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper
Forms a class of its own under the general category of consonants.
Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Page 69: Speech Processing

MATLAB

April 22 2023 Veton Keumlpuska 69

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

MATLAB

April 22 2023 Veton Keumlpuska 70

Freq

uenc

y [H

z]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

1000

2000

3000

4000

5000

6000

7000

8000

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

Elements of a Language: Fricatives

Spectrogram
  • Unvoiced fricatives are characterized by a "noisy" spectrum.
  • Voiced fricatives often show both noise and harmonics.

Waveform
  • An unvoiced fricative contains only noise.
  • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System
  • The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.

Voiced plosives
  • The vocal folds vibrate for the duration of all four steps.
  • While the oral tract is closed, we hear a low-frequency vibration caused by vocal fold vibrations propagating through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • The delay between the burst and the onset of voicing of the vowel is much shorter.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (waveform)

Elements of a Language: Plosives (spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, because its shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n, m]$, while the burst excites a time-varying front cavity beyond the constriction with impulse response $h_f[n, m]$. Here $h[n, m]$ and $h_f[n, m]$ are the responses at time $n$ to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:

$x[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]$
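The generalized convolution of Example 3.5 can be sketched as follows. This is a minimal illustration, not the method of the slides: the function names, the truncation of the sum at `max_lag`, and the gliding-resonance form used for a time-varying response are assumptions introduced here. The source is taken to be zero before n = 0.

```python
import math


def gliding_resonance(start_freq, end_freq, decay, transition):
    """Build a hypothetical time-varying response h[n, m].

    The response is a decaying cosine whose frequency (radians/sample)
    moves linearly from start_freq to end_freq over `transition` samples.
    """
    def response(n, m):
        if m < 0:
            return 0.0  # causal: no response before the input is applied
        alpha = min(max(n / transition, 0.0), 1.0)
        freq = (1.0 - alpha) * start_freq + alpha * end_freq
        return (decay ** m) * math.cos(freq * m)
    return response


def voiced_plosive(u, h, hf, max_lag):
    """x[n] = sum_m h[n, m] u[n - m] + sum_m hf[n, m] delta[n - m].

    u  : periodic glottal flow, assumed zero for n < 0
    h  : callable h(n, m), time-varying vocal tract response
    hf : callable hf(n, m), time-varying front-cavity response
    The sums are truncated to lags 0..max_lag (an assumption of this sketch).
    """
    out = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m] for m in range(min(n, max_lag) + 1))
        # delta[n - m] is nonzero only at m = n, so the burst term is hf[n, n].
        burst = hf(n, n) if n <= max_lag else 0.0
        out.append(periodic + burst)
    return out
```

When h(n, m) does not depend on n, the first sum reduces to ordinary convolution, which gives a simple sanity check for the sketch.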

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
  • Glides (/w/ as in "we" and /y/ as in "you")
  • Liquids (/r/ as in "read" and /l/ as in "let")

Glides, compared to diphthongs, have
  • greater constriction of the oral tract during the transition, and
  • greater speed of the oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs and consist of consonant plosive-fricative combinations. Compared to fricatives, affricates have a fricative portion that is preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples
  • /tS/ as in "chew" (unvoiced)
  • /J/ as in "just" (voiced)

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape, but often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

MATLAB

[Figure: spectrogram. Horizontal axis: Time [s], 0 to 3. Vertical axis: Frequency [Hz], 0 to 8000.]

MATLAB

[Figure: waveform (Normalized Magnitude, -1.5 to 2) above a spectrogram (Frequency [Hz], 0 to 8000) of the utterance "She had your dark suit in greasy wash water all year". Horizontal axis: Time [s], 0 to 3. The utterance is labeled with words and phones, with h# marking its start and end:]

  she: sh iy
  had: hv ae dcl jh
  your: ax-h
  dark: dcl d aa r kcl k
  suit: s ux tcl
  in: ax-h n
  greasy: gcl g r iy s ix
  wash: w aa sh
  water: w ao dx axr
  all: ao l
  year: y ih axr

Categorization of Speech Sounds

A sound source can be created with either the vocal folds or a constriction in the vocal tract.

Speech sounds can also be classified from the following perspectives:
  1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
  2. The shape of the vocal tract (place and manner of articulation):
     • the place of the tongue hump along the oral tract,
     • the degree of constriction of the hump, and
     • a possible connection to the nasal passage by way of the velum.
  3. The time-domain waveform, which gives the pressure change with time at the lips output.
  4. The time-varying spectral characteristics revealed through the spectrogram.

Elements of a Language

  • Phoneme: a fundamental distinctive unit of a language.
  • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
  • Syllables contain one or more phonemes.
  • Words are formed from one or more syllables.
  • Phrases are concatenations of words.

If the first two descriptors (source and vocal tract shape) are used to study speech sounds, the study is referred to as articulatory phonetics.

If the last two descriptors (waveform and spectrogram) are used, it is referred to as acoustic phonetics.

Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language

  • In English, the combinations of features give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.
  • The rules of a language define which phones can be strung together, and how, to form words. In Italian, for example, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).
  • The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the /t/ in each word is somewhat different.
  • Motor theory of perception: uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source
  • Quasi-periodic.
  • Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on pitch.

System
  • Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram
  • The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform
  • Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers:
  • Articulatory differences between speakers are one cause of allophonic variations.
  • The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ between speakers; therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source
  • Quasi-periodic airflow puffs from the vibrating vocal folds.

System
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram
  • Dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20


Page 71: Speech Processing

MATLAB

April 22 2023 Veton Keumlpuska 71

0 05 1 15 2 25 3

-15

-1

-05

0

05

1

15

2

Nor

mal

ized

Mag

nitu

deFr

eque

ncy

[Hz]

Time [s]

SPECTROGRAM

0 05 1 15 2 25 30

2000

4000

6000

8000

She had your dark suit in greasy wash water all year

h

sh

iy

she

hv

ae

dcl

jh

had

ax-h

your

dcl

d

aa

rkc

lk

dark

s

ux

tcl

suit

ax-h

n

in

gcl

g

r

iy

s

ix

greasy

w

aa

sh

wash

w

ao

dx

axr

water

ao

l

all

y

ih

axr

year

h

April 22 2023 Veton Keumlpuska 72

Categorization of Speech Sounds Sound source can be created with either the

vocal folds or constriction in the vocal tract

Classification of speech sounds can be also be done from the following perspectives

1 The nature of the source Periodic Noisy Impulsive or Combination of the three

2 The shape of vocal tract - place and manner of articulation Place of the tongue hump along the oral tact and The degree of the constriction of the hump The shape is also determined by possible connection to the nasal

passage by way of velum3 The time-domain waveform which gives the pressure change with

time at the lips output4 The time-varying spectral characteristics revealed through the

spectrogram

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22, 2023 Veton Këpuska 84

Elements of a Language Fricatives
  • There are two broad classes of fricatives: voiced and unvoiced.
  • Source:
    • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
    • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
    • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
    • The constriction is narrower than with vowels.
  • System: the location of the constriction made by the tongue or lips determines which sound is produced:
    • Back, center, or front of the oral tract, as well as
    • The teeth or lips.
  • Spectrogram:
    • Noise-like.
    • Energy is concentrated in the higher frequencies.

April 22, 2023 Veton Këpuska 85

Example 3.4
  • A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

    $u[n] = g[n] * p[n]$

    where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.
  • Simplified model of the periodic component of the output at the lips:

    $x_g[n] = h[n] * (g[n] * p[n])$

    where $h[n]$ is the impulse response of the vocal tract, assumed linear and time-invariant, excited by the periodic signal $u[n]$.
  • The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

    $x_q[n] = h_f[n] * (q[n]\,u[n])$

April 22, 2023 Veton Këpuska 86

Example 3.4
  • In this simplified model we assume that the results of the two airflow sources add:

    $x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$

  • See Exercise 3.10 for special characteristics of $x[n]$.
  • Issues that have been ignored:
    • $u[n]$ is modified by the oral cavity.
    • $x_q[n]$ can be influenced by the back cavity.
    • Sources of non-linear effects (distributed sources due to traveling vortices).
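The additive model of Example 3.4 can be tried numerically. The following Python sketch is only an illustration under stated assumptions: the raised-cosine glottal pulse, the two damped-cosine impulse responses, the sampling rate, and the pitch period are all hypothetical choices, and none of them come from the slides. It builds the periodic source by convolution, modulates white noise with it, and adds the two filtered components.

```python
import math
import random

FS = 8000    # sampling rate in Hz (assumed)
NUM = 400    # number of samples to synthesize
P = 80       # pitch period in samples, i.e. 100 Hz at FS (assumed)


def convolve(a, b):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_val in enumerate(a):
        for j, b_val in enumerate(b):
            out[i + j] += a_val * b_val
    return out


def damped_cosine(freq_hz, decay, length):
    """Hypothetical single-resonance impulse response."""
    return [(decay ** n) * math.cos(2.0 * math.pi * freq_hz * n / FS)
            for n in range(length)]


# g[n]: one glottal cycle (assumed raised-cosine pulse, 40 samples long).
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 40)) for n in range(40)]
# p[n]: impulse train with pitch period P.
p = [1.0 if n % P == 0 else 0.0 for n in range(NUM)]
# u[n] = g[n] * p[n] (convolution), truncated to NUM samples.
u = convolve(g, p)[:NUM]

h = damped_cosine(500.0, 0.97, 100)    # whole vocal tract (assumed)
hf = damped_cosine(3000.0, 0.90, 50)   # front cavity (assumed)

rng = random.Random(0)                 # seeded so the run is repeatable
q = [rng.gauss(0.0, 1.0) for _ in range(NUM)]  # white noise source

xg = convolve(h, u)[:NUM]                                      # periodic part
xq = convolve(hf, [qn * un for qn, un in zip(q, u)])[:NUM]     # noise part
x = [a + b for a, b in zip(xg, xq)]                            # x = xg + xq
```

Because the noise is multiplied by `u`, the noise excitation vanishes wherever the assumed glottal pulse is zero, so in this sketch the frication noise is pitch-synchronous.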

April 22, 2023 Veton Këpuska 87

Elements of a Language Fricatives
  • Spectrogram:
    • Unvoiced fricatives are characterized by a "noisy" spectrum, while
    • Voiced fricatives often show both noise and harmonics.
  • Waveform:
    • An unvoiced fricative contains only noise.
    • A voiced fricative contains noise superimposed on a quasi-periodic signal.

April 22, 2023 Veton Këpuska 88

Elements of a Language Whisper
  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

April 22, 2023 Veton Këpuska 89

Figure 3.24 - Fricatives

April 22, 2023 Veton Këpuska 90

Figure 3.23

April 22, 2023 Veton Këpuska 91

Elements of a Language Plosives
  • Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
  • As with fricatives, plosives can be voiced or unvoiced.
  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  • Sequence of events:
    1. Complete closure of the oral tract and buildup of air pressure
    2. Release of air pressure and generation of turbulence over a very short time duration
    3. Generation of aspiration due to turbulence at the open vocal folds
    4. Onset of the following vowel about 40-50 ms after the burst
  • With voiced plosives, the vocal folds vibrate for the duration of all 4 steps:
    • During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
    • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
    • There is a much shorter delay between the burst and the voicing of the vowel onset.
  • Figure 3.26 compares a voiced/unvoiced plosive pair.

April 22, 2023 Veton Këpuska 92

Elements of a Language Plosives: Waveform

April 22, 2023 Veton Këpuska 93

Elements of a Language Plosives: Spectrogram

April 22, 2023 Veton Këpuska 94

Elements of a Language Plosives
Example 3.5: A time-varying system model for the voiced plosive
  • A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
  • Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$.
  • The glottal flow velocity model for the periodic source component is given by

    $u[n] = g[n] * p[n]$

    where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.
  • Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during its transition from the burst to a following steady vowel.
    • This implies that the vocal tract output cannot be obtained with the ordinary convolution operator.
    • The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
  • In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response $h[n,m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response $h_f[n,m]$.
  • $h[n,m]$ and $h_f[n,m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n-m$.
  • Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:

    $x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$
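The generalized convolution of Example 3.5 can be sketched in a few lines of Python. Everything numeric here is an assumption made for illustration: the two response functions `h_vocal_tract` and `h_front_cavity` are hypothetical damped resonances (the first with a frequency that glides with n to mimic a changing tract shape), and the glottal pulse, pitch period, and sampling rate are arbitrary. The signals are taken to be causal, so the sum over m runs from 0 to n.

```python
import math

FS = 8000   # sampling rate in Hz (assumed)
NUM = 200   # number of output samples
P = 80      # pitch period in samples (assumed)


def h_vocal_tract(n, m):
    """Hypothetical h[n, m]: response at time n to a unit sample applied
    m samples earlier; the resonance frequency glides upward with n."""
    freq_hz = 300.0 + 400.0 * n / NUM
    return (0.97 ** m) * math.cos(2.0 * math.pi * freq_hz * m / FS)


def h_front_cavity(n, m):
    """Hypothetical h_f[n, m]: short high-frequency resonance, fixed in n."""
    return (0.85 ** m) * math.cos(2.0 * math.pi * 3000.0 * m / FS)


def time_varying_output(source, response):
    """x[n] = sum over m of response(n, m) * source[n - m] (causal case)."""
    return [sum(response(n, m) * source[n - m] for m in range(n + 1))
            for n in range(len(source))]


# Periodic source u[n] = g[n] * p[n]: one assumed raised-cosine pulse
# of 40 samples at the start of every pitch period.
pulse = [0.5 * (1.0 - math.cos(2.0 * math.pi * k / 40)) for k in range(40)]
u = [pulse[n % P] if n % P < len(pulse) else 0.0 for n in range(NUM)]

# Burst source idealized as a unit impulse at n = 0.
burst = [1.0] + [0.0] * (NUM - 1)

voiced_part = time_varying_output(u, h_vocal_tract)
burst_part = time_varying_output(burst, h_front_cavity)
x = [a + b for a, b in zip(voiced_part, burst_part)]
```

If the response function ignores n, `time_varying_output` reduces to ordinary convolution, and the burst term collapses to `h_front_cavity(n, n)`, as the impulse in the second sum suggests.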

April 22, 2023 Veton Këpuska 95

Elements of a Language Transitional Speech Sounds
  • Diphthongs
    • Vowel-like nature, with vibrating vocal folds.
    • Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
    • Characterized by movement from one vowel target to another.
    • Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".

April 22, 2023 Veton Këpuska 96

Elements of a Language Transitional Speech Sounds
  • Semi-vowels: two categories of vowel-like sounds:
    • Glides (w as in "we" and y as in "you"), and
    • Liquids (r as in "read" and l as in "let").
  • Glides, compared to diphthongs, have:
    • Greater constriction of the oral tract during the transition, and
    • Greater speed of the oral tract movement.

April 22, 2023 Veton Këpuska 97

Figure 3.28 - Liquids & Glides

April 22, 2023 Veton Këpuska 98

Elements of a Language Transitional Speech Sounds
  • Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have:
    • A fricative portion preceded by a complete constriction of the oral cavity,
    • Formed at the same place as for the plosive.
  • Examples:
    • tS as in "chew" - unvoiced
    • J as in "just" - voiced

April 22, 2023 Veton Këpuska 99

Coarticulation
  • Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
    • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
    • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  • Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
  • Coarticulation can occur on different temporal levels:
    • Local - articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: "horse" vs. "horseshoe", "sweep" vs. "seep".
    • Global - articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

April 22, 2023 Veton Këpuska 100

Prosody The Melody of Speech
  • The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
    • Intonation (change in pitch)
    • Amplitude/Energy (loudness)
    • Timing (articulation rate or rhythm)
  • These rules are followed to convey different:
    • Meaning,
    • Stress, and
    • Emotion.

April 22, 2023 Veton Këpuska 101

Figure 3.29 - Prosody

April 22, 2023 Veton Këpuska 102

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation

April 22, 2023 Veton Këpuska 72

Categorization of Speech Sounds
  • A sound source can be created with either the vocal folds or a constriction in the vocal tract.
  • Classification of speech sounds can also be done from the following perspectives:
    1. The nature of the source: periodic, noisy, impulsive, or a combination of the three.
    2. The shape of the vocal tract (place and manner of articulation):
       • The place of the tongue hump along the oral tract, and
       • The degree of constriction of the hump.
       • The shape is also determined by a possible connection to the nasal passage by way of the velum.
    3. The time-domain waveform, which gives the pressure change with time at the lips output.
    4. The time-varying spectral characteristics revealed through the spectrogram.

April 22, 2023 Veton Këpuska 73

Elements of a Language
  • Phoneme: a fundamental distinctive unit of a language.
    • To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
  • Different languages contain different phoneme sets.
    • Syllables contain one or more phonemes.
    • Words are formed from one or more syllables.
    • Phrases are concatenations of words.
  • If the first two descriptors (the source and the vocal tract shape) are used to study speech sounds, this is referred to as articulatory phonetics.
  • If the last two descriptors (the waveform and the spectrogram) are used to study speech sounds, this is referred to as acoustic phonetics.

April 22, 2023 Veton Këpuska 74

Elements of a Language
  • One broad classification for the English language is done in terms of:
    • Vowels
    • Consonants
    • Diphthongs
    • Affricates, and
    • Semi-vowels
  • This classification is illustrated in Figure 3.17 in the next slide.

April 22, 2023 Veton Këpuska 75

Figure 3.17

April 22, 2023 Veton Këpuska 76

Elements of a Language
  • Phonemes arise from a combination of vocal fold and vocal tract articulatory features.
  • Articulatory features (corresponding to the first 2 category descriptors) include:
    • Vocal fold state: vibrating or open
    • Tongue position and height: front, central, or back along the palate
    • Constriction: partial or complete
    • Velum state: nasal sound or not a nasal sound

April 22, 2023 Veton Këpuska 77

Elements of a Language
  • In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
    • 11 in Polynesian
    • 141 in the "click" language of Khoisan
  • The rules of a language define which phones can be strung together, and how, to form words.
    • In Italian, consonants are not allowed at the end of words.
  • A phoneme is not strictly defined by the precise adjustment of the articulators (dialects and accents).
  • The articulatory properties are influenced by:
    • Adjacent phonemes
    • Speaking rate
    • Emphasis in speaking, and
    • The time-varying nature of the articulators
  • The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme.
    • Example: "butter", "but", and "to", where the t in each word is somewhat different.
  • Motor theory of perception: uses articulatory features derived from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

April 22, 2023 Veton Këpuska 78

Elements of a Language Vowels
  • Source: quasi-periodic.
    • Pitch is not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.
  • System: each vowel phoneme corresponds to a different vocal tract configuration.
  • Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
  • Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).
  • In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
    • Articulatory differences among speakers are one cause of allophonic variations:
      • The place and degree of constriction of the tongue hump, and
      • The cross-section and length of the vocal tract.
    • Therefore the vocal tract formants vary with the speaker.
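To get a rough feel for how vocal tract length alone shifts the formants, one can use the textbook idealization of the tract as a lossless uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). The Python sketch below is a minimal illustration under that assumption; the speed of sound and the two tract lengths are illustrative values, not measurements from the slides, and real vowels depart from the uniform-tube case because the cross-section varies along the tract.

```python
SPEED_OF_SOUND = 350.0  # m/s, assumed value for warm, moist air


def uniform_tube_formants(tract_length_m, count=3):
    """Resonances (Hz) of a lossless uniform tube closed at one end and
    open at the other: odd multiples of c / (4 L)."""
    return [(2 * k - 1) * SPEED_OF_SOUND / (4.0 * tract_length_m)
            for k in range(1, count + 1)]


# Two hypothetical speakers that differ only in vocal tract length.
for label, length_m in [("longer tract (17.5 cm)", 0.175),
                        ("shorter tract (14 cm)", 0.14)]:
    formants = uniform_tube_formants(length_m)
    print(label, [round(f) for f in formants])
```

In this idealization every formant scales inversely with the tract length, so the shorter tract has uniformly higher resonances for the same articulatory configuration.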


Page 73: Speech Processing

April 22 2023 Veton Keumlpuska 73

Elements of a Language Phoneme ndash a fundamental distinctive unit of a

language To emphasize the distinction between the concept of a

phoneme and sounds that convey a phoneme speech scientist use the term phone to mean a particular instantiation of a phoneme

Different languages contain different phoneme sets Syllables contain one or more phonemes Words are formed from one or more syllables Phrases are concatenation of words

If first two factors are used to study speech sounds then this is referred to as articulatory phonetics

If last two descriptors are used to study the speech sounds then this is referred to as acoustic phonetics

April 22 2023 Veton Keumlpuska 74

Elements of a Language One broad classification for English language is done in

terms of Vowels Consonants Diphthongs Affricates and Semi-vowels

In the next slide this classification is illustrated in Figure 317

April 22 2023 Veton Keumlpuska 75

Figure 317

April 22 2023 Veton Keumlpuska 76

Elements of a Language Phonemes arise from a combination of vocal fold and vocal

tract articulatory features Articulatory features (corresponding to the first 2 category

descriptors) include Vocal fold state

Vibrating or Open

Tongue position and height Front Central Back along the palate

Constriction Partial Complete

Velum state Nasal sound Not a nasal sound

April 22 2023 Veton Keumlpuska 77

Elements of a Language In English the combinations of features are such to give 40 phonemes Other languages can yield a smallerlarger number

11 in Polynesian 141 in the ldquoclickrdquo language of Khosian

Rules of a language define which phones can be stringed together and how to form words In Italian consonants are not allowed at the end of words

A phoneme is not strictly defined by the precise adjustment of articulators (dialects and accents)

The articulatory properties are influenced by Adjacent phonemes Speaking rate Emphasis in speaking and Time-varying nature of the articulators

The variants of sounds or phones that convey the same phoneme are called the allophones of the phoneme Example ldquobutterrdquo ldquobutrdquo and ldquotordquo were t in each word is somewhat different

Motor theory of perception ndash uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language

April 22 2023 Veton Keumlpuska 78

Elements of a Language Vowels Vowels

Source quasi-periodic Pitch (not important to categorize a sound in English however in

Mandarin Chinese language some sounds are interpreted based on the pitch ndash tonal languages)

System Each vowel phoneme corresponds to a different vocal tract

configuration Spectrogram

The particular shape of the vocal tract determines its resonances (concentrations of energies in the spectrogram)

Waveform Certain vowels properties are also seen in the speech waveform within a

pitch period (see Figure 319 in the slide after next)

In spite of the specific properties of different vowels there is much variability of vowel characteristics among speakers Articulatory differences in speakers is one cause of allophonic

variations The place and degree of constriction of the tongue hump and Cross-section and length of vocal tract =gt And therefore the vocal tract formants will vary with speaker

April 22 2023 Veton Keumlpuska 79

Figure 318

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

u[n] = g[n] * p[n]

where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.

In a simplified model, the periodic component of the voiced fricative output at the lips is

xg[n] = h[n] * (g[n] * p[n])

where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response hf[n]:

xq[n] = hf[n] * (q[n]u[n])


Example 3.4 (continued)

In the simplified model we assume that the results of the two airflow sources add (here * denotes convolution):

x[n] = xg[n] + xq[n]
     = h[n] * u[n] + hf[n] * (q[n]u[n])

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
u[n] is modified by the oral cavity.
xq[n] can be influenced by the back cavity.
Sources of non-linear effects (distributed sources due to traveling vortices).
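The two-source model of Example 3.4 can be sketched numerically. The Python below is a minimal illustration, not the textbook's implementation: the glottal pulse shape, the two decaying-cosine impulse responses, the pitch period of 80 samples, and all function names are assumptions chosen only to make the structure x[n] = h[n] * u[n] + hf[n] * (q[n]u[n]) visible.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def impulse_train(length, period):
    """p[n]: a unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def decaying_cosine(freq, decay, length):
    """Toy impulse response (assumed shape): decay**n * cos(freq * n)."""
    return [(decay ** n) * math.cos(freq * n) for n in range(length)]


def voiced_fricative(length=400, period=80, seed=0):
    """Return (u, xg, xq, x) for the simplified two-source model."""
    rng = random.Random(seed)
    g = [math.sin(math.pi * n / 20) ** 2 for n in range(20)]   # one glottal cycle (assumed)
    p = impulse_train(length, period)
    u = convolve(g, p)[:length]                                # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]           # white noise q[n]
    h = decaying_cosine(0.3, 0.95, 60)                         # vocal tract h[n] (assumed)
    hf = decaying_cosine(2.0, 0.80, 30)                        # front cavity hf[n] (assumed)
    xg = convolve(h, u)[:length]                               # h[n] * u[n]
    modulated = [qn * un for qn, un in zip(q, u)]              # q[n] u[n]
    xq = convolve(hf, modulated)[:length]                      # hf[n] * (q[n] u[n])
    x = [a + b for a, b in zip(xg, xq)]                        # x[n] = xg[n] + xq[n]
    return u, xg, xq, x


u, xg, xq, x = voiced_fricative()
print(len(x), max(abs(v) for v in xq))
```

Because q[n] is multiplied by u[n], the noise component in this sketch appears only while the glottis is open and repeats at the pitch rate. That is the kind of behaviour the modulation term in the model is meant to capture.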


Elements of a Language: Fricatives

Spectrogram: Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: An unvoiced fricative contains only noise. A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar". After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the onset of voicing of the vowel.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

u[n] = g[n] * p[n]

where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator. It must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n,m], while the burst excites a time-varying front cavity beyond the constriction with impulse response hf[n,m].

h[n,m] and hf[n,m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n-m.

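The output can then be written using a generalization of the convolution operator:

x[n] = \sum_{m} h[n,m]\,u[n-m] + \sum_{m} h_f[n,m]\,\delta[n-m]

where \delta[n] is the idealized burst impulse, h_f is the front-cavity response, and we have assumed that the two outputs can be linearly combined.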
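The generalized convolution of Example 3.5 can be evaluated directly, sample by sample. The sketch below is a minimal Python illustration under stated assumptions: the responses are passed in as functions h(n, m) and hf(n, m), they are taken to be causal with a finite memory of `memory` samples, and the function and parameter names are hypothetical. Since the burst is δ[n], the second sum keeps only the term m = n, so it reduces to hf[n, n].

```python
def time_varying_response(u, h, hf, memory):
    """Evaluate x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) delta[n-m].

    u      : periodic glottal source samples, u[0..N-1]
    h, hf  : callables giving the response at time n to a unit sample
             applied m samples earlier (assumed zero for m >= memory)
    memory : assumed finite length of both time-varying responses
    """
    x = []
    for n in range(len(u)):
        # Periodic branch: only m = 0..n contributes because u starts at n = 0.
        periodic = sum(h(n, m) * u[n - m] for m in range(min(memory, n + 1)))
        # Burst branch: delta[n - m] is nonzero only at m = n.
        burst = hf(n, n) if n < memory else 0.0
        x.append(periodic + burst)
    return x


# Hypothetical responses: the vocal tract decay slows as the vowel settles,
# while the front-cavity response stays short.
def h(n, m):
    return min(0.5 + 0.01 * n, 0.9) ** m


def hf(n, m):
    return 0.6 ** m


u = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]
x = time_varying_response(u, h, hf, memory=16)
print([round(v, 3) for v in x[:8]])
```

When h(n, m) does not depend on n, the first sum collapses to an ordinary convolution, which is a convenient sanity check on any implementation of this model.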


Elements of a Language: Transitional Speech Sounds

Diphthongs

Vowel-like in nature, with vibrating vocal folds.

They do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.

Characterized by movement from one vowel target to another.

Examples: Y as in "hide", W as in "out", O as in "boy", JU as in "new".
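The movement from one vowel target to another can be pictured as a formant trajectory. The following Python sketch linearly interpolates (F1, F2) pairs between two targets. The linear path, the function name, and the formant values in Hz are all illustrative assumptions (rough textbook-style averages for an "a"-like and an "i"-like vowel), not measurements from the slides.

```python
def glide_formants(start, end, steps):
    """Linearly interpolate formant targets from `start` to `end`.

    start, end : tuples of formant frequencies in Hz, e.g. (F1, F2)
    steps      : number of frames in the transition (must be >= 2)
    """
    track = []
    for k in range(steps):
        t = k / (steps - 1)          # 0.0 at the first target, 1.0 at the second
        track.append(tuple((1 - t) * s + t * e for s, e in zip(start, end)))
    return track


# Assumed targets for a diphthong such as the one in "hide".
a_like = (730.0, 1090.0)
i_like = (270.0, 2290.0)
for f1, f2 in glide_formants(a_like, i_like, steps=5):
    print(f"F1={f1:.0f} Hz  F2={f2:.0f} Hz")
```

Real formant transitions are smooth but not necessarily linear. The sketch only shows that no single frame holds a steady vowel configuration.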


Elements of a Language: Transitional Speech Sounds

Semi-Vowels

Two categories of vowel-like sounds:
Glides (w as in "we" and y as in "you") and
Liquids (r as in "read" and l as in "let").

Glides: Compared to diphthongs, glides have a greater constriction of the oral tract during the transition and a greater speed of oral tract movement.

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples: tS as in "chew" (unvoiced); J as in "just" (voiced).


Coarticulation

Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.

Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.

Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:

Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".

Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
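The remark that a target is often never reached can be illustrated with a toy model in which an articulator closes a fixed fraction of the remaining distance to its target at every time step. This first-order lag, the `rate` value, and the function name are assumptions made for illustration. The model captures only the carry-over effect of past positions, not the anticipatory effect described above.

```python
def articulator_track(targets, rate=0.2, start=0.0):
    """Simulate an articulator chasing a sequence of targets.

    targets : list of (target_position, duration_in_steps) pairs
    rate    : assumed fraction of the remaining distance covered per step
    start   : initial articulator position
    """
    position = start
    track = []
    for target, duration in targets:
        for _ in range(duration):
            position += rate * (target - position)   # move part of the way
            track.append(position)
    return track


# Fast speech: only 5 steps per phoneme, so the target 1.0 is undershot.
fast = articulator_track([(1.0, 5), (0.0, 5)])
# Slow speech: 50 steps, so the target is essentially reached.
slow = articulator_track([(1.0, 50)])
print(round(max(fast), 3), round(slow[-1], 5))
```

Shortening the durations increases the undershoot, which is one simple way to picture why speaking rate changes the articulatory properties of a phoneme.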


Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
Intonation (change in pitch)
Amplitude/Energy (loudness)
Timing (articulation rate or rhythm)

These rules are followed to convey differences in meaning, stress, and emotion.
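Of the three prosodic cues, the amplitude/energy contour is the simplest to compute. The Python sketch below measures short-time energy over consecutive frames. The frame length, hop size, and function name are assumptions, and a real analysis would normally apply a window and work on recorded speech rather than the toy signal used here.

```python
def short_time_energy(signal, frame_len, hop):
    """Sum of squared samples in each frame of `frame_len` samples,
    advancing `hop` samples between frames."""
    return [
        sum(s * s for s in signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, hop)
    ]


# Toy signal: a quiet segment followed by a louder (stressed) segment.
signal = [1.0, -1.0, 1.0, -1.0, 2.0, -2.0, 2.0, -2.0]
print(short_time_energy(signal, frame_len=4, hop=4))
```

Plotting such a contour over a whole utterance gives a rough picture of where stress falls. Intonation and timing need separate pitch and duration measurements.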

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation


Elements of a Language

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels.

This classification is illustrated in Figure 3.17 on the next slide.

Figure 3.17

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first 2 category descriptors) include:

Vocal fold state: vibrating or open
Tongue position and height: front, central, or back along the palate
Constriction: partial or complete
Velum state: nasal sound or not a nasal sound


Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together to form words, and how. In Italian, consonants are not normally allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by adjacent phonemes, speaking rate, emphasis in speaking, and the time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform and its acoustic temporal and spectral features to study the sounds in a language.


Elements of a Language: Vowels

Source: Quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on the pitch.

System: Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variations: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker, and therefore the vocal tract formants vary with the speaker.
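The dependence of formants on vocal tract length can be made concrete with the standard idealization of the tract as a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). The sketch below assumes that idealization together with a speed of sound of 350 m/s. The function name and the two tract lengths are illustrative and not taken from this slide.

```python
def uniform_tube_formants(length_m, count=3, speed_of_sound=350.0):
    """Resonances F_k = (2k - 1) * c / (4 * L) of a uniform tube that is
    closed at one end and open at the other (quarter-wavelength resonator)."""
    return [
        (2 * k - 1) * speed_of_sound / (4.0 * length_m)
        for k in range(1, count + 1)
    ]


# Two assumed vocal tract lengths: 17.5 cm and a shorter 14 cm tract.
for length in (0.175, 0.14):
    formants = uniform_tube_formants(length)
    print(f"L = {length * 100:.1f} cm:", [round(f) for f in formants])
```

In this idealization a shorter tract scales every resonance upward by the same factor, which is one reason the same vowel shows different formant frequencies for different speakers.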

Figure 3.18

Figure 3.19




Elements of a Language Transitional Speech Sounds Semi-Vowels Two categories of vowel like

sounds Glides (w as in ldquowerdquo and y as in ldquoyourdquo) and Liquids (r as in ldquoreadrdquo and l as in ldquoletrdquo)

Glides Greater constriction of oral tract during the

transition and Greater speed of the oral tract movement

compared to diphthongs

April 22 2023 Veton Keumlpuska 97

Figure 328 ndash Liquids amp Glides

April 22 2023 Veton Keumlpuska 98

Elements of a Language Transitional Speech Sounds Affricates are the counterpart of

diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that

the affricates have A fricative portion preceded by a complete

constriction of the oral cavity Formed at the same place as for the plosive

Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced

April 22 2023 Veton Keumlpuska 99

Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a

target state or shape often the target is never reached Our speech anatomy cannot move to a desired position

instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and

graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level

Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo

Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes

April 22 2023 Veton Keumlpuska 100

Prosody The Melody of Speech Prosody of a language is defined by the

rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)

These rules are followed to convey different Meaning Stress and Emotion

April 22 2023 Veton Keumlpuska 101

Figure 329 - Prosody

April 22 2023 Veton Keumlpuska 102

Figure 330 ndash Global Coarticulation


April 22 2023 Veton Këpuska 76

Elements of a Language

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
  • Vocal fold state: vibrating or open
  • Tongue position and height: front, central, or back along the palate
  • Constriction: partial or complete
  • Velum state: nasal sound or not a nasal sound

Elements of a Language

In English, the combinations of features are such as to give 40 phonemes. Other languages can yield a smaller or larger number:
  • 11 in Polynesian
  • 141 in the "click" language of Khoisan

Rules of a language define which phones can be strung together, and how, to form words. In Italian, consonants are generally not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents).

The articulatory properties are influenced by:
  • Adjacent phonemes
  • Speaking rate
  • Emphasis in speaking
  • The time-varying nature of the articulators

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where the t in each word is somewhat different.

Motor theory of perception: uses articulatory features from the speech waveform, along with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

Source: Quasi-periodic. Pitch is not important for categorizing a sound in English; however, in tonal languages such as Mandarin Chinese, some sounds are interpreted based on pitch.

System: Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19 in the slide after next).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers.
  • Articulatory differences among speakers are one cause of allophonic variations.
  • The place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, differ from speaker to speaker; therefore the vocal tract formants vary with the speaker.

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: Quasi-periodic airflow puffs from the vibrating vocal folds.

System:
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
  • Is dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract. The constriction is narrower than with vowels.

System: The location of the constriction determines which sound is produced. The constriction can be made by the tongue at the back, center, or front of the oral tract, as well as at the teeth or lips.

Spectrogram: Noise-like. Energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

Voiced fricative, simplified model of the output at the lips due to the periodic source:

x_g[n] = h[n] * (g[n] * p[n])

where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] (assumed to be white noise). The glottal flow u[n] modulates this noise function q[n], and the product in turn excites the front oral cavity, which has impulse response h_f[n]:

x_q[n] = h_f[n] * (q[n] u[n])

We assume in this simplified model that the results of the two airflow sources add:

x[n] = x_g[n] + x_q[n]
     = h[n] * u[n] + h_f[n] * (q[n] u[n])

See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  • u[n] is modified by the oral cavity.
  • x_q[n] can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
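As a rough illustration of the additive model in Example 3.4, the following Python sketch builds x[n] from a periodic glottal flow and a modulated noise source. The pulse shape g, the impulse responses h and h_f, the Gaussian noise, and the names convolve and voiced_fricative are all assumptions chosen for illustration; none of them come from the slides.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, h, h_f, pitch_period, n_samples, seed=0):
    """Return (u, x): glottal flow u[n] and modeled output x[n]."""
    rng = random.Random(seed)
    # p[n]: impulse train with pitch period P
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    # u[n] = g[n] * p[n] (convolution), truncated to n_samples
    u = convolve(g, p)[:n_samples]
    # q[n]: white noise (Gaussian here, an assumption), modulated by u[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    qu = [qn * un for qn, un in zip(q, u)]
    x_g = convolve(h, u)[:n_samples]     # periodic component
    x_q = convolve(h_f, qu)[:n_samples]  # noise component
    x = [a + b for a, b in zip(x_g, x_q)]
    return u, x


# Hypothetical toy shapes, for illustration only.
g = [0.5 * (1 - math.cos(2 * math.pi * k / 20)) for k in range(20)]  # one glottal pulse
h = [0.9 ** k for k in range(30)]  # decaying "vocal tract" response
h_f = [1.0, -0.8]                  # crude high-pass "front cavity"
u, x = voiced_fricative(g, h, h_f, pitch_period=40, n_samples=200)
```

Because the noise is multiplied by u[n], the noise component is nonzero only while the glottal flow is nonzero, which is the modulation described above.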

Elements of a Language: Fricatives

Spectrogram: Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: An unvoiced fricative contains only noise. A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper forms a class of its own under the general category of consonants. Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure
2. Release of air pressure and generation of turbulence over a very short time duration
3. Generation of aspiration due to turbulence at the open vocal folds
4. Onset of the following vowel about 40-50 ms after the burst

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
  • During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

u[n] = g[n] * p[n]

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator. The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n, m], while the burst excites a time-varying front cavity beyond a constriction, denoted by h_f[n, m].

h[n, m] and h_f[n, m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.

The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$

We have assumed that the two outputs can be linearly combined.
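The generalized convolution of Example 3.5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the responses are taken to be causal (zero for m < 0) and the inputs to start at n = 0, so the sum over m runs from 0 to n. The function name time_varying_output and the toy responses h and h_f are hypothetical and are not taken from the slides.

```python
def time_varying_output(h, h_f, u, burst):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m h_f(n, m) burst[n-m].

    h and h_f are callables (n, m) -> float giving the response at time n
    to a unit sample applied m samples earlier. Causality is assumed,
    so only m >= 0 contributes.
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):  # inputs start at n = 0, so m runs 0..n
            acc += h(n, m) * u[n - m] + h_f(n, m) * burst[n - m]
        x.append(acc)
    return x


# Hypothetical toy responses: the vocal tract decay rate drifts with n,
# standing in for the changing tract shape after the burst.
def h(n, m):
    a = 0.5 + 0.4 * min(n, 50) / 50  # decay factor drifts from 0.5 to 0.9
    return a ** m


def h_f(n, m):
    return (-0.6) ** m  # fixed front-cavity response, for simplicity


N = 100
burst = [1.0] + [0.0] * (N - 1)                      # idealized burst: delta[n]
u = [1.0 if n % 25 == 0 else 0.0 for n in range(N)]  # simplified periodic source
x = time_varying_output(h, h_f, u, burst)
```

When h(n, m) does not depend on n, the inner sum reduces to ordinary convolution, which is a quick way to check the implementation.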

Elements of a Language: Transitional Speech Sounds

Diphthongs:
  • Have a vowel-like nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Are characterized by movement from one vowel target to another.

Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU)
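One way to picture the movement from one vowel target to another is as a formant track that glides between two sets of resonances. The sketch below uses plain linear interpolation, which is an assumption (the slides only say the transition is smooth). The function name glide_track and the (F1, F2) target values are hypothetical illustrative numbers, not measurements.

```python
def glide_track(start, end, n_frames):
    """Linearly interpolate each formant from the start target to the end target."""
    if n_frames < 2:
        raise ValueError("need at least two frames")
    track = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        track.append(tuple((1 - t) * s + t * e for s, e in zip(start, end)))
    return track


# Hypothetical (F1, F2) targets in Hz for the two vowel configurations.
start_target = (700.0, 1200.0)
end_target = (300.0, 2200.0)
track = glide_track(start_target, end_target, 5)
```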

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
  • Glides (w as in "we" and y as in "you")
  • Liquids (r as in "read" and l as in "let")

Glides, compared to diphthongs, have:
  • Greater constriction of the oral tract during the transition
  • Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
  • tS as in "chew" (unvoiced)
  • J as in "just" (voiced)

Coarticulation

Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
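The idea that an articulator seeks a target but often never reaches it can be illustrated with a toy first-order lag, in which each frame moves only a fraction of the remaining distance to the current target. This is a deliberately simple assumption, not a model from the slides: it captures only the influence of past positions, not anticipation of future phonemes. The name articulator_track and the parameter alpha are hypothetical.

```python
def articulator_track(targets, alpha, start=0.0):
    """Move a fraction alpha of the remaining distance to the target each frame."""
    pos = start
    track = []
    for target in targets:
        pos += alpha * (target - pos)
        track.append(pos)
    return track


# A short segment (3 frames at target 1.0) followed by a return toward 0.0:
# the articulator turns back before ever reaching the target.
short_segment = articulator_track([1.0] * 3 + [0.0] * 3, alpha=0.5)

# A long segment gives the articulator time to approach the target closely.
long_segment = articulator_track([1.0] * 20, alpha=0.5)
```

In the short segment the position peaks below the target, a toy version of the undershoot that a fast speaking rate can produce.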

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that define changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.
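Two of the prosodic quantities listed above can be tracked frame by frame with very simple measurements. The sketch below computes a short-time energy contour as a rough loudness correlate, and estimates the pitch period of a frame from the peak of its autocorrelation. Both are basic textbook-style estimators chosen for illustration. The function names, frame sizes, and lag range are assumptions, not taken from the slides.

```python
def short_time_energy(x, frame_len, hop):
    """Sum of squared samples in each frame of length frame_len."""
    return [sum(s * s for s in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, hop)]


def pitch_period(frame, min_lag, max_lag):
    """Lag in samples that maximizes the autocorrelation within [min_lag, max_lag]."""
    def acorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
    return max(range(min_lag, max_lag + 1), key=acorr)


# Toy input: an impulse train with a period of 10 samples.
train = [1.0 if n % 10 == 0 else 0.0 for n in range(100)]
energy = short_time_energy(train, frame_len=20, hop=20)
period = pitch_period(train, min_lag=5, max_lag=30)
```

Tracking the estimated period across successive frames gives an intonation contour, and the energy contour gives the corresponding amplitude pattern.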

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation



Page 78: Speech Processing

Elements of a Language: Vowels

Source: quasi-periodic.

Pitch: not important for categorizing a sound in English; in tonal languages such as Mandarin Chinese, however, some sounds are interpreted based on the pitch.

System: Each vowel phoneme corresponds to a different vocal tract configuration.

Spectrogram: The particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).

Waveform: Certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, vowel characteristics vary considerably among speakers. Articulatory differences between speakers are one cause of allophonic variations:
  • The place and degree of constriction of the tongue hump
  • The cross-section and length of the vocal tract
=> The vocal tract formants therefore vary with the speaker.
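To make the dependence of formants on vocal tract length concrete, here is a minimal sketch that assumes the vocal tract is a uniform lossless tube, closed at the glottis and open at the lips. That quarter-wavelength model, the speed of sound value, and the two tract lengths are illustrative assumptions, not values taken from the slides.

```python
# Sketch: resonances of a uniform lossless tube closed at one end and open
# at the other (quarter-wavelength resonator). This idealized model is an
# assumption used for illustration only.

SPEED_OF_SOUND_CM_PER_S = 35000.0  # approximate value, assumed


def tube_formants(length_cm, count=3):
    """Return the first `count` resonances in Hz: F_k = (2k - 1) * c / (4 * L)."""
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return [(2 * k - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
            for k in range(1, count + 1)]


if __name__ == "__main__":
    # Hypothetical vocal tract lengths for two speakers.
    for label, length_cm in [("longer tract (17.5 cm)", 17.5),
                             ("shorter tract (14.0 cm)", 14.0)]:
        print(label, [round(f) for f in tube_formants(length_cm)])
```

Under this model a shorter tract shifts every resonance upward by the same factor, which is one simple way to see why formants vary with the speaker.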

Figure 3.18

Figure 3.19

Elements of a Language: Nasals

Source: Quasi-periodic airflow puffs from the vibrating vocal folds.

System:
  • The velum is lowered and the air flows mainly through the nasal cavity.
  • Because the oral tract is constricted, the sound is radiated at the nostrils.
  • Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
  • Dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue.
  • These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
  • For unvoiced fricatives, the vocal folds are relaxed and not vibrating.
  • For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  • Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  • The constriction is narrower than with vowels.

System: The location of the constriction by the tongue/lips determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.

Spectrogram: Noise-like; energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle, $p[n]$ is an impulse train with pitch period $P$, and $*$ denotes convolution.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract excited by the periodic signal $u[n]$.

The noise source component is the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

We assume in this simplified model that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
  • $u[n]$ is modified by the oral cavity.
  • $x_q[n]$ can be influenced by the back cavity.
  • Sources of non-linear effects (distributed sources due to traveling vortices).
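The additive model of Example 3.4 can be sketched numerically as follows. Only the structure (periodic glottal flow, noise modulated by the glottal flow, two filters, and a sum) comes from the example; the raised-cosine pulse shape, the damped-cosine impulse responses, the sampling rate, and all numeric values are hypothetical choices made so the sketch runs.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def damped_cosine(freq_hz, decay, fs, length):
    """Toy impulse response (assumed shape): exponentially damped cosine."""
    return [math.exp(-decay * n / fs) * math.cos(2 * math.pi * freq_hz * n / fs)
            for n in range(length)]


def voiced_fricative(n_samples=800, fs=8000, pitch_period=80, seed=0):
    """Return (u, xg, xq, x) for the simplified voiced-fricative model."""
    rng = random.Random(seed)
    # g[n]: one glottal pulse, here a raised-cosine bump (hypothetical shape).
    g = [0.5 * (1.0 - math.cos(2 * math.pi * k / 40)) for k in range(40)]
    # p[n]: impulse train with pitch period P = pitch_period samples.
    p = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    u = convolve(g, p)[:n_samples]                       # u[n] = g[n] * p[n]
    q = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # white noise q[n]
    h = damped_cosine(500.0, 300.0, fs, 200)             # vocal tract h[n]
    hf = damped_cosine(3000.0, 900.0, fs, 100)           # front cavity h_f[n]
    xg = convolve(h, u)[:n_samples]                      # x_g[n] = h[n] * u[n]
    modulated = [qn * un for qn, un in zip(q, u)]        # q[n] u[n]
    xq = convolve(hf, modulated)[:n_samples]             # x_q[n] = h_f[n] * (q[n] u[n])
    x = [a + b for a, b in zip(xg, xq)]                  # x[n] = x_g[n] + x_q[n]
    return u, xg, xq, x
```

Because the noise is multiplied by the glottal flow before filtering, the noise component in this sketch is zero wherever the glottal flow is zero, so the noise appears in pitch-synchronous bursts.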

Elements of a Language: Fricatives

Spectrogram: Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform: An unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

  • Forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps:
  • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator. It must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n, m]$. Here $h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

The output can then be written using a generalization of the convolution operator, where we have assumed that the two outputs can be linearly combined:

$$x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$$
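The generalized convolution sum of Example 3.5 can be evaluated directly once the two time-varying responses are given as functions of (n, m). The sketch below assumes causal responses truncated to a finite lag range; the function name `time_varying_output`, the `max_lag` parameter, and the example responses are hypothetical.

```python
def time_varying_output(u, h, hf, max_lag):
    """Evaluate x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) delta[n - m].

    `h` and `hf` are callables returning the response at time n to a unit
    sample applied m samples earlier. Only lags 0..max_lag are used
    (causality and finite memory are assumptions of this sketch).
    """
    x = []
    for n in range(len(u)):
        periodic = sum(h(n, m) * u[n - m]
                       for m in range(0, min(n, max_lag) + 1))
        # delta[n - m] is nonzero only at m == n, so the burst term
        # reduces to hf(n, n) whenever that lag is within range.
        burst = hf(n, n) if n <= max_lag else 0.0
        x.append(periodic + burst)
    return x


if __name__ == "__main__":
    # Hypothetical responses: a decay factor that drifts over time for the
    # vocal tract, and a fixed fast decay for the front cavity.
    u = [1.0 if n % 4 == 0 else 0.0 for n in range(12)]
    h = lambda n, m: (0.5 + 0.03 * n) ** m
    hf = lambda n, m: 0.3 ** m
    print(time_varying_output(u, h, hf, max_lag=11))
```

When `h` does not depend on n, the first sum reduces to ordinary convolution, which gives a quick sanity check on the implementation.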

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration; they are produced by smoothly varying the vocal tract in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
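The movement from one vowel target to another can be pictured as a trajectory of formant frequencies. The sketch below uses plain linear interpolation, which is only the simplest possible choice of a smooth transition; the function name and the two (F1, F2) targets are illustrative assumptions, not measurements from the slides.

```python
def formant_trajectory(start, end, steps):
    """Linearly interpolate formant targets from `start` to `end`.

    `start` and `end` are tuples of formant frequencies in Hz; the result
    is a list of `steps` tuples, including both endpoints.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    track = []
    for i in range(steps):
        t = i / (steps - 1)
        track.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return track


if __name__ == "__main__":
    # Hypothetical (F1, F2) targets for an open vowel gliding toward a
    # close front vowel, roughly as in "hide".
    for f1, f2 in formant_trajectory((730.0, 1090.0), (270.0, 2290.0), steps=5):
        print(round(f1), round(f2))
```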

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
  • Glides (w as in "we" and y as in "you")
  • Liquids (r as in "read" and l as in "let")

Glides, compared to diphthongs, have
  • Greater constriction of the oral tract during the transition
  • Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have
  • A fricative portion preceded by a complete constriction of the oral cavity
  • The constriction formed at the same place as for the plosive

Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).

Coarticulation

  • Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
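One way to picture a target that is never reached is a toy model in which an articulator closes a fixed fraction of the remaining distance to its current target at each time step. This first-order lag is purely an illustrative assumption: it captures only the influence of past positions, not anticipation, and the names and numbers are hypothetical.

```python
def track_targets(targets, hold, rate, start=0.0):
    """Move a position toward each target in turn.

    Each target is held for `hold` steps; at every step the position
    closes a fraction `rate` (0 < rate < 1) of the remaining distance.
    """
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must be between 0 and 1")
    position = start
    path = []
    for target in targets:
        for _ in range(hold):
            position += rate * (target - position)
            path.append(position)
    return path


if __name__ == "__main__":
    # With a short hold the articulator undershoots the target at 1.0
    # before it has to head back toward 0.0.
    for value in track_targets([1.0, 0.0], hold=3, rate=0.3):
        print(round(value, 3))
```

Lengthening `hold` (slower speech in this toy picture) lets the position get arbitrarily close to the target, while a short hold leaves it well short.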

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/Energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.
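Of the three prosodic cues listed, the amplitude/energy contour is the easiest to compute. A minimal sketch, assuming the speech signal is available as a list of samples; the function name, frame length, and hop size are illustrative choices:

```python
def short_time_energy(signal, frame_len, hop):
    """Return the sum of squared samples for each full frame.

    Frames start every `hop` samples; a trailing partial frame is dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [sum(s * s for s in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, hop)]


if __name__ == "__main__":
    # Toy signal: a quiet stretch followed by a louder one.
    samples = [0.1, -0.1] * 4 + [0.8, -0.8] * 4
    print(short_time_energy(samples, frame_len=4, hop=4))
```

Tracking this value across an utterance gives a rough loudness contour; intonation and timing would need a pitch estimator and segment boundaries, which this sketch does not attempt.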

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 80: Speech Processing

April 22 2023 Veton Keumlpuska 80

Figure 319

April 22 2023 Veton Keumlpuska 81

Elements of a Language Nasals Nasals

Source Quasi-periodic airflow puffs from the vibrating vocal folds

System The velum is lowered and the air flows mainly through the nasal cavity Because oral tract is being constricted the sound is radiated at the

nostrils Nasal consonants are distinguished by the place along the oral tract at

which the tongue makes a constriction (Figure 320) Spectrogram

Is dominated by the low resonance of the large volume of the nasal cavity

Closed oral cavity acts as a side branch with its own resonances that change with the place of constriction of the tongue These resonances absorb acoustic energy and thus are anti-resonances of the

vocal tract Anti-resonances of the oral tract tend to lie beyond the low-resonances of the

nasal tract Consequently nasals have very low energy in high-frequency range

April 22 2023 Veton Keumlpuska 82

Figure 320

April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language Fricatives There are two broad classes of fricatives

Voiced and Unvoiced

Source Vocal folds are relaxed and not vibrating for unvoiced fricatives Vocal folds are vibrating simultaneously with noise generation at the

constriction Noise is generated by turbulent airflow at some point of constriction

along the oral tract Constriction is narrower than with vowels

System The location of the constriction by the tongue lips determines which

sound is produced Back Center or Front of the oral tract as well as The teeth or lips

Spectrogram Noise like Energy is concentrated in higher frequencies

April 22 2023 Veton Keumlpuska 85

Example 34 A voiced fricative is generated with both a periodic and noise source The periodic glottal

flow component can be expressed as

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Voiced fricative simplified model of the output at the lips

xg[n] = h[n](g[n]p[n]) h[n] a linear time-invariant vocal tract with impulse response under periodic signal u[n]

Modeling the noise source component of the turbulent airflow velocity source at the constriction denoted by q[n] (assumed white noise) The glottal flow u[n] modulates this noise function q[n] which in turn excites the front oral cavity that has impulse response hf[n]

xq[n] = hf[n](q[n]u[n])

April 22 2023 Veton Keumlpuska 86

Example 34 We assume in simplified model that the results of the

two airflow sources add x[n] = xg[n] + xq[n]

= h[n]u[n] + hf[n](q[n]u[n])

See Exercise 310 for special characteristics of x[n]

Issues that have been ignored u[n] is modified by the oral cavity xq[n] can be influenced by the back cavity Sources of non-linear effects (distributed sources due

to traveling vortices)

April 22 2023 Veton Keumlpuska 87

Elements of a Language Fricatives Spectrogram

Unvoiced fricatives are characterized by a ldquonoisyrdquo spectrum while

Voiced fricatives often show both noise and harmonics

Waveform Unvoiced fricative contains only noise Voiced fricative contains noise superimposed on

quasi-periodic signal

Elements of a Language Whisper Whisper

Forms a class of its own under general category of Consonants

Turbulent flow is produced at the glottis rather than at the vocal tract constriction

Figure 3.24: Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps. During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar". After the release of the burst, unlike with the unvoiced plosive, there is little or no aspiration, and there is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives, Waveform

Elements of a Language: Plosives, Spectrogram

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $*$ denotes convolution, $g[n]$ is the glottal flow over one cycle, and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n, m]$, while the burst excites a time-varying front cavity beyond a constriction, with impulse response denoted by $h_f[n, m]$.

$h[n, m]$ and $h_f[n, m]$ represent time-varying impulse responses at time $n$ due to a unit sample applied $m$ samples earlier, at time $n - m$.

Assuming that the two outputs can be linearly combined, the output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m} h[n, m]\, u[n - m] + \sum_{m} h_f[n, m]\, \delta[n - m]$$
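The generalized convolution of Example 3.5, x[n] = Σ_m h[n, m] u[n − m] + Σ_m hf[n, m] δ[n − m], can be sketched directly as a double loop. This is a minimal sketch under stated assumptions: both systems are causal, both inputs are zero before n = 0, and the time-varying responses are passed as Python callables. The function name and the callable interface are hypothetical, not taken from the slides.

```python
def time_varying_output(h, hf, u, burst):
    """Evaluate x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) burst[n-m].

    h(n, m) and hf(n, m) give the response at time n to a unit sample
    applied m samples earlier (at time n - m). `u` and `burst` must have
    the same length and are taken to be zero before n = 0.
    """
    if len(u) != len(burst):
        raise ValueError("u and burst must have the same length")
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(n + 1):              # causal: only inputs at times 0..n
            acc += h(n, m) * u[n - m]       # periodic source through the vocal tract
            acc += hf(n, m) * burst[n - m]  # burst through the front cavity
        x.append(acc)
    return x


def unit_impulse(length):
    """The idealized burst source: delta[n]."""
    return [1.0] + [0.0] * (length - 1)
```

When `burst` is the unit impulse, the second sum collapses to hf(n, n), the front-cavity response at time n to the sample applied at time 0. When h(n, m) does not depend on n, the first sum reduces to ordinary convolution.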

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.

Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
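One simple way to picture the movement from one vowel target to another is to interpolate the formant frequencies of the two configurations over time. The sketch below uses linear interpolation, which is an assumption made for illustration; the slides only say that the transition is smooth. The formant values are rough placeholder numbers, not measurements from the source, and the function name is hypothetical.

```python
def formant_track(start_hz, end_hz, num_frames):
    """Move each formant linearly from its start target to its end target."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    track = []
    for k in range(num_frames):
        t = k / (num_frames - 1)   # 0.0 at the first frame, 1.0 at the last
        frame = tuple((1 - t) * s + t * e for s, e in zip(start_hz, end_hz))
        track.append(frame)
    return track


# Placeholder (F1, F2) targets in Hz for an open vowel gliding to a close front vowel.
START_TARGET = (700.0, 1200.0)
END_TARGET = (300.0, 2300.0)

track = formant_track(START_TARGET, END_TARGET, 11)
```

Each entry of `track` is one frame of (F1, F2) values; here F1 falls while F2 rises as the tract moves between the two configurations.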

Elements of a Language: Transitional Speech Sounds

Semi-Vowels: There are two categories of vowel-like sounds:
  • Glides (w as in "we" and y as in "you")
  • Liquids (r as in "read" and l as in "let")

Glides, compared to diphthongs, have
  • greater constriction of the oral tract during the transition, and
  • greater speed of the oral tract movement.

Figure 3.28: Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations. The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
  • tS as in "chew" (unvoiced)
  • J as in "just" (voiced)

Coarticulation

Vocal fold/vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached. Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present. Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, and so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. Coarticulation can occur on different temporal levels:
  • Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ("horse" vs. "horseshoe", "sweep" vs. "seep").
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
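The remark that a target is often never reached can be illustrated with a toy model in which an articulator moves only a fraction of the remaining distance toward its current target at each time step. This first-order lag is an assumption introduced here for illustration; it captures only the carryover from past positions, not the anticipation of future sounds, and it is not a model given in the slides. The function name and the value of `alpha` are hypothetical.

```python
def track_targets(targets, alpha=0.2, start=0.0):
    """Follow a sequence of target positions with a first-order lag.

    At each step the position moves a fraction `alpha` of the way
    from where it is toward the current target.
    """
    position = start
    trajectory = []
    for target in targets:
        position += alpha * (target - position)
        trajectory.append(position)
    return trajectory


# A briefly held target is undershot; a long-held target is nearly reached.
brief = track_targets([1.0] * 5 + [0.0] * 5)
held = track_targets([1.0] * 50)
```

With the target held for only five steps, the position peaks well short of 1.0 before the next target pulls it back, so the realized articulation depends on the neighboring sounds.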

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29: Prosody

Figure 3.30: Global Coarticulation


Elements of a Language: Nasals

Source: Quasi-periodic airflow puffs from the vibrating vocal folds.

System: The velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).

Spectrogram:
  • Is dominated by the low resonance of the large volume of the nasal cavity.
  • The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract.
  • Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.
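In a discrete-time model, an anti-resonance corresponds to a pair of complex-conjugate zeros in the transfer function, which produces a dip in the magnitude response near the zero frequency. The sketch below evaluates the gain of a single zero pair, H(z) = 1 − 2r cos(ω0) z⁻¹ + r² z⁻². The zero frequency of 2000 Hz, the 10 kHz sampling rate, and the radius of 0.95 are arbitrary illustrative values, and the function name is hypothetical; none of these come from the slides.

```python
import cmath
import math


def antiresonance_gain(freq_hz, zero_hz, sample_rate, radius=0.95):
    """Magnitude of H(z) = 1 - 2 r cos(w0) z^-1 + r^2 z^-2 at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate     # evaluation frequency (rad/sample)
    w0 = 2 * math.pi * zero_hz / sample_rate    # zero frequency (rad/sample)
    z_inv = cmath.exp(-1j * w)                  # z^-1 on the unit circle
    return abs(1 - 2 * radius * math.cos(w0) * z_inv + radius ** 2 * z_inv ** 2)


SAMPLE_RATE = 10000.0
ZERO_HZ = 2000.0
gains = {f: antiresonance_gain(f, ZERO_HZ, SAMPLE_RATE) for f in (500.0, 2000.0, 4000.0)}
```

The gain at the zero frequency is much smaller than at the other two frequencies, which is the sense in which the side branch absorbs energy there. Moving the zeros closer to the unit circle (radius nearer 1) deepens the dip.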

Figure 3.20

Figure 3.21






April 22 2023 Veton Keumlpuska 83

Figure 321

April 22 2023 Veton Keumlpuska 84

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.

Source:
  Vocal folds are relaxed and not vibrating for unvoiced fricatives.
  For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction.
  Noise is generated by turbulent airflow at some point of constriction along the oral tract.
  The constriction is narrower than with vowels.

System:
  The location of the constriction determines which sound is produced. The tongue can form it at the back, center, or front of the oral tract, and it can also be formed at the teeth or lips.

Spectrogram:
  Noise-like; energy is concentrated in the higher frequencies.


Example 3.4

A voiced fricative is generated with both a periodic and a noise source. The periodic glottal flow component can be expressed as

$u[n] = g[n] * p[n]$

where $*$ denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.

A simplified model of the voiced-fricative output at the lips due to the periodic source is

$x_g[n] = h[n] * (g[n] * p[n])$

where h[n] is the impulse response of a linear time-invariant vocal tract driven by the periodic signal u[n].

The noise source component is modeled as a turbulent airflow velocity source at the constriction, denoted by q[n] and assumed to be white noise. The glottal flow u[n] modulates (multiplies) the noise q[n], and the product in turn excites the front oral cavity, which has impulse response $h_f[n]$:

$x_q[n] = h_f[n] * (q[n]\, u[n])$


Example 3.4 (continued)

In the simplified model we assume that the results of the two airflow sources add:

$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\, u[n])$

where $*$ denotes convolution. See Exercise 3.10 for special characteristics of x[n].

Issues that have been ignored:
  u[n] is modified by the oral cavity.
  $x_q[n]$ can be influenced by the back cavity.
  Sources of non-linear effects (distributed sources due to traveling vortices).
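The additive two-source model of Example 3.4 can be sketched numerically. This is only an illustration under stated assumptions: the glottal pulse `g`, the impulse responses `h` and `hf`, the pitch period, and the signal length are toy values chosen here, not values from the slides, and the white noise q[n] is drawn from Python's seeded Gaussian generator.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def voiced_fricative(g, h, hf, period, length, seed=0):
    """Return (u, xg, xq, x) for the simplified voiced-fricative model."""
    rng = random.Random(seed)
    # p[n]: impulse train with pitch period `period`
    p = [1.0 if n % period == 0 else 0.0 for n in range(length)]
    # u[n] = g[n] * p[n]: periodic glottal flow
    u = convolve(g, p)[:length]
    # q[n]: white-noise source at the constriction (assumed Gaussian here)
    q = [rng.gauss(0.0, 1.0) for _ in range(length)]
    # xg[n] = h[n] * u[n]: periodic component shaped by the vocal tract
    xg = convolve(h, u)[:length]
    # xq[n] = hf[n] * (q[n] u[n]): noise modulated by the glottal flow,
    # then shaped by the front cavity
    modulated = [q_n * u_n for q_n, u_n in zip(q, u)]
    xq = convolve(hf, modulated)[:length]
    # x[n] = xg[n] + xq[n]: the two contributions are assumed to add
    x = [a + b for a, b in zip(xg, xq)]
    return u, xg, xq, x


# Toy (hypothetical) values, for illustration only.
g = [0.0, 0.5, 1.0, 0.5]   # one cycle of glottal flow
h = [1.0, 0.6, 0.2]        # vocal tract impulse response
hf = [1.0, -0.5]           # front-cavity impulse response
u, xg, xq, x = voiced_fricative(g, h, hf, period=8, length=32)
```

Because the noise is multiplied by u[n], the noise contribution vanishes wherever the glottal flow is zero, so the noise in x[n] comes in pitch-synchronous bursts.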


Elements of a Language: Fricatives

Spectrogram:
  Unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.

Waveform:
  An unvoiced fricative contains only noise.
  A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper

Whisper:
  Forms a class of its own under the general category of consonants.
  Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds where the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System:
  The constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives:
  The vocal folds vibrate for the duration of all 4 steps.
  During the period when the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$u[n] = g[n] * p[n]$

where $*$ denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained by the convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by $h[n,m]$, while the burst excites a time-varying front cavity beyond a constriction, denoted by $h_f[n,m]$. Here $h[n,m]$ and $h_f[n,m]$ represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n-m.

The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:

$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\, u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\, \delta[n-m]$
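The generalized convolution of Example 3.5 can be sketched as a direct double loop. This is a minimal illustration, assuming causal signals that start at n = 0 (so the sum over m runs from 0 to n) and assuming the time-varying impulse responses are supplied as hypothetical Python callables `h(n, m)` and `hf(n, m)`; none of these names come from the slides.

```python
def time_varying_output(h, hf, u, length):
    """Evaluate x[n] = sum_m h(n, m) u[n - m] + sum_m hf(n, m) delta[n - m].

    h(n, m) and hf(n, m) give the response at time n to a unit sample
    applied m samples earlier. Signals are assumed zero for n < 0.
    """
    x = []
    for n in range(length):
        # Periodic-source term: only 0 <= m <= n contributes for causal u
        periodic = sum(h(n, m) * u[n - m] for m in range(n + 1))
        # Burst term: delta[n - m] is nonzero only at m = n
        burst = hf(n, n)
        x.append(periodic + burst)
    return x


# Hypothetical example: a one-pole response whose decay slows over time,
# loosely standing in for a vocal tract moving toward a steady vowel.
def h_example(n, m):
    a = min(0.5 + 0.05 * n, 0.9)
    return a ** m


def hf_example(n, m):
    return 0.5 ** m


u_example = [1.0 if n % 4 == 0 else 0.0 for n in range(16)]
x_example = time_varying_output(h_example, hf_example, u_example, 16)
```

When `h(n, m)` does not depend on n, the first term reduces to ordinary convolution, which is a convenient sanity check for any implementation of the time-varying form.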


Elements of a Language: Transitional Speech Sounds

Diphthongs:
  Vowel-like in nature, with vibrating vocal folds.
  Do not have a steady vocal tract configuration. They are produced by smoothly varying the vocal tract in time between two vowel configurations.
  Characterized by movement from one vowel target to another.
  Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
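The idea of moving smoothly between two vowel configurations can be illustrated by interpolating formant targets. This is a rough sketch only: the linear interpolation is an assumption made here for simplicity (real formant trajectories are not linear), and the (F1, F2) values are illustrative placeholders rather than measurements from the slides.

```python
def formant_track(start, end, steps):
    """Linearly interpolate each formant from a start target to an end target.

    `steps` must be at least 2 so that both endpoints are included.
    """
    track = []
    for k in range(steps):
        t = k / (steps - 1)  # 0.0 at the first vowel target, 1.0 at the second
        track.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return track


# Illustrative (F1, F2) targets in Hz for an open vowel and a close front vowel.
OPEN_TARGET = (730.0, 1090.0)
CLOSE_FRONT_TARGET = (270.0, 2290.0)

track = formant_track(OPEN_TARGET, CLOSE_FRONT_TARGET, steps=5)
```

Each tuple in `track` is one intermediate vocal tract state, so a diphthong is described by the whole trajectory rather than by a single steady configuration.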


Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds
  Glides (w as in "we" and y as in "you"), and
  Liquids (r as in "read" and l as in "let").

Glides, compared to diphthongs, have:
  Greater constriction of the oral tract during the transition, and
  Greater speed of the oral tract movement.

Figure 3.28 – Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference as compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
  /tS/ as in "chew" (unvoiced)
  /J/ as in "just" (voiced)


Coarticulation

Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.

Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.

Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  Local – articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: "horse" vs. "horseshoe", "sweep" vs. "seep".
  Global – articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
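The statement that a target is often never reached can be illustrated with a toy first-order lag. This is a sketch under strong assumptions, not a model from the slides: a single articulator parameter closes a fixed fraction `rate` of the remaining distance to its current target at each time step, which captures only the carry-over from past positions and not the anticipatory planning described above.

```python
def articulator_path(targets, durations, rate):
    """Move one articulator parameter toward each target in turn.

    targets:   desired positions, one per phoneme (arbitrary units)
    durations: number of time steps spent on each phoneme
    rate:      fraction of the remaining distance covered per step (0..1)
    """
    position = targets[0]
    path = []
    for target, steps in zip(targets, durations):
        for _ in range(steps):
            position += rate * (target - position)
            path.append(position)
    return path


# Hypothetical values: the middle phoneme has target 1.0.
fast_speech = articulator_path([0.0, 1.0, 0.0], [5, 3, 5], rate=0.3)
slow_speech = articulator_path([0.0, 1.0, 0.0], [5, 20, 5], rate=0.3)

peak_fast = max(fast_speech)  # falls well short of the target 1.0
peak_slow = max(slow_speech)  # comes much closer to the target
```

With only three steps on the middle phoneme the parameter turns back before reaching 1.0, so its realized value depends on its neighbors, which is the local form of coarticulation.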


Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  Intonation (change in pitch)
  Amplitude/energy (loudness)
  Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 – Global Coarticulation



Page 86: Speech Processing

April 22, 2023 Veton Këpuska 86

Example 3.4
  • In the simplified model, we assume that the outputs of the two airflow sources add:
        $x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$
    where $*$ denotes convolution.
  • See Exercise 3.10 for special characteristics of x[n].
  • Issues that have been ignored:
      • u[n] is modified by the oral cavity.
      • x_q[n] can be influenced by the back cavity.
      • Sources of non-linear effects (distributed sources due to traveling vortices).
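The additive model above can be sketched numerically. This is a minimal illustration, not the source's implementation: the helper names `convolve` and `voiced_fricative`, the half-wave-rectified sine standing in for the glottal flow u[n], and the two toy impulse responses are all assumptions made for the example.

```python
import math
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(h, hf, u, q):
    """x[n] = h[n]*u[n] + hf[n]*(q[n]u[n]), with * denoting convolution."""
    modulated_noise = [qn * un for qn, un in zip(q, u)]
    xg = convolve(h, u)                 # glottal component through the vocal tract
    xq = convolve(hf, modulated_noise)  # noise component through the front cavity
    length = max(len(xg), len(xq))
    xg += [0.0] * (length - len(xg))    # zero-pad so both outputs can be added
    xq += [0.0] * (length - len(xq))
    return [a + b for a, b in zip(xg, xq)]


# Toy inputs; every value here is hypothetical and chosen only for illustration.
rng = random.Random(0)
period = 80                                      # pitch period in samples
u = [max(0.0, math.sin(2 * math.pi * n / period)) for n in range(400)]
q = [rng.gauss(0.0, 1.0) for _ in range(400)]    # noise source
h = [0.9 ** k for k in range(30)]                # toy vocal-tract response
hf = [(-0.5) ** k for k in range(10)]            # toy front-cavity response
x = voiced_fricative(h, hf, u, q)
```

Because the noise is multiplied by u[n] before filtering, the noisy component vanishes wherever the toy glottal flow is zero, which is one way to picture noise riding on a quasi-periodic signal.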

Elements of a Language: Fricatives
  • Spectrogram:
      • Unvoiced fricatives are characterized by a "noisy" spectrum.
      • Voiced fricatives often show both noise and harmonics.
  • Waveform:
      • An unvoiced fricative contains only noise.
      • A voiced fricative contains noise superimposed on a quasi-periodic signal.

Elements of a Language: Whisper
  • Whisper forms a class of its own under the general category of consonants.
  • Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24 - Fricatives

Figure 3.23

Elements of a Language: Plosives
  • Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow.
  • As with fricatives, plosives can be voiced or unvoiced.
  • System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).
  • Sequence of events:
      1. Complete closure of the oral tract and buildup of air pressure.
      2. Release of air pressure and generation of turbulence over a very short time duration.
      3. Generation of aspiration due to turbulence at the open vocal folds.
      4. Onset of the following vowel about 40-50 ms after the burst.
  • With voiced plosives, the vocal folds vibrate for the duration of all four steps:
      • While the oral tract is closed, we hear a low-frequency vibration due to propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
      • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
      • There is a much shorter delay between the burst and the voicing of the vowel onset.
  • Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives - Waveform

Elements of a Language: Plosives - Spectrogram

Elements of a Language: Plosives
Example 3.5: A time-varying system model for the voiced plosive
  • A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
  • Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse $\delta[n]$.
  • The glottal flow velocity model for the periodic source component is given by
        $u[n] = g[n] * p[n]$
    where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and $*$ denotes convolution.
  • Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel.
      • This implies that the vocal tract output cannot be obtained with the ordinary convolution operator.
      • The output must instead be computed using the time-varying impulse response concept introduced in Chapter 2.
  • In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n, m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response h_f[n, m].
  • h[n, m] and h_f[n, m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n - m.
  • Assuming that the two outputs can be linearly combined, the output can be written using a generalization of the convolution operator:
        $x[n] = \sum_{m} h[n, m]\,u[n - m] + \sum_{m} h_f[n, m]\,\delta[n - m]$
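The time-varying superposition in Example 3.5 can be sketched as follows. This is only a toy under stated assumptions: the function names, the raised-cosine glottal pulse, the drifting single-pole responses used for h[n, m] and h_f[n, m], and the truncation of the sums to a finite `memory` are all hypothetical choices made for illustration, not values from the source.

```python
import math


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def time_varying_output(h, hf, u, burst, memory):
    """x[n] = sum_m h(n, m) u[n-m] + sum_m hf(n, m) burst[n-m].

    h and hf are callables giving the response at time n to a unit sample
    applied m samples earlier; the sums are truncated to `memory` terms.
    """
    x = []
    for n in range(len(u)):
        total = 0.0
        for m in range(min(memory, n + 1)):
            total += h(n, m) * u[n - m] + hf(n, m) * burst[n - m]
        x.append(total)
    return x


# Toy setup; every numeric value below is a hypothetical illustration.
length, period = 240, 60
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * k / 30)) for k in range(30)]
p = [1.0 if n % period == 0 else 0.0 for n in range(length)]
u = convolve(g, p)[:length]            # periodic source u[n] = g[n] * p[n]
burst = [1.0] + [0.0] * (length - 1)   # idealized burst: impulse at n = 0


def h(n, m):
    # Decay rate drifts with n to mimic a changing vocal tract shape.
    pole = 0.6 + 0.3 * n / length
    return pole ** m


def hf(n, m):
    # Short front-cavity response, also drifting with n.
    return (-(0.5 - 0.2 * n / length)) ** m


x = time_varying_output(h, hf, u, burst, memory=40)
```

When h(n, m) does not depend on n, the inner sum reduces to ordinary convolution, which is a quick sanity check on the generalization.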

Elements of a Language: Transitional Speech Sounds
Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
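One way to picture the smooth movement between two vowel targets is to interpolate formant frequencies from one configuration to the other. The sketch below is an assumption-laden illustration: the function name `formant_glide`, the raised-cosine weighting, and the (F1, F2) values are hypothetical and are not taken from the source.

```python
import math


def formant_glide(start, end, frames):
    """Smoothly interpolate formant targets (Hz) from `start` to `end`.

    Requires frames >= 2.
    """
    track = []
    for k in range(frames):
        # Raised-cosine weight: 0 at the first frame, 1 at the last.
        w = 0.5 - 0.5 * math.cos(math.pi * k / (frames - 1))
        track.append(tuple(s + w * (e - s) for s, e in zip(start, end)))
    return track


# Hypothetical (F1, F2) targets in Hz, chosen only for illustration.
start_vowel = (750.0, 1200.0)
end_vowel = (300.0, 2300.0)
track = formant_glide(start_vowel, end_vowel, frames=11)
```

Each entry of `track` is one intermediate vocal tract configuration, so the diphthong never dwells at a single steady state.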

Elements of a Language: Transitional Speech Sounds
Semi-Vowels
  • Two categories of vowel-like sounds:
      • Glides (/w/ as in "we" and /y/ as in "you")
      • Liquids (/r/ as in "read" and /l/ as in "let")
  • Glides, compared to diphthongs, have:
      • Greater constriction of the oral tract during the transition
      • Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates
  • Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples:
      • /tS/ as in "chew" (unvoiced)
      • /J/ as in "just" (voiced)

Coarticulation
  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
      • Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
      • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  • Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
  • Coarticulation can occur on different temporal levels:
      • Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g., "horse" vs. "horseshoe" and "sweep" vs. "seep".
      • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
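The idea that a target is sought but often never reached can be pictured with a very simple lag model. This is a toy assumption, not a model from the source: an articulator position moves a fixed fraction of the way toward the current target at each step, so a short segment leaves it short of the target. The name `track_targets`, the `rate` value, and the durations are hypothetical, and the sketch only captures the influence of past positions, not anticipation of future phonemes.

```python
def track_targets(targets, durations, rate):
    """Move a position toward each target in turn with a first-order lag."""
    position = targets[0]
    path = []
    for target, duration in zip(targets, durations):
        for _ in range(duration):
            # Close a fixed fraction of the remaining gap at each step.
            position += rate * (target - position)
            path.append(position)
    return path


# Hypothetical targets (arbitrary units) and segment lengths (in steps).
path = track_targets(targets=[0.0, 1.0, 0.0], durations=[5, 5, 5], rate=0.3)
peak = max(path)  # stays below the middle target of 1.0: undershoot
```

Lengthening the middle segment or raising `rate` lets the position approach the target more closely, which mirrors slower or more careful articulation.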

Prosody: The Melody of Speech
  • The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
      • Intonation (change in pitch)
      • Amplitude/energy (loudness)
      • Timing (articulation rate or rhythm)
  • These rules are followed to convey different meaning, stress, and emotion.
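Two of the prosodic cues listed above, pitch and energy, can be measured frame by frame. The sketch below is a minimal illustration under assumptions: the function names, the 60-400 Hz search range, the 8000 Hz sampling rate, and the synthetic test tone are hypothetical, and autocorrelation peak picking is just one common way to estimate pitch. Timing is not measured here.

```python
import math


def frame_energy(frame):
    """Short-time energy: sum of squared samples in the frame."""
    return sum(s * s for s in frame)


def estimate_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """Pick the autocorrelation peak within a plausible pitch-lag range."""
    lag_lo = int(fs / f_max)
    lag_hi = min(int(fs / f_min), len(frame) - 1)
    best_lag, best_value = lag_lo, float("-inf")
    for lag in range(lag_lo, lag_hi + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag


fs = 8000  # assumed sampling rate in Hz
# Synthetic test tone: 100 Hz, with amplitude doubling half-way through.
signal = [(1.0 if n < 400 else 2.0) * math.sin(2 * math.pi * 100 * n / fs)
          for n in range(800)]
frames = [signal[i:i + 400] for i in range(0, 800, 400)]
contour = [(estimate_pitch(f, fs), frame_energy(f)) for f in frames]
```

Each entry of `contour` pairs a pitch estimate in Hz with the frame energy, so a rising or falling sequence of either value traces intonation or loudness over time.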

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 3.1
  • Introduction (2)
  • Example of "Shop"
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 3.1
  • Example 3.1 (2)
  • Figure 3.7
  • Example 3.1 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 3.10
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 3.2
  • Example 3.2 (2)
  • Example 3.2 (3)
  • Example 3.2 (4)
  • Spectral Shaping (4)
  • Example 3.3
  • Figure 3.12
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives "Drop"
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives "NASA"
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 3.15
  • Figure 3.16
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 3.17
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 3.18
  • Figure 3.19
  • Elements of a Language Nasals
  • Figure 3.20
  • Figure 3.21
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation
Page 89: Speech Processing

April 22 2023 Veton Keumlpuska 89

Figure 324 - Fricatives

April 22 2023 Veton Keumlpuska 90

Figure 323

April 22 2023 Veton Keumlpuska 91

Elements of a Language Plosives Plosives form a class of sounds where the constriction is complete however

brief followed by the burst of flow As with fricatives plosives can be Voiced and Unvoiced

System Constriction can occur at

Front Center or Back of the oral tract (Figure 324)

Sequence of events1 Complete closure of the oral tract and buildup of air pressure2 Release of air pressure and generation of turbulence over a very short-time duration3 Generation of aspiration due to turbulence at the open vocal folds4 Onset of the following vowel about 40-50 ms after the burst

With voiced plosives vocal folds vibrate for duration of all 4 steps During the period when oral tract is closed we hear a low-frequency vibration due to propagation of vocal folds vibrations through the walls of the throat This activity is referred to as a ldquovoice barrdquo After the release of the burst unlike the unvoiced plosive there is little or no

aspiration There is much shorter delay between the burst and the voicing of the vowel onset

Figure 326 compares voicedunvoiced plosive pair

April 22 2023 Veton Keumlpuska 92

Elements of a Language Plosives Waveform

April 22 2023 Veton Keumlpuska 93

Elements of a Language Plosives Spectrogram

April 22 2023 Veton Keumlpuska 94

Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive

Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept

introduced in Chapter 2


Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract (Figure 3.24).

Sequence of events:
  1. Complete closure of the oral tract and buildup of air pressure.
  2. Release of air pressure and generation of turbulence over a very short time duration.
  3. Generation of aspiration due to turbulence at the open vocal folds.
  4. Onset of the following vowel about 40-50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps:
  • During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
  • After the release of the burst, unlike the unvoiced plosive, there is little or no aspiration.
  • There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure 3.26 compares a voiced/unvoiced plosive pair.

Elements of a Language: Plosives (Waveform)

Elements of a Language: Plosives (Spectrogram)

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n]. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where g[n] is the glottal flow over one cycle, p[n] is an impulse train with pitch period P, and * denotes convolution.
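The periodic source above can be illustrated with a minimal Python sketch. The pulse shape `g`, the pitch period `P = 8` samples, and the helper names `impulse_train` and `convolve` are assumptions chosen for illustration; they are not taken from the slides.

```python
def impulse_train(length, period):
    """p[n]: unit impulses every `period` samples, starting at n = 0."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


# Hypothetical one-cycle glottal pulse g[n]: a short triangle, for illustration only.
g = [0.0, 0.5, 1.0, 0.5]
P = 8                      # assumed pitch period in samples
p = impulse_train(24, P)   # impulses at n = 0, 8, 16
u = convolve(g, p)[:24]    # u[n] = g[n] * p[n], truncated to 24 samples
```

Because the pulse is shorter than the pitch period, `u` is simply a copy of `g` placed at every multiple of `P`, with zeros in between.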

Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. This implies that the vocal tract output cannot be obtained with the ordinary (time-invariant) convolution operator; it must instead be computed using the time-varying impulse response concept introduced in Chapter 2.

In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n,m], while the burst excites a time-varying front cavity beyond the constriction with impulse response h_f[n,m].

h[n,m] and h_f[n,m] represent the time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n-m.

The output can then be written using a generalization of the convolution operator, assuming that the two outputs can be linearly combined:

$$x[n] = \sum_{m} h[n,m]\,u[n-m] + \sum_{m} h_f[n,m]\,\delta[n-m]$$

where u[n] is the periodic glottal source and δ[n] is the idealized burst impulse.
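The superposition sum can be sketched in Python as follows. This is only an illustration under stated assumptions: the systems are taken to be causal, the sums are truncated to `n_taps` terms, and the particular responses `h` and `h_f` (decaying exponentials, one with a drifting decay rate) are hypothetical stand-ins, not models from the text.

```python
def time_varying_output(u, h, h_f, burst, n_taps):
    """x[n] = sum_m h(n, m) u[n - m] + sum_m h_f(n, m) burst[n - m].

    h(n, m) and h_f(n, m) are callables returning the response at time n to a
    unit sample applied m samples earlier. Causality is assumed (m >= 0) and
    the sums are truncated to m < n_taps.
    """
    x = []
    for n in range(len(u)):
        acc = 0.0
        for m in range(min(n, n_taps - 1) + 1):
            acc += h(n, m) * u[n - m] + h_f(n, m) * burst[n - m]
        x.append(acc)
    return x


N = 16
u = [1.0 if n % 8 == 0 else 0.0 for n in range(N)]   # crude periodic source
burst = [1.0] + [0.0] * (N - 1)                      # idealized burst, delta[n]


def h(n, m):
    """Hypothetical vocal tract response whose decay rate drifts with time n."""
    a = 0.5 + 0.4 * n / (N - 1)
    return a ** m


def h_f(n, m):
    """Hypothetical front-cavity response (time-invariant here for simplicity)."""
    return 0.3 ** m


x = time_varying_output(u, h, h_f, burst, n_taps=N)
```

If `h(n, m)` does not depend on `n`, the first sum reduces to ordinary convolution, which gives a quick sanity check on the routine.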

Elements of a Language: Transitional Speech Sounds

Diphthongs:
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration; they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".
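The idea of moving smoothly from one vowel target to another can be sketched as an interpolation between formant targets. The sketch below is a deliberately crude assumption: real diphthong trajectories are not linear, and the (F1, F2) values are rough illustrative numbers for an /a/-like and an /i/-like vowel, not data from the slides. The helper name `glide_track` is hypothetical.

```python
def glide_track(start, end, steps):
    """Linearly interpolate between two formant targets (Hz) over `steps` frames."""
    return [
        tuple(s + (e - s) * k / (steps - 1) for s, e in zip(start, end))
        for k in range(steps)
    ]


# Assumed (F1, F2) targets in Hz, for illustration only.
A_TARGET = (730.0, 1090.0)   # an /a/-like starting vowel
I_TARGET = (270.0, 2290.0)   # an /i/-like ending vowel

track = glide_track(A_TARGET, I_TARGET, steps=5)
```

Each entry of `track` is one frame of the transition: F1 falls while F2 rises, which is the kind of formant movement a diphthong such as the one in "hide" shows on a spectrogram.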

Elements of a Language: Transitional Speech Sounds

Semi-vowels: two categories of vowel-like sounds:
  • Glides (/w/ as in "we" and /y/ as in "you")
  • Liquids (/r/ as in "read" and /l/ as in "let")

Glides, compared to diphthongs, have:
  • Greater constriction of the oral tract during the transition
  • Greater speed of the oral tract movement

Figure 3.28: Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates are the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.

The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.

Examples:
  • /tS/ as in "chew" (unvoiced)
  • /J/ as in "just" (voiced)

Coarticulation

  • Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance. It can occur on different temporal levels:
  • Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. "horse" vs. "horseshoe", "sweep" vs. "seep".
  • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.

Prosody: The Melody of Speech

The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
  • Intonation (change in pitch)
  • Amplitude/energy (loudness)
  • Timing (articulation rate or rhythm)

These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29: Prosody

Figure 3.30: Global Coarticulation

Page 93: Speech Processing

April 22 2023 Veton Keumlpuska 93

Elements of a Language Plosives Spectrogram

April 22 2023 Veton Keumlpuska 94

Elements of a Language Plosives Example 35 A time ndashvarying system model for the voiced plosive

Voiced plosive is generated with a burst source and can also have present a periodic source throughout the user and into the following vowel Assuming that the burst occurs at time n=0 we idealize the burst source as an impulse [n] The glottal flow velocity model for the periodic source component is given by

u[n] = g[n]p[n] g[n] is the glottal flow over one cycle p[n] is an impulse train with pitch period P

Assume that the vocal tract is linear but time-varying due to changing vocal tract shape during its transition from the burst to a following steady vowel This implies that vocal tract output cannot be obtained by the convolution operator Vocal tract output thus must be computed using the time-varying impulse response concept

introduced in Chapter 2

In this simple model the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[nm] while the burst excites a time-varying front cavity beyond a constriction denoted by hf[nm]

Page 94: Speech Processing

April 22, 2023 Veton Këpuska 94

Elements of a Language: Plosives

Example 3.5: A time-varying system model for the voiced plosive

  • A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel.
  • Assuming that the burst occurs at time n = 0, we idealize the burst source as an impulse δ[n].
  • The glottal flow velocity model for the periodic source component is given by u[n] = g[n] * p[n], where * denotes convolution, g[n] is the glottal flow over one cycle, and p[n] is an impulse train with pitch period P.
  • Assume that the vocal tract is linear but time-varying, because the vocal tract shape changes during the transition from the burst to the following steady vowel. The vocal tract output therefore cannot be obtained with the ordinary (time-invariant) convolution operator; it must be computed using the time-varying impulse response concept introduced in Chapter 2.
  • In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response h[n,m], while the burst excites a time-varying front cavity beyond a constriction, with impulse response h_f[n,m].
  • h[n,m] and h_f[n,m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n-m.
  • The output can then be written using a generalization of the convolution operator:

$$x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]$$

  • We have assumed that the two outputs can be linearly combined.
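The time-varying superposition sum of Example 3.5 can be sketched numerically. This is a minimal illustration, not the slides' own code: it assumes finite-length, causal responses stored as 2-D arrays (row n is the impulse response at time n, column m is the lag), and the function names glottal_source and time_varying_output are hypothetical.

```python
import numpy as np

def glottal_source(g, period, num_samples):
    """Periodic source u[n] = g[n] * p[n] (convolution with an impulse train)."""
    p = np.zeros(num_samples)
    p[::period] = 1.0  # impulse train with pitch period `period`
    return np.convolve(p, g)[:num_samples]

def time_varying_output(h, h_f, u):
    """x[n] = sum_m h[n, m] u[n - m] + sum_m h_f[n, m] delta[n - m].

    h, h_f : arrays of shape (N, M); entry [n, m] is the response at time n
             to a unit sample applied m samples earlier (assumed layout).
    u      : periodic source samples, length N.
    """
    num_samples, num_lags = h.shape
    burst = np.zeros(num_samples)
    burst[0] = 1.0  # idealized burst: unit impulse at n = 0
    x = np.zeros(num_samples)
    for n in range(num_samples):
        # Causal signals: only lags with n - m >= 0 contribute.
        for m in range(min(num_lags, n + 1)):
            x[n] += h[n, m] * u[n - m] + h_f[n, m] * burst[n - m]
    return x
```

When every row of h is the same and h_f is zero, the sum reduces to ordinary convolution, which is a convenient sanity check for the array layout.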

Elements of a Language: Transitional Speech Sounds

Diphthongs
  • Vowel-like in nature, with vibrating vocal folds.
  • Do not have a steady vocal tract configuration: they are produced by varying the vocal tract smoothly in time between two vowel configurations.
  • Characterized by movement from one vowel target to another.
  • Examples: "hide" (Y), "out" (W), "boy" (O), "new" (JU).
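The movement from one vowel target to another can be pictured as a trajectory of formant frequencies. The sketch below assumes a straight-line glide between two targets, which is only one simple choice of "smooth" path; the function name formant_glide and the formant values in the usage note are illustrative placeholders, not measurements from the slides.

```python
import numpy as np

def formant_glide(start, end, num_frames):
    """Linearly interpolate formant frequencies (Hz) between two vowel targets.

    start, end : sequences of formant frequencies for the two targets.
    Returns an array of shape (num_frames, number of formants).
    """
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    # t runs from 0 (first target) to 1 (second target), one value per frame.
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - t) * start + t * end
```

For example, formant_glide([700, 1100], [300, 2300], 5) gives five frames that start at the first (hypothetical) target and end at the second, with the midpoint frame halfway between them.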

Elements of a Language: Transitional Speech Sounds

Semi-Vowels
  • Two categories of vowel-like sounds:
      • Glides (w as in "we" and y as in "you")
      • Liquids (r as in "read" and l as in "let")
  • Glides, compared to diphthongs, have:
      • Greater constriction of the oral tract during the transition
      • Greater speed of the oral tract movement

Figure 3.28 - Liquids & Glides

Elements of a Language: Transitional Speech Sounds

Affricates
  • Affricates are the consonant counterpart of diphthongs: they consist of plosive-fricative combinations.
  • The difference compared to fricatives is that affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
  • Examples: tS as in "chew" (unvoiced), J as in "just" (voiced).

Coarticulation
  • The vocal fold and vocal tract muscles are "programmed" to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  • Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
  • Coarticulation can occur on different temporal levels:
      • Local - articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time: "horse" vs. "horseshoe", "sweep" vs. "seep".
      • Global - articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
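The remark that a target is often never reached can be illustrated with a toy model. The sketch below treats an articulator as a first-order lag that moves a fixed fraction of the remaining distance toward its current target at every time step. This is an assumption made purely for illustration: it captures only the influence of past positions, not anticipation of future phonemes, and the name articulator_track and all numeric values are hypothetical.

```python
def articulator_track(targets, durations, rate):
    """Simulate an articulator chasing a sequence of targets.

    targets   : target positions, one per phoneme (arbitrary units).
    durations : number of time steps spent on each target.
    rate      : fraction of the remaining distance covered per step (0..1).
    """
    position = targets[0]  # start at rest on the first target
    track = []
    for target, steps in zip(targets, durations):
        for _ in range(steps):
            # Move part of the way toward the target; the rest is carried
            # over from the past position.
            position += rate * (target - position)
            track.append(position)
    return track
```

With targets [0.0, 1.0, 0.0], durations [1, 3, 3] and rate 0.5, the position climbs toward 1.0 but is pulled back by the next target before it gets there, so the middle target is undershot.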

Prosody: The Melody of Speech
  • The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
      • Intonation (change in pitch)
      • Amplitude/Energy (loudness)
      • Timing (articulation rate or rhythm)
  • These rules are followed to convey differences in meaning, stress, and emotion.

Figure 3.29 - Prosody

Figure 3.30 - Global Coarticulation

Page 98: Speech Processing

April 22 2023 Veton Keumlpuska 98

Elements of a Language Transitional Speech Sounds Affricates are the counterpart of

diphthongs consisting of consonant plosive-fricative combinations The difference as compared to fricatives is that

the affricates have A fricative portion preceded by a complete

constriction of the oral cavity Formed at the same place as for the plosive

Examples tS as in ldquochewrdquo - unvoiced J as in ldquojustrdquo - voiced

April 22 2023 Veton Keumlpuska 99

Coarticulation Vocal foldvocal tract muscles are ldquoprogrammedrdquo to seek a

target state or shape often the target is never reached Our speech anatomy cannot move to a desired position

instantaneously and thus past positions influence the present Furthermore to make anatomical movement easy and

graceful the brain anticipates the future and so the articulators at any time instant are influenced by where they have been and where they are going

Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance Coarticulation can occur on different temporal level

Local ndash articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time ldquohorserdquo vs ldquohorseshoerdquo ldquosweeprdquo vs ldquoseeprdquo

Global ndash articulators are influenced by phonemes that occur some time in the future beyond the succeeding or nearby phonemes

April 22 2023 Veton Keumlpuska 100

Prosody The Melody of Speech Prosody of a language is defined by the

rules that define changes in speech extending over more than one phoneme Intonation (change in pitch) AmplitudeEnergy (loudness) Timing (articulation rate or rhythm)

These rules are followed to convey different Meaning Stress and Emotion

April 22 2023 Veton Keumlpuska 101

Figure 329 - Prosody

April 22 2023 Veton Keumlpuska 102

Figure 330 ndash Global Coarticulation

  • Speech Processing
  • Introduction
  • Figure 31
  • Introduction (2)
  • Example of ldquoShoprdquo
  • Introduction (3)
  • Introduction (4)
  • Anatomy and Physiology of Speech Production
  • Anatomy and Physiology of Speech Production (2)
  • Lungs
  • Anatomy and Physiology of Speech Production (3)
  • Larynx
  • Anatomy and Physiology of Speech Production (4)
  • Anatomy and Physiology of Speech Production (5)
  • Anatomy and Physiology of Speech Production (6)
  • Anatomy and Physiology of Speech Production (7)
  • Example 31
  • Example 31 (2)
  • Figure 37
  • Example 31 (3)
  • Anatomy and Physiology of Speech Production (8)
  • Anatomy and Physiology of Speech Production (9)
  • Anatomy and Physiology of Speech Production (10)
  • Anatomy and Physiology of Speech Production (11)
  • Examples of atypical voice types
  • Vocal Tract
  • Spectral Shaping
  • Figure 310
  • Spectral Shaping (2)
  • Spectral Shaping (3)
  • Vowels
  • Example 32
  • Example 32 (2)
  • Example 32 (3)
  • Example 32 (4)
  • Spectral Shaping (4)
  • Example 33
  • Figure 312
  • Nasal Sounds
  • Spectral Shaping (5)
  • Spectral Shaping Nose
  • Spectral Shaping Mouse
  • Spectral Shaping (6)
  • Plosives
  • Source Generation
  • Source Generation Plosives ldquoDroprdquo
  • Fricatives
  • Source Generation (2)
  • Source Generation Fricatives ldquoNASArdquo
  • Source Generation (3)
  • Categorization of Sound By Source
  • Categorization of Sound By Source (2)
  • Spectrographic Analysis of Speech
  • Spectrographic Analysis of Speech (2)
  • Spectrographic Analysis of Speech (3)
  • Spectrographic Analysis of Speech (4)
  • Spectrographic Analysis of Speech (5)
  • Wide-band Spectrogram
  • Narrow-band Spectrogram
  • Spectrographic Analysis of Speech (6)
  • Spectrographic Analysis of Speech (7)
  • Spectrographic Analysis of Speech (8)
  • Spectrographic Analysis of Speech (9)
  • Spectrographic Analysis of Speech (10)
  • Spectrographic Analysis of Speech (11)
  • Figure 315
  • Figure 316
  • MATLAB
  • MATLAB (2)
  • MATLAB (3)
  • MATLAB (4)
  • Categorization of Speech Sounds
  • Elements of a Language
  • Elements of a Language (2)
  • Figure 317
  • Elements of a Language (3)
  • Elements of a Language (4)
  • Elements of a Language Vowels
  • Figure 318
  • Figure 319
  • Elements of a Language Nasals
  • Figure 320
  • Figure 321
  • Elements of a Language Fricatives
  • Example 3.4
  • Example 3.4 (2)
  • Elements of a Language Fricatives (2)
  • Elements of a Language Whisper
  • Figure 3.24 - Fricatives
  • Figure 3.23
  • Elements of a Language Plosives
  • Elements of a Language Plosives (2)
  • Elements of a Language Plosives (3)
  • Elements of a Language Plosives (4)
  • Elements of a Language Transitional Speech Sounds
  • Elements of a Language Transitional Speech Sounds (2)
  • Figure 3.28 - Liquids & Glides
  • Elements of a Language Transitional Speech Sounds (3)
  • Coarticulation
  • Prosody: The Melody of Speech
  • Figure 3.29 - Prosody
  • Figure 3.30 - Global Coarticulation
Page 99: Speech Processing

April 22, 2023 Veton Këpuska 99

Coarticulation

  • Vocal fold/vocal tract muscles are “programmed” to seek a target state or shape; often the target is never reached.
  • Our speech anatomy cannot move to a desired position instantaneously, and thus past positions influence the present.
  • Furthermore, to make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced by where they have been and where they are going.
  • Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
  • Coarticulation can occur on different temporal levels:
      • Local: articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time, e.g. “horse” vs. “horseshoe”, “sweep” vs. “seep”.
      • Global: articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
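The carryover half of this idea (past positions influence the present, and targets are often not reached) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: a single one-dimensional "articulator position", first-order dynamics, and made-up target values; the function name `articulator_track` and the `rate` parameter are hypothetical and do not come from the slides. Anticipation of future phonemes is not modeled.

```python
def articulator_track(targets, steps_per_phone, rate=0.3, start=0.0):
    """Move a 1-D articulator position toward each target in turn.

    Each step closes a fixed fraction `rate` of the remaining gap, so the
    position cannot jump instantaneously and always depends on where it
    has been (an assumed toy model, not a physiological one).
    """
    position = start
    track = []
    for target in targets:
        for _ in range(steps_per_phone):
            position += rate * (target - position)
            track.append(position)
    return track


# Hypothetical target positions for three successive phonemes.
targets = [1.0, 0.0, 1.0]

slow = articulator_track(targets, steps_per_phone=20)  # careful speech
fast = articulator_track(targets, steps_per_phone=3)   # rapid speech

# Position at the end of the first phoneme: the fast version undershoots.
print("slow:", round(slow[19], 3), "fast:", round(fast[2], 3))

# Same second target (0.0), different preceding phoneme: the position
# reached for the second phoneme depends on its left neighbor.
after_high = articulator_track([1.0, 0.0], steps_per_phone=3)[-1]
after_low = articulator_track([-1.0, 0.0], steps_per_phone=3)[-1]
print("after high neighbor:", round(after_high, 3),
      "after low neighbor:", round(after_low, 3))
```

With few steps per phoneme the position stops short of its target, and the value reached for the same target differs with the preceding phoneme, which is the local effect described above.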

April 22, 2023 Veton Këpuska 100

Prosody: The Melody of Speech

  • The prosody of a language is governed by rules that describe changes in speech extending over more than one phoneme:
      • Intonation (change in pitch)
      • Amplitude/Energy (loudness)
      • Timing (articulation rate or rhythm)
  • These rules are followed to convey different meaning, stress, and emotion.
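Two of these cues, intonation and energy, can be measured frame by frame. The sketch below is illustrative only and rests on several assumptions not taken from the slides: an 8 kHz sample rate, 50 ms frames, pure sinusoids standing in for voiced speech, and a plain autocorrelation peak picker as the pitch estimator. All names (`synth`, `frame_energy`, `frame_pitch`) are hypothetical. Timing appears only as the start time of each frame.

```python
import math

FS = 8000      # assumed sample rate in Hz
FRAME = 400    # assumed frame length: 400 samples = 50 ms


def synth(freqs_amps, frame=FRAME, fs=FS):
    """Concatenate one frame-length sinusoid per (frequency, amplitude) pair."""
    signal = []
    for freq, amp in freqs_amps:
        signal += [amp * math.sin(2 * math.pi * freq * n / fs)
                   for n in range(frame)]
    return signal


def frame_energy(frame):
    """Mean squared amplitude, a rough correlate of loudness."""
    return sum(x * x for x in frame) / len(frame)


def frame_pitch(frame, fs=FS, fmin=80, fmax=400):
    """Estimate pitch from the autocorrelation peak between fmax and fmin."""
    best_lag, best_val = None, float("-inf")
    for lag in range(fs // fmax, fs // fmin + 1):
        val = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag


# Synthetic "utterance": pitch rises across frames while loudness peaks
# on the third frame (values chosen arbitrarily for illustration).
signal = synth([(100, 0.5), (125, 0.8), (160, 1.0), (200, 0.6)])
frames = [signal[i:i + FRAME] for i in range(0, len(signal), FRAME)]

pitch_contour = [frame_pitch(f) for f in frames]
energy_contour = [frame_energy(f) for f in frames]

for i, (f0, e) in enumerate(zip(pitch_contour, energy_contour)):
    print(f"t={i * FRAME / FS:.2f}s  pitch={f0:.0f} Hz  energy={e:.3f}")
```

On real speech the pitch estimate would need voicing detection and smoothing; the point here is only that intonation and energy are contours spanning several frames rather than properties of a single phoneme.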

April 22, 2023 Veton Këpuska 101

Figure 3.29 - Prosody

April 22, 2023 Veton Këpuska 102

Figure 3.30 - Global Coarticulation

