Parts I-II 1

Computational Audition

DeLiang Wang

Perception and Neurodynamics Lab

Ohio State University

http://www.cse.ohio-state.edu/pnl

Parts I-II 2

Human brain

Parts I-II 3

Three levels of analysis

Marrian framework for understanding complex

information processing systems (Marr’82)

Computational theory

• Goals of computation, appropriateness of the goal, general strategies

• Representation/Algorithm

• How to represent the input and the output

• Algorithms for transforming from one representation to another

• Implementation

• How can the representation and algorithm be realized physically

(architecture, hardware)?

Parts I-II 4

What is the goal of audition?

What is the goal of perception?

The perceptual systems are ways of seeking and extracting

information about the environment from sensory input (Gibson’66)

The purpose of vision is to produce a visual description of the

environment for the viewer (Marr’82)

By analogy, the purpose of audition is to produce an

auditory description of the environment for the listener

The description is unique to the listener, even though the physical

environment can be common for different listeners

Parts I-II 5

Outline of short course

Part I: Acoustics and signal processing background

Part II: Physiological and perceptual basis of audition

Part III: Fundamentals of computational audition

Part IV: Computational auditory scene analysis

Part V: Pattern recognition and learning

Discussion and conclusion

Parts I-II 6


Part I: Acoustics and Signal Processing

Background

DeLiang Wang




Parts I-II 7

Outline of Part I

Basic concepts in acoustics

Amplitude, phase, frequency/period

Power, intensity, decibels

Spectrum, formant, envelope

Pitch, loudness, timbre

Basic signal processing

Cepstrum

Linear prediction coding (LPC)

Parts I-II 8

Basics of sound

• Mathematics of the pure tone

$x(t) = A \sin(2\pi t / T + \phi)$

or

$x(t) = A \sin(2\pi f t + \phi)$

• A: amplitude

• φ: phase

• T: period

• f: frequency (f = 1/T)
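The pure-tone formula can be sampled directly. The following is a minimal sketch; the function name `pure_tone`, the 16 kHz sample rate, and the example parameter values are assumptions chosen for illustration and do not come from the slides.

```python
import numpy as np

def pure_tone(amplitude, freq_hz, phase_rad, duration_s, sample_rate=16000):
    """Sample x(t) = A sin(2*pi*f*t + phi) at a fixed sample rate."""
    num_samples = int(round(duration_s * sample_rate))
    t = np.arange(num_samples) / sample_rate
    return t, amplitude * np.sin(2 * np.pi * freq_hz * t + phase_rad)

# A 100 Hz tone has period T = 1/f = 10 ms, i.e. 160 samples at 16 kHz.
t, x = pure_tone(amplitude=0.5, freq_hz=100.0, phase_rad=0.0, duration_s=0.05)
period_samples = 160
```

Shifting the sampled waveform by one period (160 samples here) reproduces it, which is the discrete counterpart of x(t + T) = x(t).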

Parts I-II 9

Phase lead – phase lag

Parts I-II 10

Power, intensity, and decibels

• Treat signal x(t) as voltage

• By Ohm's law, the current i(t) = x(t)/R

• Then the instantaneous power is

$P(t) = x(t)\, i(t) = x^2(t) / R$

• Energy is the integrated power over a certain time period (e.g. kilowatt vs. kilowatt-hour)

Parts I-II 11

Power, intensity, and decibels (cont.)

• Treat signal x(t) as sound pressure

• Then the instantaneous intensity is

$I(t) = x^2(t) / (\rho c)$

• I(t) is measured in watts/m² (x(t): pressure, in newtons/m² or pascals)

• ρ: the density of the medium

• c: speed of sound
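As a rough numerical check of the intensity formula, the sketch below evaluates I = x²/(ρc) for a given pressure. The function name and the default values for air (ρ ≈ 1.2 kg/m³, c ≈ 343 m/s, roughly room temperature) are assumptions for illustration and are not taken from the slides.

```python
def instantaneous_intensity(pressure_pa, density=1.2, sound_speed=343.0):
    """Return I = x^2 / (rho * c) in W/m^2 for a pressure given in pascals.

    density (kg/m^3) and sound_speed (m/s) default to approximate values
    for air at room temperature; these defaults are assumptions.
    """
    return pressure_pa ** 2 / (density * sound_speed)

# A pressure of 20 micropascals gives an intensity close to 1e-12 W/m^2.
reference_intensity = instantaneous_intensity(20e-6)
```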

Parts I-II 12

Sound level

• Ratio of one sound to another (baseline), expressed in decibels (dB)

$L_2 - L_1 \text{ (decibels)} = 10 \log_{10}(I_2 / I_1)$

• Note the use of common logarithm

• Doubling the intensity adds about 3 dB, and doubling the amplitude adds about 6 dB

• SNR: signal-to-noise ratio

• Conversational speech is about 65 dB. Above 100 dB is damaging to the ear
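The 3 dB and 6 dB figures follow from the level formula, because intensity is proportional to the square of amplitude. A minimal sketch, with function names that are illustrative only:

```python
import math

def level_difference_db(intensity_2, intensity_1):
    """L2 - L1 in decibels for an intensity ratio: 10 log10(I2 / I1)."""
    return 10.0 * math.log10(intensity_2 / intensity_1)

def level_difference_db_from_amplitude(amplitude_2, amplitude_1):
    """Same level difference from an amplitude ratio.

    Intensity is proportional to amplitude squared, so
    10 log10((a2/a1)^2) = 20 log10(a2/a1).
    """
    return 20.0 * math.log10(amplitude_2 / amplitude_1)

double_intensity_db = level_difference_db(2.0, 1.0)                 # about 3.01 dB
double_amplitude_db = level_difference_db_from_amplitude(2.0, 1.0)  # about 6.02 dB
```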

Parts I-II 13

How loud are sounds?

Source: http://www.handsandvoices.org/resource_guide/055_audiogram.html

Parts I-II 14

Spectrum

• Fourier Series: For any periodic function of time, x(t), with period T, i.e.

x(t+mT) = x(t), for all integer m

• x(t) can be represented as a Fourier series like this

$x(t) = A_0 + \sum_{n=1}^{\infty} [A_n \cos(\omega_n t) + B_n \sin(\omega_n t)]$

• Furthermore,

$\omega_n = n\omega_0 = 2\pi n / T$

• n is an integer and $\omega_0$ is the fundamental frequency
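To make the series concrete, the sketch below evaluates a truncated Fourier series and uses it to approximate a square wave. The square-wave coefficients (B_n = 4/(nπ) for odd n, all other coefficients zero) are a standard textbook example rather than something from the slides, and the function name is hypothetical.

```python
import numpy as np

def fourier_series(t, a0, a, b, period):
    """Evaluate A0 + sum_n [A_n cos(n w0 t) + B_n sin(n w0 t)] for n = 1..len(a)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    w0 = 2 * np.pi / period                      # fundamental (angular) frequency
    n = np.arange(1, len(a) + 1)
    angles = np.outer(t, n) * w0                 # shape (len(t), number of harmonics)
    return a0 + np.cos(angles) @ a + np.sin(angles) @ b

# Square wave of period 1: only odd sine harmonics, B_n = 4 / (n * pi).
n = np.arange(1, 200)
b = np.where(n % 2 == 1, 4.0 / (np.pi * n), 0.0)
a = np.zeros_like(b)
values = fourier_series([0.25, 0.75], a0=0.0, a=a, b=b, period=1.0)
```

With 199 harmonics the partial sum is close to +1 in the middle of the first half-period and close to -1 in the middle of the second.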

Parts I-II 15

Spectrum (cont.)

• "The multiplicity of vibrational forms which can be thus produced by the composition of simple pendular vibrations is not merely extraordinarily great; it is so great that it can not be greater." (H. Helmholtz, 1863)

Parts I-II 16

Waveform illustration

A periodic waveform

Parts I-II 17

Spectrum illustration

Parts I-II 18


Parts I-II 19


Parts I-II 20

Fourier transform

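• For any function of time, x(t), the Fourier transform X(ω)

of x(t) is defined in terms of the Fourier integral:

$X(\omega) = \int_{-\infty}^{\infty} e^{-i\omega t} x(t)\, dt$

• The Fourier transform converts a function of time to a

function of frequency

• Inverse Fourier transform

$x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i\omega t} X(\omega)\, d\omega$

where $e^{-i\omega t} = \cos(\omega t) - i \sin(\omega t)$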
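In practice the spectrum of a sampled signal is usually estimated with the discrete Fourier transform rather than the continuous integral. The sketch below is one such illustration, assuming NumPy's FFT routines, an 8 kHz sample rate, and a made-up harmonic signal with a 200 Hz fundamental.

```python
import numpy as np

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate          # 1 s of signal, so bins are 1 Hz apart

# Periodic complex tone: fundamental at 200 Hz plus two weaker harmonics.
x = (1.00 * np.sin(2 * np.pi * 200 * t)
     + 0.50 * np.sin(2 * np.pi * 400 * t)
     + 0.25 * np.sin(2 * np.pi * 600 * t))

# Magnitude spectrum, scaled so that a sinusoid of amplitude A shows up as A.
spectrum = np.abs(np.fft.rfft(x)) / (len(x) / 2)
freqs = np.fft.rfftfreq(len(x), d=1 / sample_rate)

component_freqs = freqs[spectrum > 0.1]           # frequencies with significant energy
```

The detected components are the fundamental and its harmonics; the lowest one is the fundamental frequency of this signal.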

Parts I-II 21

Pitch

• Definition: "that attribute of auditory sensation in terms

of which sounds may be ordered on a musical scale"

(American Standards Association, 1960)

• Pitch is related to the repetition rate of the waveform:

• For pure tone, pitch corresponds to its frequency

• For a periodic complex tone, pitch corresponds to its fundamental

frequency

Parts I-II 22

Envelope and formant

• Envelope: Amplitude variation (modulation)

• Formant: A resonance in the vocal tract which is usually manifested as a peak in the spectral envelope of a speech sound

Parts I-II 23

Formant illustration

Parts I-II 24

Three characteristics of sound sensation

• Pitch (fundamental frequency)

• Loudness (intensity)

• Timbre (quality - spectral envelope)

Parts I-II 25

Signal representations: Cepstrum

• Source-filter separation for sound production

• For speech, source corresponds to excitation by a pulse train for voiced phonemes and to turbulence (noise) for unvoiced phonemes, and filter corresponds to vocal tract (resonators)

• For music, source corresponds to vibrations (e.g. vibrating strings in plucked or bowed string instrument) and filter corresponds to the body of the instrument

• Overall signal reaching the ear is the convolution of source with the impulse response of filter

$y(t) = x(t) * h(t) = h(t) * x(t) = \int_{-\infty}^{\infty} x(\tau)\, h(t - \tau)\, d\tau$

• Cepstral analysis attempts to separate source from filter, hence it can be viewed as deconvolution

Parts I-II 26

Speech production illustration

Parts I-II 27

Real cepstrum

• For speech, the spectral magnitude can be written as

$|X(\omega)| = |V(\omega)|\, |E(\omega)|$

where V is the vocal tract (filter) term and E is the excitation (source) term

• Taking the logarithm yields

$\log |X(\omega)| = \log |V(\omega)| + \log |E(\omega)|$

• Observation for speech production

• The E term corresponds to an event (e.g. a pulse train with a frequency of 100 Hz) more extended in time than the impulse response of the vocal tract. Analogously, E corresponds to “carrier” and V corresponds to “envelope” in the frequency domain. In other words, E varies more quickly with respect to ω than V

• Hence, one can apply some kind of “filter” to separate “high-frequency” components from “low-frequency” components, and thus the E term from the V term

Parts I-II 28

Real cepstrum (cont.)

• Change of notations because the variable is frequency

rather than time

• Filtering -> liftering

• Frequency response -> quefrency response

• Spectrum -> cepstrum

• High (low) frequency components -> high (low) time components or

high (low) quefrency components

Parts I-II 29

Real cepstrum (cont.)

• The log-operation converts a multiplicative term into an additive term, which can be operated upon by a linear operation such as filtering. The cepstrum is defined as the inverse Fourier transform of the log spectral magnitude

$c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{i\omega n} \log |X(\omega)|\, d\omega$

• c(n) is called the nth cepstral coefficient

• Given separated cepstra for excitation and vocal tract, they can be inverted to give original spectral magnitudes

• Only a moderate number of cepstral coefficients (e.g. 10-14) is needed for many applications, including speech recognition

• Complex cepstrum exists as well
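A discrete version of the real cepstrum can be sketched as the inverse DFT of the log magnitude spectrum. The example below is a toy illustration under several assumptions that are not from the slides: an 8 kHz sample rate, a 100 Hz pulse train as the source, a single decaying resonance near 500 Hz standing in for the vocal tract, a frame that holds a whole number of pitch periods, and a small constant `eps` added before the logarithm to avoid log(0).

```python
import numpy as np

def real_cepstrum(x, eps=1e-3):
    """Inverse DFT of the log magnitude spectrum; eps guards against log(0)."""
    log_magnitude = np.log(np.abs(np.fft.rfft(x)) + eps)
    return np.fft.irfft(log_magnitude, n=len(x))

sample_rate = 8000
period = 80                                   # 100 Hz pulse train at 8 kHz
excitation = np.zeros(800)
excitation[::period] = 1.0                    # source: one pulse every 80 samples

n = np.arange(40)                             # toy "vocal tract" impulse response
vocal_tract = 0.9 ** n * np.cos(2 * np.pi * 500 * n / sample_rate)

x = np.convolve(excitation, vocal_tract)[:800]   # source convolved with filter

c = real_cepstrum(x)
search = slice(20, 120)                       # quefrencies from 2.5 ms to 15 ms
peak_quefrency = search.start + int(np.argmax(c[search]))
pitch_hz = sample_rate / peak_quefrency
```

The excitation shows up as a peak at the quefrency of the pitch period (80 samples, i.e. 100 Hz), while the slowly varying vocal tract term is concentrated at low quefrencies. Keeping only the low-quefrency coefficients (liftering) would give an estimate of the spectral envelope.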

Parts I-II 30

Cepstral analysis illustration

Parts I-II 31

LPC for speech modeling

• The vocal tract can be modeled as a cascaded set of acoustic tubes, each corresponding to a resonator

• Furthermore, each resonator corresponds to a formant

• Complete vowel spectrum can be reasonably represented by six resonators

• A direct implementation of the spectral model is written as an all-pole filter in the complex z domain (z-transform is the discrete-time counterpart of the Laplace transform - generalized form of the Fourier transform):

$H(z) = \frac{1}{1 - \sum_{j=1}^{P} a_j z^{-j}}$

• P is twice the number of resonators, and the $a_j$ are coefficients

Parts I-II 32

LPC illustration

Parts I-II 33

LPC (cont.)

• In the above system, the discrete-time response y(n) to the excitation x(n) can be written as

$y(n) = x(n) + \sum_{j=1}^{P} a_j\, y(n - j)$

• In LPC, the coefficients are computed to give an approximation to the original signal. That is, one attempts to predict the speech signal by a linear, weighted sum of its previous values:

$\tilde{y}(n) = \sum_{j=1}^{P} a_j\, y(n - j)$

• $\tilde{y}(n)$ is the linear predictor of y(n)

• The coefficients that produce the best approximation are called the linear prediction coefficients

Parts I-II 34

LPC (cont.)

• The difference between the predictor and the original

signal is called the error signal, residual error, LPC

residual, or prediction error

$e(n) = y(n) - \tilde{y}(n)$

• e(n) can be viewed as an approximation to the

excitation signal

Parts I-II 35

Residual error illustration

Parts I-II 36

LPC (cont.)

• Computing the coefficients can be viewed as an

optimization problem, where square error is generally

used

$D = \sum_{n=0}^{N-1} e^2(n) = \sum_{n=0}^{N-1} \Big[ y(n) - \sum_{j=1}^{P} a_j\, y(n - j) \Big]^2$

• Various methods can be employed to find coefficients,

including gradient descent
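Because the squared error is quadratic in the coefficients, it can also be minimized in closed form. The sketch below uses ordinary least squares (via `numpy.linalg.lstsq`) instead of gradient descent. It makes two simplifying assumptions: the fit uses only samples n ≥ P, so that every past value y(n - j) is available, and the residual treats samples before n = 0 as zero. The function names and the two-pole test signal are hypothetical.

```python
import numpy as np

def lpc_least_squares(y, order):
    """Fit a_1..a_P minimizing sum_n [y(n) - sum_j a_j y(n-j)]^2 over n >= P."""
    y = np.asarray(y, dtype=float)
    # Column j-1 holds y(n-j) for n = P..N-1.
    past = np.column_stack([y[order - j:len(y) - j] for j in range(1, order + 1)])
    coeffs, *_ = np.linalg.lstsq(past, y[order:], rcond=None)
    return coeffs

def lpc_residual(y, coeffs):
    """e(n) = y(n) - sum_j a_j y(n-j), treating samples before n = 0 as zero."""
    y = np.asarray(y, dtype=float)
    prediction = np.zeros_like(y)
    for j, a_j in enumerate(coeffs, start=1):
        prediction[j:] += a_j * y[:-j]
    return y - prediction

# Test signal: one resonator (P = 2) excited by a single impulse.
order = 2
true_a = np.array([2 * 0.9 * np.cos(2 * np.pi * 500 / 8000), -0.81])
x = np.zeros(200)
x[0] = 1.0
y = np.zeros(200)
for n in range(len(y)):
    y[n] = x[n]
    for j in range(1, order + 1):
        if n - j >= 0:
            y[n] += true_a[j - 1] * y[n - j]

estimated_a = lpc_least_squares(y, order)
residual = lpc_residual(y, estimated_a)
```

For this noiseless all-pole signal the estimated coefficients match the true ones and the residual reproduces the impulse excitation. Real speech is only approximately all-pole, so the residual is then only an approximation to the excitation.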

Parts I-II 37

LPC (cont.)

• Properties of LPC representation

• For a harmonic signal, the (spectral) model spectrum tends to follow

(hug) harmonic peaks, but not harmonic valleys, hence yielding an

estimate of the envelope of the signal spectrum

• Too many coefficients will yield a good fit to signal spectrum, but

miss spectral envelope. On the other hand, too few coefficients will

miss formants. A reasonable number is between 10 and 20.

• Prediction error is significantly higher for unvoiced speech

• Compared to Fourier and cepstral analysis, LPC is more

directly related to vocal tract characteristics

Parts I-II 38

More LPC illustrations

Parts I-II 39


Parts I-II 40


Parts I-II 41

Spectral analysis via filterbanks

Parts I-II 42

Summary table

Parts I-II 43

Summary of Part I

• A number of important concepts in acoustics, e.g.

intensity and decibels

• Sensation of a signal is based on acoustic characteristics,

but not the same

• Advanced signal representations, e.g. cepstrum, highlight

important features

Parts I-II 44


Part II: Physiological and Perceptual

Basis of Audition

DeLiang Wang




Parts I-II 45

Outline of Part II

Physiological basis

Psychoacoustic basis

Auditory scene analysis

Real-world audition

Parts I-II 46

Human ear

A complex mechanism for transducing pressure variations in the air to neural impulses in auditory nerve fibers

Parts I-II 47

Traveling wave

• Different frequencies of sound give rise to maximum vibrations at different places along the basilar membrane

• The frequency of vibration at a given place is equal to that of the nearest stimulus component (resonance)

• Hence, the cochlea performs a frequency analysis

Parts I-II 48

Auditory nerve response

Parts I-II 49

Cochlear filtering model

The gammatone function

approximates

physiologically-recorded

impulse responses

n = filter order (typically 4)

b = bandwidth

f0 = centre frequency

φ = phase
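The gammatone function is commonly written as g(t) = t^(n-1) exp(-2πbt) cos(2πf0 t + φ) for t ≥ 0, which is the form assumed in the sketch below; the slide's own equation is not preserved in this text. The bandwidth rule (b = 1.019 times the equivalent rectangular bandwidth, with ERB = 24.7(4.37 f0/1000 + 1) Hz) is also an assumption drawn from common practice, and the function names are illustrative.

```python
import numpy as np

def gammatone_impulse_response(f0, b, order=4, phase=0.0,
                               duration=0.05, sample_rate=16000):
    """Sample g(t) = t^(n-1) exp(-2*pi*b*t) cos(2*pi*f0*t + phase) for t >= 0."""
    t = np.arange(int(round(duration * sample_rate))) / sample_rate
    envelope = t ** (order - 1) * np.exp(-2 * np.pi * b * t)
    return t, envelope * np.cos(2 * np.pi * f0 * t + phase)

def erb_bandwidth(f0):
    """Assumed bandwidth rule: b = 1.019 * ERB(f0), ERB = 24.7 (4.37 f0/1000 + 1)."""
    return 1.019 * 24.7 * (4.37 * f0 / 1000.0 + 1.0)

t, g = gammatone_impulse_response(f0=1000.0, b=erb_bandwidth(1000.0))

# The magnitude response of the filter peaks near the centre frequency.
spectrum = np.abs(np.fft.rfft(g))
freqs = np.fft.rfftfreq(len(g), d=1 / 16000)
peak_hz = freqs[np.argmax(spectrum)]
```

Because the assumed bandwidth grows with centre frequency, filters at higher centre frequencies are broader, unlike the fixed-width bins of a Fourier analysis.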

Parts I-II 50

Gammatone filterbank

• Each position on the basilar membrane is simulated by a single gammatone filter with appropriate centre frequency and bandwidth

• A small number of filters (e.g. 32) are generally sufficient to cover the range 50 Hz to 8 kHz

• Note variation in bandwidth with frequency (unlike Fourier analysis)

Parts I-II 51

Response to a pure tone

• Many channels respond, but those closest to tone frequency respond most strongly (place coding)

• The interval between successive peaks also encodes the tone frequency (temporal coding)

• Note propagation delay along the membrane model

Parts I-II 52

Beyond the periphery

• The auditory system is complex, with four relay stations between periphery and cortex rather than the one in the visual system

• In comparison to the auditory periphery, central parts of the auditory system are less understood

• Number of neurons in the primary auditory cortex is comparable to that in the primary visual cortex despite the fact that the number of fibers in the auditory nerve is far fewer than that of the optic nerve (thousands vs. millions)

The auditory nerve

Parts I-II 53

Demonstrations by the Institute for Perception Research

(IPO) and the Acoustical Society of America

• Section I. Frequency Analysis and Critical Bands

• Demo 1. Cancelled harmonics (Track 1; 1 min and 33s)

• Demo 2. Critical bands by masking (2-6; 1:50)

• Section II. Sound Pressure, Power, Loudness

• Demo 4. Decibel scale (8-10; 1:57)

• Demo 5. Filtered noise (12-15; 1:50)

Basics of auditory perception

Parts I-II 54

• Section IV. Pitch

• Demo 12. Dependence of pitch on intensity (27-28; 0:48)

• Demo 19. Pitch streaming (36; 1:22)

• Demo 26. Scales with repetition pitch (49-51; 1:25)

• Demo 27. Circularity in pitch judgement, or Shepard scale illusion

(52; 1:20)

Basics of auditory perception (cont.)

Parts I-II 55

• Section V. Timbre

• Demo 28. Effect of spectrum on timbre (53; 1:17)

• Section VI. Beats, Combination Tones, Distortion,

Echoes

• Demo 32. Primary and secondary beats (62; 1:32)

• Demo 35. Effect of echoes (70; 1:47)

Basics of auditory perception (cont.)

Parts I-II 56

Auditory scene analysis (ASA)

• Listeners are capable of parsing an acoustic scene (a

sound mixture) to form a mental representation of each

sound source – stream – in the perceptual process of

auditory scene analysis (Bregman’90)

• From events to streams

• Two conceptual processes of ASA:

• Segmentation. Decompose the acoustic mixture into sensory

elements (segments)

• Grouping. Combine segments into streams, so that segments in the

same stream originate from the same source

Parts I-II 57

Simultaneous organization

Simultaneous organization groups sound components

that overlap in time. ASA cues for simultaneous

organization

• Proximity in frequency (spectral proximity)

• Common periodicity

• Harmonicity

• Fine temporal structure

• Common spatial location

• Common onset (and to a lesser degree, common offset)

• Common temporal modulation

• Amplitude modulation (AM)

• Frequency modulation (FM)

Parts I-II 58

Sequential organization

Sequential organization groups sound components across

time. ASA cues for sequential organization

• Proximity in time and frequency

• Temporal and spectral continuity

• Common spatial location; more generally, spatial continuity

• Smooth pitch contour

• Smooth formant transition?

• Rhythmic structure

• Rhythmic attention theory (Large & Jones’99)

Parts I-II 59

ASA demos (Bregman and Ahad)

• Simultaneous organization

• Demo 19. Spectral fusion based on common frequency change (Track 19; 1:00)

• Sequential organization

• Demo 1. Stream segregation in a cycle of six tones (1; 0:47)

• Demo 7. Streaming in African xylophone music (7; 1:30)

– Notes chosen from pentatonic scale

• Demo 11. Stream segregation of vowels and diphthongs (11; 1:17)

• Demo 14. Stream segregation of high and low bands of noise (14; 0:44)

Parts I-II 60

Primitive versus schema-based organization

The grouping process involves two aspects:

Primitive grouping. Innate data-driven mechanisms,

consistent with those described by Gestalt psychologists for

visual perception (proximity, similarity, common fate, good

continuation, etc.)

It is domain-general, and exploits intrinsic structure of

environmental sound

Grouping cues described earlier are primitive in nature

Schema-driven grouping. Learned knowledge about

speech, music and other environmental sounds

Model-based or top-down

It is domain-specific, e.g. organization of speech sounds into

syllables

Parts I-II 61

Organisation in speech: Broadband spectrogram

offset

synchrony

onset

synchrony

common

AM

continuity

“… pure pleasure … ”

harmonicity

Parts I-II 62

Organisation in speech: Narrowband spectrogram

offset

synchrony

onset

synchrony

continuity

“… pure pleasure … ”

harmonicity

Parts I-II 63

Real-world audition

What?

• Source type

• Speech

message

speaker

age, gender, linguistic origin, mood, …

• Music

• Car passing by

Where?

• Left, right, up, down

• How close?

Channel characteristics

Environment characteristics

• Room configuration

• Ambient noise

Parts I-II 64

Sources of intrusion and distortion

additive noise from

other sound sources

reverberation from

surface reflections

Parts I-II 65

Cocktail party problem

• Term coined by Cherry

• “One of our most important faculties is our ability to listen to, and

follow, one speaker in the presence of others. This is such a common

experience that we may take it for granted; we may call it ‘the

cocktail party problem’…” (Cherry’57)

• “For ‘cocktail party’-like situations… when all voices are equally

loud, speech remains intelligible for normal-hearing listeners even

when there are as many as six interfering talkers” (Bronkhorst &

Plomp’92)

Ball-room problem by Helmholtz

“Complicated beyond conception” (Helmholtz, 1863)

• Speech segregation problem

Parts I-II 66

Listener’s performance

Speech Reception

Threshold (SRT)

• The speech-to-noise ratio

needed for 50% intelligibility

• Each 1 dB gain in SRT

corresponds to about 10%

increase in intelligibility

(Miller et al.’51) dependent

upon materials

Source: Steeneken (1992)

Parts I-II 67

Effects of types of competing source

Source: Wang & Brown (2006)

SRT

Difference

(23 dB!)

Parts I-II 68

• Overall SNR is set to 0 dB

• Noise-Noise: pink, white, pink+white

• Speech-Speech:

• Noise-Tone:

• Noise-Speech:

• Tone-Speech:

Demo of effects of sound types on separation
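Setting the overall SNR of a mixture amounts to scaling the interfering sound so that the ratio of average powers matches the target value. The sketch below shows one way to do this; the function name, the 440 Hz tone, and the white-noise interference are hypothetical stand-ins and are not the demo material itself.

```python
import numpy as np

def mix_at_snr(target, interference, snr_db):
    """Scale the interference so that 10 log10(P_target / P_interference) = snr_db."""
    target = np.asarray(target, dtype=float)
    interference = np.asarray(interference, dtype=float)
    target_power = np.mean(target ** 2)
    interference_power = np.mean(interference ** 2)
    gain = np.sqrt(target_power / (interference_power * 10 ** (snr_db / 10)))
    scaled = gain * interference
    return target + scaled, scaled

rng = np.random.default_rng(0)
t = np.arange(8000) / 8000
tone = np.sin(2 * np.pi * 440 * t)            # stand-in target sound
noise = rng.standard_normal(len(t))           # stand-in interference (white noise)

mixture, scaled_noise = mix_at_snr(tone, noise, snr_db=0.0)
measured_snr_db = 10 * np.log10(np.mean(tone ** 2) / np.mean(scaled_noise ** 2))
```

At 0 dB the target and the scaled interference have equal average power, so any difference in how easily they are separated reflects the sound types rather than their relative level.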

Parts I-II 69

Phonemic restoration

• Demonstrations by Richard Warren and James Bashford

• Demo 1. Homophonic temporal induction: Broadband noise (Track 2-3; 1:27)

• Demo 2. Temporal induction of speech (10-15; 4:37)

Parts I-II 70

Bregman figure

Pattern Completion

Parts I-II 71

Location

Source: Bronkhorst

& Plomp (1992)

SRT gain

Parts I-II 72

Binaural versus 3D presentation

Source: Drullman & Bronkhorst (2000)

Parts I-II 73

Summary of Part II

• A whirlwind tour of auditory physiology

• Basic phenomena in auditory perception

• ASA cues essentially reflect structural coherence of a

sound source. A subset of cues believed to be strongly

involved in ASA:

• Simultaneous organization: Periodicity, temporal modulation, onset

• Sequential organization: Location, pitch contour and other source

characteristics (e.g. vocal tract)

• Everyday audition has to contend with additive noise,

reverberation and channel distortions
