Parts I-II 1
Computational Audition
DeLiang Wang
Perception and Neurodynamics Lab
Ohio State University
http://www.cse.ohio-state.edu/pnl
Parts I-II 2
Human brain
Parts I-II 3
Three levels of analysis
Marrian framework for understanding complex
information processing systems (Marr’82)
• Computational theory
  • Goals of computation, appropriateness of the goal, general strategies
• Representation/Algorithm
  • How to represent the input and the output
  • Algorithms for transforming from one representation to another
• Implementation
  • How can the representation and algorithm be realized physically (architecture, hardware)?
Parts I-II 4
What is the goal of audition?
What is the goal of perception?
The perceptual systems are ways of seeking and extracting
information about the environment from sensory input (Gibson’66)
The purpose of vision is to produce a visual description of the
environment for the viewer (Marr’82)
By analogy, the purpose of audition is to produce an
auditory description of the environment for the listener
The description is unique to the listener, even though the physical
environment can be common for different listeners
Parts I-II 5
Outline of short course
Part I: Acoustics and signal processing background
Part II: Physiological and perceptual basis of audition
Part III: Fundamentals of computational audition
Part IV: Computational auditory scene analysis
Part V: Pattern recognition and learning
Discussion and conclusion
Parts I-II 6
Computational Audition
Part I: Acoustics and Signal Processing
Background
DeLiang Wang
Perception and Neurodynamics Lab
Ohio State University
http://www.cse.ohio-state.edu/pnl
Parts I-II 7
Outline of Part I
Basic concepts in acoustics
Amplitude, phase, frequency/period
Power, intensity, decibels
Spectrum, formant, envelope
Pitch, loudness, timbre
Basic signal processing
Cepstrum
Linear prediction coding (LPC)
Parts I-II 8
Basics of sound
• Mathematics of the pure tone
$$x(t) = A \sin(2\pi t / T + \varphi)$$
or
$$x(t) = A \sin(2\pi f t + \varphi)$$
• A: amplitude
• φ: phase
• T: period
• f: frequency
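As a minimal sketch of the pure-tone formula x(t) = A sin(2πft + φ), the tone can be sampled on a uniform time grid. The function name, the sample rate of 8 kHz, and the 100 Hz test frequency are illustrative assumptions, not values from the slides.

```python
import math

def pure_tone(amplitude, freq_hz, phase_rad, duration_s, sample_rate_hz):
    """Sample x(t) = A sin(2*pi*f*t + phi) at times t = k / sample_rate_hz."""
    n_samples = int(round(duration_s * sample_rate_hz))
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * k / sample_rate_hz + phase_rad)
        for k in range(n_samples)
    ]

# Assumed example: a 100 Hz tone sampled at 8 kHz for 20 ms.
# Its period T = 1/f = 10 ms corresponds to 80 samples.
tone = pure_tone(amplitude=1.0, freq_hz=100.0, phase_rad=0.0,
                 duration_s=0.02, sample_rate_hz=8000)
period_samples = 8000 // 100
```

The waveform repeats every `period_samples` samples, which is the discrete counterpart of x(t + T) = x(t).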
Parts I-II 9
Phase lead – phase lag
Parts I-II 10
Power, intensity, and decibels
• Treat signal x(t) as voltage
• By Ohm's law, the current i(t) = x(t)/R
• Then the instantaneous power is
$$P(t) = x(t)\, i(t) = x^2(t) / R$$
• Energy is the integrated power over a certain time period (e.g. kilowatt vs. kilowatt-hour)
Parts I-II 11
Power, intensity, and decibels (cont.)
• Treat signal x(t) as sound pressure
• Then the instantaneous intensity is
$$I(t) = x^2(t) / (\rho c)$$
• I(t) is measured in watts/m² (x(t): pressure, in newtons/m² or pascals)
• ρ: the density of the medium
• c: speed of sound
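A small numerical sketch of the intensity relation I(t) = x²(t)/(ρc) follows. The default density (1.2 kg/m³) and speed of sound (343 m/s) are assumed round values for air at room temperature and are not taken from the slides.

```python
def instantaneous_intensity(pressure_pa, density_kg_m3=1.2, speed_m_s=343.0):
    """Return I = x^2 / (rho * c) in W/m^2 for a sound pressure x in pascals.

    The default rho and c are assumed approximate values for air.
    """
    return pressure_pa ** 2 / (density_kg_m3 * speed_m_s)

intensity_1pa = instantaneous_intensity(1.0)
intensity_2pa = instantaneous_intensity(2.0)
```

Because intensity depends on the square of pressure, doubling the pressure quadruples the intensity.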
Parts I-II 12
Sound level
• Ratio of one sound to another (baseline), expressed in decibels (dB)
$$L_2 - L_1 \text{ (decibels)} = 10 \log_{10}(I_2 / I_1)$$
• Note the use of the common (base-10) logarithm
• Doubling the intensity adds about 3 dB, and doubling the amplitude adds about 6 dB, since intensity is proportional to amplitude squared
• SNR: signal-to-noise ratio
• Conversational speech is about 65 dB. Levels above 100 dB are damaging to the ear
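The 3 dB and 6 dB rules of thumb follow directly from the level formula 10 log10(I2/I1). The sketch below checks them; the function names are illustrative, and the amplitude version assumes intensity is proportional to amplitude squared.

```python
import math

def level_difference_db(intensity_2, intensity_1):
    """L2 - L1 in decibels for two intensities."""
    return 10.0 * math.log10(intensity_2 / intensity_1)

def level_difference_db_from_amplitude(amplitude_2, amplitude_1):
    """Same level difference from amplitudes, assuming I is proportional to A^2."""
    return 20.0 * math.log10(amplitude_2 / amplitude_1)

double_intensity_db = level_difference_db(2.0, 1.0)
double_amplitude_db = level_difference_db_from_amplitude(2.0, 1.0)
```

The same `level_difference_db` function gives an SNR in dB when its arguments are the signal and noise intensities.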
Parts I-II 13
How loud are sounds?
Source: http://www.handsandvoices.org/resource_guide/055_audiogram.html
Parts I-II 14
Spectrum
• Fourier Series: For any periodic function of time, x(t), with period T, i.e.
x(t+mT) = x(t), for all integer m
• x(t) can be represented as a Fourier series:
$$x(t) = A_0 + \sum_{n=1}^{\infty} [A_n \cos(\omega_n t) + B_n \sin(\omega_n t)]$$
• Furthermore,
$$\omega_n = n\omega_0 = 2\pi n / T$$
• n is an integer and ω_0 is the fundamental frequency
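To make the Fourier series concrete, the coefficients A_n and B_n can be estimated numerically over one period using the standard projection formulas A_n = (2/T)∫x(t)cos(ω_n t)dt and B_n = (2/T)∫x(t)sin(ω_n t)dt. These formulas and the square-wave test signal are assumptions added for illustration; the sketch uses a simple midpoint sum rather than any particular library routine.

```python
import math

def fourier_coefficients(x, period, n, num_points=10000):
    """Estimate (A_n, B_n) for n >= 1 with a midpoint Riemann sum over one period."""
    omega_n = 2.0 * math.pi * n / period
    dt = period / num_points
    a_n = 0.0
    b_n = 0.0
    for k in range(num_points):
        t = (k + 0.5) * dt
        a_n += x(t) * math.cos(omega_n * t) * dt
        b_n += x(t) * math.sin(omega_n * t) * dt
    return 2.0 * a_n / period, 2.0 * b_n / period

def square_wave(t, period=1.0):
    """Illustrative periodic signal: +1 in the first half period, -1 in the second."""
    return 1.0 if (t % period) < period / 2.0 else -1.0

a1, b1 = fourier_coefficients(square_wave, 1.0, 1)
a2, b2 = fourier_coefficients(square_wave, 1.0, 2)
a3, b3 = fourier_coefficients(square_wave, 1.0, 3)
```

For this square wave only the odd sine terms survive, with B_n = 4/(nπ), so the spectrum consists of components at odd multiples of the fundamental frequency.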
Parts I-II 15
Spectrum (cont.)
• "The multiplicity of vibrational forms which can be thus produced by the composition of simple pendular vibrations is not merely extraordinarily great; it is so great that it cannot be greater." (H. Helmholtz, 1863)
Parts I-II 16
Waveform illustration
A periodic waveform
Parts I-II 17
Spectrum illustration
Parts I-II 18
Spectrum illustration
Parts I-II 19
Spectrum illustration
Parts I-II 20
Fourier transform
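• For any function of time, x(t), the Fourier transform X(ω) of x(t) is defined in terms of the Fourier integral:
$$X(\omega) = \int_{-\infty}^{\infty} e^{-i\omega t}\, x(t)\, dt$$
• The Fourier transform converts a function of time to a function of frequency
• Inverse Fourier transform
$$x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i\omega t}\, X(\omega)\, d\omega$$
$$e^{i\omega t} = \cos(\omega t) + i \sin(\omega t)$$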
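For sampled signals the Fourier integral is usually replaced by the discrete Fourier transform (DFT). The sketch below is a direct, unoptimized DFT and its inverse, using the forward kernel e^{-iωt} and the inverse kernel e^{+iωt}. It is a stand-in for the continuous transform, not the method of the slides, and the 32-sample test tone is an assumed example.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform with kernel exp(-i*2*pi*k*n/N)."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]

def idft(spectrum):
    """Inverse DFT with kernel exp(+i*2*pi*k*n/N) and a 1/N normalization."""
    n_samples = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / n_samples)
            for k in range(n_samples)) / n_samples
        for n in range(n_samples)
    ]

# Assumed example: a sinusoid completing exactly 4 cycles in 32 samples.
signal = [math.sin(2.0 * math.pi * 4 * n / 32) for n in range(32)]
spectrum = dft(signal)
magnitudes = [abs(c) for c in spectrum]
peak_bin = max(range(17), key=lambda k: magnitudes[k])
reconstructed = idft(spectrum)
```

The magnitude spectrum peaks at bin 4, matching the number of cycles in the analysis window, and the inverse transform recovers the time signal.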
Parts I-II 21
Pitch
• Definition: "that attribute of auditory sensation in terms
of which sounds may be ordered on a musical scale"
(American Standards Association, 1960)
• Pitch is related to the repetition rate of the waveform:
• For pure tone, pitch corresponds to its frequency
• For a periodic complex tone, pitch corresponds to its fundamental
frequency
Parts I-II 22
Envelope and formant
• Envelope: Amplitude variation (modulation)
• Formant: A resonance in the vocal tract which is usually manifested as a peak in the spectral envelope of a speech sound
Parts I-II 23
Formant illustration
Parts I-II 24
Three characteristics of sound sensation
• Pitch (fundamental frequency)
• Loudness (intensity)
• Timbre (quality - spectral envelope)
Parts I-II 25
Signal representations: Cepstrum
• Source-filter separation for sound production
• For speech, source corresponds to excitation by a pulse train for voiced phonemes and to turbulence (noise) for unvoiced phonemes, and filter corresponds to vocal tract (resonators)
• For music, source corresponds to vibrations (e.g. vibrating strings in plucked or bowed string instrument) and filter corresponds to the body of the instrument
• Overall signal reaching the ear is the convolution of source with the impulse response of filter
• Cepstral analysis attempts to separate source from filter, hence it can be viewed as deconvolution
$$y(t) = x(t) * h(t) = h(t) * x(t) = \int x(\tau)\, h(t - \tau)\, d\tau$$
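The source-filter view says the observed signal is the convolution of the source with the filter's impulse response. A discrete-time sketch of that convolution is given below; the toy pulse train and the three-tap impulse response are made-up values chosen only for illustration.

```python
def convolve(x, h):
    """Discrete linear convolution y[n] = sum_k x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, x_i in enumerate(x):
        for j, h_j in enumerate(h):
            y[i + j] += x_i * h_j
    return y

# Toy source (a pulse every 3 samples) and a short decaying impulse response.
source = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
impulse_response = [1.0, 0.5, 0.25]
output = convolve(source, impulse_response)
```

Each source pulse launches a copy of the impulse response, and swapping the two arguments gives the same output, reflecting x(t) * h(t) = h(t) * x(t).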
Parts I-II 26
Speech production illustration
Parts I-II 27
Real cepstrum
• For speech, the spectral magnitude can be written as
$$|X(\omega)| = |V(\omega)|\, |E(\omega)|$$
where V is the vocal tract (filter) term and E is the excitation (source) term
• Taking the logarithm yields
$$\log |X(\omega)| = \log |V(\omega)| + \log |E(\omega)|$$
• Observation for speech production
• The E term corresponds to an event (e.g. a pulse train with a frequency of 100 Hz) more extended in time than the impulse response of the vocal tract. Analogously, E corresponds to “carrier” and V corresponds to “envelope” in the frequency domain. In other words, E varies more quickly with respect to ω than V
• Hence, one can apply some kind of “filter” to separate “high-frequency” components from “low-frequency” components, and thus the E term from the V term
Parts I-II 28
Real cepstrum (cont.)
• Change of notations because the variable is frequency
rather than time
• Filtering -> liftering
• Frequency response -> quefrency response
• Spectrum -> cepstrum
• High (low) frequency components -> high (low) time components or
high (low) quefrency components
Parts I-II 29
Real cepstrum (cont.)
• The log-operation converts a multiplicative term into an additive term, which can be operated upon by a linear operation such as filtering. The cepstrum is defined as the inverse Fourier transform of the log spectral magnitude:
$$c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{i\omega n} \log |X(\omega)|\, d\omega$$
• c(n) is called the nth cepstral coefficient
• Given separated cepstra for excitation and vocal tract, they can be inverted to give original spectral magnitudes
• Only a moderate number of cepstral coefficients (e.g. 10-14) is needed for many applications, including speech recognition
• Complex cepstrum exists as well
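The key property of the real cepstrum is that convolution in time becomes addition in the cepstral domain. The sketch below checks this with a discrete approximation: the integral is replaced by an inverse DFT of the log magnitude spectrum. The short "source" and "filter" sequences are toy values, not speech, and were chosen so that their spectra never reach zero.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]

def real_cepstrum(x):
    """Inverse DFT of log|X(k)|, a discrete stand-in for the cepstrum integral."""
    n_samples = len(x)
    log_magnitude = [math.log(abs(X_k)) for X_k in dft(x)]
    return [
        sum(log_magnitude[k] * cmath.exp(2j * cmath.pi * k * n / n_samples)
            for k in range(n_samples)).real / n_samples
        for n in range(n_samples)
    ]

def convolve_padded(x, h, length):
    """Linear convolution written into a zero-padded buffer of the given length."""
    y = [0.0] * length
    for i, x_i in enumerate(x):
        for j, h_j in enumerate(h):
            y[i + j] += x_i * h_j
    return y

def zero_pad(x, length):
    return list(x) + [0.0] * (length - len(x))

N = 16
source = [1.0, 0.5]          # toy excitation
filt = [1.0, -0.3, 0.2]      # toy filter impulse response
c_source = real_cepstrum(zero_pad(source, N))
c_filter = real_cepstrum(zero_pad(filt, N))
c_output = real_cepstrum(convolve_padded(source, filt, N))
```

Here `c_output` equals `c_source + c_filter` coefficient by coefficient, which is what makes liftering (keeping only low or high quefrencies) a plausible way to pull the two terms apart.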
Parts I-II 30
Cepstral analysis illustration
Parts I-II 31
LPC for speech modeling
• The vocal tract can be modeled as a cascaded set of acoustic tubes, each corresponding to a resonator
• Furthermore, each resonator corresponds to a formant
• Complete vowel spectrum can be reasonably represented by six resonators
• A direct implementation of the spectral model is written as an all-pole filter in the complex z domain (z-transform is the discrete-time counterpart of the Laplace transform - generalized form of the Fourier transform):
$$H(z) = \frac{1}{1 - \sum_{j=1}^{P} a_j z^{-j}}$$
• P is twice the number of resonators, and the a_j are coefficients
Parts I-II 32
LPC illustration
Parts I-II 33
LPC (cont.)
• In the above system, the discrete-time response y(n) to the excitation x(n) can be written as
$$y(n) = x(n) + \sum_{j=1}^{P} a_j\, y(n-j)$$
• In LPC, the coefficients are computed to give an approximation to the original signal. That is, one attempts to predict the speech signal by a linear, weighted sum of its previous values:
$$\tilde{y}(n) = \sum_{j=1}^{P} a_j\, y(n-j)$$
• $\tilde{y}(n)$ is the linear predictor of y(n)
• The coefficients that produce the best approximation are called the linear prediction coefficients
Parts I-II 34
LPC (cont.)
• The difference between the predictor and the original signal is called the error signal, residual error, LPC residual, or prediction error
$$e(n) = y(n) - \tilde{y}(n)$$
• e(n) can be viewed as an approximation to the excitation signal
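The claim that the residual approximates the excitation can be checked on a synthetic signal. The sketch below runs a known excitation through the all-pole recursion y(n) = x(n) + Σ a_j y(n-j), then computes e(n) = y(n) - Σ a_j y(n-j) with the same coefficients. The two coefficients and the pulse-train excitation are assumed toy values; with real speech the coefficients would have to be estimated, so the match would only be approximate.

```python
def all_pole_filter(x, a):
    """Synthesize y(n) = x(n) + sum_{j=1..P} a[j-1] * y(n-j)."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for j, a_j in enumerate(a, start=1):
            if n - j >= 0:
                acc += a_j * y[n - j]
        y.append(acc)
    return y

def prediction_error(y, a):
    """Residual e(n) = y(n) - sum_{j=1..P} a[j-1] * y(n-j)."""
    e = []
    for n in range(len(y)):
        predicted = sum(a_j * y[n - j]
                        for j, a_j in enumerate(a, start=1) if n - j >= 0)
        e.append(y[n] - predicted)
    return e

coefficients = [1.3, -0.8]   # assumed stable two-pole (single resonator) example
excitation = [1.0 if n % 20 == 0 else 0.0 for n in range(60)]
signal = all_pole_filter(excitation, coefficients)
residual = prediction_error(signal, coefficients)
```

With the true coefficients the residual reproduces the pulse train, which is the idealized version of the statement that e(n) approximates the excitation.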
Parts I-II 35
Residual error illustration
Parts I-II 36
LPC (cont.)
• Computing the coefficients can be viewed as an optimization problem, where square error is generally used
$$D = \sum_{n=0}^{N-1} e^2(n) = \sum_{n=0}^{N-1} \Big[ y(n) - \sum_{j=1}^{P} a_j\, y(n-j) \Big]^2$$
• Various methods can be employed to find coefficients, including gradient descent
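One common way to minimize the squared prediction error is the autocorrelation method solved with the Levinson-Durbin recursion. This is a swapped-in alternative to the gradient descent mentioned above, shown as a sketch under that assumption. The test signal is the impulse response of a known two-coefficient all-pole system (toy values), so the estimate can be compared with the truth.

```python
def impulse_response(a, length):
    """Response of y(n) = x(n) + sum_j a[j-1] * y(n-j) to a unit impulse."""
    y = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for j, a_j in enumerate(a, start=1):
            if n - j >= 0:
                acc += a_j * y[n - j]
        y.append(acc)
    return y

def autocorrelation(y, max_lag):
    """r(lag) = sum_n y(n) * y(n + lag) for lag = 0..max_lag."""
    return [sum(y[n] * y[n + lag] for n in range(len(y) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_j a_j r(|i-j|) = r(i), i = 1..order; return (a, error energy)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error              # reflection coefficient at stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error

true_a = [1.3, -0.8]                 # assumed example coefficients
y = impulse_response(true_a, 400)
estimated_a, residual_energy = levinson_durbin(autocorrelation(y, 2), 2)
```

For this synthetic case the recursion recovers the generating coefficients, and the remaining error energy equals the energy of the unit impulse that excited the system.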
Parts I-II 37
LPC (cont.)
• Properties of LPC representation
• For a harmonic signal, the LPC model spectrum tends to follow
(hug) harmonic peaks, but not harmonic valleys, hence yielding an
estimate of the envelope of the signal spectrum
• Too many coefficients will yield a good fit to signal spectrum, but
miss spectral envelope. On the other hand, too few coefficients will
miss formants. A reasonable number is between 10 and 20.
• Prediction error is significantly higher for unvoiced speech
• Compared to Fourier and cepstral analysis, LPC is more
directly related to vocal tract characteristics
Parts I-II 38
More LPC illustrations
Parts I-II 39
More LPC illustrations
Parts I-II 40
More LPC illustrations
Parts I-II 41
Spectral analysis via filterbanks
Parts I-II 42
Summary table
Parts I-II 43
Summary of Part I
• A number of important concepts in acoustics, e.g.
intensity and decibels
• Sensation of a signal is based on its acoustic characteristics,
but is not identical to them
• Advanced signal representations, e.g. cepstrum, highlight
important features
Parts I-II 44
Computational Audition
Part II: Physiological and Perceptual
Basis of Audition
DeLiang Wang
Perception and Neurodynamics Lab
Ohio State University
http://www.cse.ohio-state.edu/pnl
Parts I-II 45
Outline of Part II
Physiological basis
Psychoacoustic basis
Auditory scene analysis
Real-world audition
Parts I-II 46
Human ear
A complex mechanism for transducing pressure variations in the air to neural impulses in auditory nerve fibers
Parts I-II 47
Traveling wave
• Different frequencies of sound give rise to maximum vibrations at different places along the basilar membrane
• The frequency of vibration at a given place is equal to that of the nearest stimulus component (resonance)
• Hence, the cochlea performs a frequency analysis
Parts I-II 48
Auditory nerve response
Parts I-II 49
Cochlear filtering model
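The gammatone function approximates physiologically-recorded impulse responses. In its commonly used form:
$$g(t) = t^{n-1} e^{-2\pi b t} \cos(2\pi f_0 t + \varphi), \quad t \ge 0$$
n = filter order (typically 4)
b = bandwidth
f0 = centre frequency
φ = phase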
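As a rough sketch, the gammatone impulse response can be evaluated directly from its parameters, assuming the commonly used form g(t) = t^(n-1) exp(-2πbt) cos(2πf0t + φ) for t ≥ 0. The default bandwidth of 125 Hz and centre frequency of 1 kHz are arbitrary illustrative values.

```python
import math

def gammatone_envelope(t, order=4, bandwidth_hz=125.0):
    """Gamma-shaped envelope t^(n-1) * exp(-2*pi*b*t), zero for t < 0."""
    if t < 0.0:
        return 0.0
    return t ** (order - 1) * math.exp(-2.0 * math.pi * bandwidth_hz * t)

def gammatone_impulse_response(t, order=4, bandwidth_hz=125.0,
                               centre_hz=1000.0, phase_rad=0.0):
    """Envelope multiplied by a tone at the centre frequency (unnormalized)."""
    carrier = math.cos(2.0 * math.pi * centre_hz * t + phase_rad)
    return gammatone_envelope(t, order, bandwidth_hz) * carrier

# The envelope peaks at t = (n - 1) / (2 * pi * b); about 3.8 ms for these values.
peak_time_s = (4 - 1) / (2.0 * math.pi * 125.0)
```

A narrower bandwidth b gives a longer, later-peaking response, which is the time-domain counterpart of sharper frequency tuning.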
Parts I-II 50
Gammatone filterbank
• Each position on the basilar membrane is simulated by a single gammatone filter with appropriate centre frequency and bandwidth
• A small number of filters (e.g. 32) are generally sufficient to cover the range from 50 Hz to 8 kHz
• Note variation in bandwidth with frequency (unlike Fourier analysis)
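One way to place the filters is to space the centre frequencies uniformly on an auditory frequency scale. The sketch below assumes the ERB-rate scale of Glasberg and Moore (1990), which the slides do not name, and uses the 32 filters and the 50 Hz to 8 kHz range mentioned above.

```python
import math

def hz_to_erb_rate(freq_hz):
    """ERB-rate scale (assumed: Glasberg and Moore, 1990)."""
    return 21.4 * math.log10(4.37e-3 * freq_hz + 1.0)

def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 4.37e-3

def erb_bandwidth_hz(freq_hz):
    """Equivalent rectangular bandwidth at a given centre frequency."""
    return 24.7 * (4.37e-3 * freq_hz + 1.0)

def centre_frequencies(low_hz, high_hz, num_filters):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    low = hz_to_erb_rate(low_hz)
    high = hz_to_erb_rate(high_hz)
    step = (high - low) / (num_filters - 1)
    return [erb_rate_to_hz(low + i * step) for i in range(num_filters)]

centres = centre_frequencies(50.0, 8000.0, 32)
bandwidths = [erb_bandwidth_hz(f) for f in centres]
```

The resulting bandwidths grow with centre frequency, in contrast to the constant bin width of a Fourier analysis.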
Parts I-II 51
Response to a pure tone
• Many channels respond, but those closest to tone frequency respond most strongly (place coding)
• The interval between successive peaks also encodes the tone frequency (temporal coding)
• Note propagation delay along the membrane model
Parts I-II 52
Beyond the periphery
• The auditory system is complex with four relay stations between periphery and cortex rather than one in the visual system
• In comparison to the auditory periphery, central parts of the auditory system are less understood
• Number of neurons in the primary auditory cortex is comparable to that in the primary visual cortex despite the fact that the number of fibers in the auditory nerve is far fewer than that of the optic nerve (thousands vs. millions)
The auditory nerve
Parts I-II 53
Basics of auditory perception
Demonstrations by the Institute for Perception Research (IPO) and the Acoustical Society of America
• Section I. Frequency Analysis and Critical Bands
• Demo 1. Cancelled harmonics (Track 1; 1:33)
• Demo 2. Critical bands by masking (2-6; 1:50)
• Section II. Sound Pressure, Power, Loudness
• Demo 4. Decibel scale (8-10; 1:57)
• Demo 5. Filtered noise (12-15; 1:50)
Parts I-II 54
Basics of auditory perception (cont.)
• Section IV. Pitch
• Demo 12. Dependence of pitch on intensity (27-28; 0:48)
• Demo 19. Pitch streaming (36; 1:22)
• Demo 26. Scales with repetition pitch (49-51; 1:25)
• Demo 27. Circularity in pitch judgement, or Shepard scale illusion (52; 1:20)
Parts I-II 55
Basics of auditory perception (cont.)
• Section V. Timbre
• Demo 28. Effect of spectrum on timbre (53; 1:17)
• Section VI. Beats, Combination Tones, Distortion, Echoes
• Demo 32. Primary and secondary beats (62; 1:32)
• Demo 35. Effect of echoes (70; 1:47)
Parts I-II 56
Auditory scene analysis (ASA)
• Listeners are capable of parsing an acoustic scene (a
sound mixture) to form a mental representation of each
sound source – stream – in the perceptual process of
auditory scene analysis (Bregman’90)
• From events to streams
• Two conceptual processes of ASA:
• Segmentation. Decompose the acoustic mixture into sensory
elements (segments)
• Grouping. Combine segments into streams, so that segments in the
same stream originate from the same source
Parts I-II 57
Simultaneous organization
Simultaneous organization groups sound components
that overlap in time. ASA cues for simultaneous
organization
• Proximity in frequency (spectral proximity)
• Common periodicity
• Harmonicity
• Fine temporal structure
• Common spatial location
• Common onset (and to a lesser degree, common offset)
• Common temporal modulation
• Amplitude modulation (AM)
• Frequency modulation (FM)
Parts I-II 58
Sequential organization
Sequential organization groups sound components across
time. ASA cues for sequential organization
• Proximity in time and frequency
• Temporal and spectral continuity
• Common spatial location; more generally, spatial continuity
• Smooth pitch contour
• Smooth formant transition?
• Rhythmic structure
• Rhythmic attention theory (Large & Jones’99)
Parts I-II 59
ASA demos (Bregman and Ahad)
• Simultaneous organization
• Demo 19. Spectral fusion based on common frequency change (Track 19; 1:00)
• Sequential organization
• Demo 1. Stream segregation in a cycle of six tones (1; 0:47)
• Demo 7. Streaming in African xylophone music (7; 1:30)
– Notes chosen from pentatonic scale
• Demo 11. Stream segregation of vowels and diphthongs (11; 1:17)
• Demo 14. Stream segregation of high and low bands of noise (14; 0:44)
Parts I-II 60
Primitive versus schema-based organization
The grouping process involves two aspects:
• Primitive grouping. Innate data-driven mechanisms, consistent with those described by Gestalt psychologists for visual perception (proximity, similarity, common fate, good continuation, etc.)
  • It is domain-general, and exploits intrinsic structure of environmental sound
  • Grouping cues described earlier are primitive in nature
• Schema-driven grouping. Learned knowledge about speech, music and other environmental sounds
  • Model-based or top-down
  • It is domain-specific, e.g. organization of speech sounds into syllables
Parts I-II 61
Organisation in speech: Broadband spectrogram
[Figure: broadband spectrogram of the utterance “… pure pleasure …”, annotated with the grouping cues offset synchrony, onset synchrony, common AM, continuity, and harmonicity]
Organisation in speech: Narrowband spectrogram
[Figure: narrowband spectrogram of the utterance “… pure pleasure …”, annotated with the grouping cues offset synchrony, onset synchrony, continuity, and harmonicity]
Parts I-II 63
Real-world audition
What?
• Source type
  • Speech: message; speaker (age, gender, linguistic origin, mood, …)
  • Music
  • Car passing by
Where?
• Left, right, up, down
• How close?
Channel characteristics
Environment characteristics
• Room configuration
• Ambient noise
Parts I-II 64
Sources of intrusion and distortion
• Additive noise from other sound sources
• Reverberation from surface reflections
Parts I-II 65
Cocktail party problem
• Term coined by Cherry
• “One of our most important faculties is our ability to listen to, and
follow, one speaker in the presence of others. This is such a common
experience that we may take it for granted; we may call it ‘the
cocktail party problem’…” (Cherry’57)
• “For ‘cocktail party’-like situations… when all voices are equally
loud, speech remains intelligible for normal-hearing listeners even
when there are as many as six interfering talkers” (Bronkhorst &
Plomp’92)
• Ball-room problem by Helmholtz
• “Complicated beyond conception” (Helmholtz, 1863)
• Speech segregation problem
Parts I-II 66
Listener’s performance
Speech Reception Threshold (SRT)
• The speech-to-noise ratio needed for 50% intelligibility
• Each 1 dB gain in SRT corresponds to about a 10% increase in intelligibility (Miller et al.’51), depending on the speech materials
Source: Steeneken (1992)
Parts I-II 67
Effects of types of competing source
[Figure: SRT for different types of competing source, annotated with an SRT difference of 23 dB]
Source: Wang & Brown (2006)
Parts I-II 68
Demo of effects of sound types on separation
• Overall SNR is set to 0 dB
• Noise-Noise: pink, white, pink+white
• Speech-Speech:
• Noise-Tone:
• Noise-Speech:
• Tone-Speech:
Parts I-II 69
Phonemic restoration
• Demonstrations by Richard Warren and James Bashford
• Demo 1. Homophonic temporal induction: Broadband noise (Track 2-3; 1:27)
• Demo 2. Temporal induction of speech (10-15; 4:37)
Parts I-II 70
Bregman figure
Pattern Completion
Parts I-II 71
Location
Source: Bronkhorst & Plomp (1992)
SRT gain
Parts I-II 72
Binaural versus 3D presentation
Source: Drullman & Bronkhorst (2000)
Parts I-II 73
Summary of Part II
• A whirlwind tour of auditory physiology
• Basic phenomena in auditory perception
• ASA cues essentially reflect structural coherence of a
sound source. A subset of cues believed to be strongly
involved in ASA:
• Simultaneous organization: Periodicity, temporal modulation, onset
• Sequential organization: Location, pitch contour and other source
characteristics (e.g. vocal tract)
• Everyday audition has to contend with additive noise,
reverberation and channel distortions