SGN-24006 Representations
Mid-level representations for audio content analysis
*Slides for this lecture were created by Anssi Klapuri
Sources:
- Ellis, Rosenthal, "Mid-level representations for Computational Auditory Scene Analysis", IJCAI, 1995.
- Schörkhuber et al., "A Matlab Toolbox for Efficient Perfect Reconstruction Time-Frequency Transforms with Log-Frequency Resolution", AES, 2014.
- Virtanen, "Audio Signal Modeling with Sinusoids plus Noise", MSc thesis, TUT, 2001.
Contents
- Introduction
- Desirable properties of mid-level representations
- STFT spectrogram
- Constant-Q transform
- Sinusoids plus noise
- Perceptually-motivated representations
1 Introduction
The concept of mid-level data representations is useful in characterizing signal analysis systems. The analysis process can be viewed as a sequence of representations from an acoustic signal (low level) towards the analysis result (high level). Usually intermediate abstraction levels are needed between the two, since high-level information is not readily visible in the raw acoustic signal. An appropriate mid-level representation functions as an interface for further analysis and facilitates the design of efficient algorithms.
Desirable properties of mid-level representations

It is natural to ask whether a certain mid-level representation is better than others in a given task. Ellis and Rosenthal list several desirable qualities for a mid-level representation:
- Component reduction: the number of objects in the representation is smaller, and the meaningfulness of each is higher, compared to the individual samples of the input signal.
- The sound should be decomposed into sufficiently fine-grained elements so as to support sound source separation by grouping the elements to their sound sources.
- Invertibility: the original sound can be resynthesized from its representation in a perceptually accurate way.
- Psychoacoustic plausibility of the representation.
Categorization of some mid-level representations

Ellis and Rosenthal classify representations according to three conceptual axes:
- the choice between fixed and variable bandwidth of the initial frequency analysis
- discreteness: is the representation structured as meaningful chunks?
- dimensionality of the transform (some possess extra dimensions)
Figure: [Ellis-Rosenthal]
2 Complex-valued STFT spectrogram

STFT = short-time Fourier transform. The time-domain signal x(n) is transformed into the time-frequency domain by applying the discrete Fourier transform (DFT) in successive time frames:
- complex spectra: all information is preserved
- the amount of data remains the same

To some extent, the STFT fulfills the criteria of:
- supporting sound source separation (sources overlap less in the time-frequency domain than in the time domain)
- invertibility: the original sound can be perfectly reconstructed
- psychoacoustic plausibility: frequency analysis (albeit in a different form) happens in the auditory system too
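The framewise DFT described above can be sketched in a few lines of numpy. This is a minimal illustration, not a production STFT; the frame length, hop size, and the 440 Hz test tone are assumed example values:

```python
import numpy as np

def stft(x, frame_len=1024, hop=512):
    """Complex STFT: Hann-windowed frames, one DFT per frame (minimal sketch)."""
    w = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = [x[m * hop : m * hop + frame_len] * w for m in range(n_frames)]
    return np.fft.rfft(np.stack(frames), axis=1)  # rows = frames, cols = freq bins

# A 440 Hz tone sampled at 16 kHz concentrates near bin 440/16000*1024 = 28.16
fs = 16000
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 440 * t))
peak_bin = int(np.abs(X[0]).argmax())
```

The rows of X form the spectrogram: a complex spectrum per frame, from which the magnitude spectrogram is obtained with `np.abs`.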
STFT spectrogram

Figure: example signal (music); top: time domain (zoomed in), middle: time domain, bottom: STFT spectrogram (magnitudes).
Spectrum estimation

The spectrum of audio signals is typically estimated in short consecutive segments, frames. Why?
- the Fourier transform models the signal with stationary sinusoids (constant spectrum)
- real audio signals are not stationary but vary through time
- framewise processing assumes the signal is time-invariant within short enough frames

For audio signals, the frame length typically varies between 10 ms and 100 ms, depending on the application:
- for speech signals, often 25 ms

Transient-like sounds are difficult to represent and process in the frequency domain:
- time blurring (but see the constant-Q transform later...)
Windowing

Windowing is essential in frame-wise processing:
- weight the signal with a window function w(n) prior to the transform
- as a rule of thumb, windowing is always needed: one cannot just take a short part of a signal without windowing

signal in frame m: x_m(n), n = 0, ..., N-1
windowed signal: x_m(n) w(n)
short-time spectrum: X_m(k) = \sum_{n=0}^{N-1} x_m(n) w(n) W^{kn}, where W = e^{-j 2\pi / N}
Windowing

Example: spectrum of a sinusoid with/without windowing
1. No windowing (= rectangular window), sinusoid at a spectral bin
2. No windowing, random off-bin frequency: spectral blurring!
3. Hanning window, sinusoid at a spectral bin
4. Hanning window, random off-bin frequency: ok

There are different types of windows, but the most important thing is not to forget windowing altogether.
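The off-bin blurring in case 2 above can be verified numerically. The following sketch compares the leakage far away from the spectral peak for a rectangular versus a Hann window; the frame length, sample rate, and the off-bin test frequency are assumed values:

```python
import numpy as np

N, fs = 1024, 8000
n = np.arange(N)
f = 440.7  # off-bin frequency: not an integer multiple of fs/N
x = np.sin(2 * np.pi * f * n / fs)

rect = np.abs(np.fft.rfft(x))                   # no window = rectangular window
hann = np.abs(np.fft.rfft(x * np.hanning(N)))   # Hann-windowed

def leakage_db(spec, peak_dist=50):
    """Worst leakage at least peak_dist bins from the peak, relative to it."""
    k = spec.argmax()
    far = np.concatenate([spec[:max(k - peak_dist, 0)], spec[k + peak_dist:]])
    return 20 * np.log10(far.max() / spec.max())
```

With the Hann window, the far-field leakage drops by tens of decibels, which is why off-bin sinusoids stay sharp in the windowed spectrum.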
Windowing in framewise processing

Figure: Hanning windows (Hamming works too)
- adjacent windows sum to unity when frames overlap 50%
- all parts of the signal get an equal weight

In each frame, the signal is weighted with the window function and the short-time discrete Fourier transform is calculated. This yields a spectrogram: a complex spectrum in each frame over time.

Figure: framewise processing (frequency vs. time, in ms).
Windowing in analysis-synthesis systems

The sine window is useful in analysis-synthesis systems (see Figure). Windowing is done again at resynthesis to avoid artefacts at frame boundaries in case the signal is manipulated in the frequency domain.

Figure below: 50% frame overlap leads to perfect reconstruction if nothing is done at the subbands. Signal flow: windowing, DFT (signal in one frame), processing at subbands (frequency domain), inverse DFT, windowing, output (overlap-add frames).
Reconstructing the time-domain signal: overlap-add technique

Reconstructing a signal from its spectrogram:
1. inverse Fourier transform the spectrum of each frame back to the time domain
2. apply windowing in each frame (e.g. sine or Hanning window)
3. position successive frames to overlap 50% or more, and sum sample-by-sample
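The three steps above can be sketched with a sine window at 50% overlap, where sine-window analysis followed by sine-window synthesis gives perfect reconstruction in the interior of the signal (the squared sine windows sum to unity). The frame length and hop are assumed example values:

```python
import numpy as np

def sine_window(frame_len):
    return np.sin(np.pi * (np.arange(frame_len) + 0.5) / frame_len)

def analyze(x, frame_len=512, hop=256):
    """Sine-windowed framewise spectra of x."""
    w = sine_window(frame_len)
    return [np.fft.rfft(x[m * hop : m * hop + frame_len] * w)
            for m in range((len(x) - frame_len) // hop + 1)]

def ola_reconstruct(spectra, frame_len=512, hop=256):
    """Overlap-add: inverse-DFT each frame, window again with the sine
    window, and sum the 50%-overlapping frames sample by sample."""
    w = sine_window(frame_len)
    out = np.zeros(hop * (len(spectra) - 1) + frame_len)
    for m, S in enumerate(spectra):
        out[m * hop : m * hop + frame_len] += np.fft.irfft(S, frame_len) * w
    return out

x = np.random.randn(4096)
y = ola_reconstruct(analyze(x))
# interior samples are reconstructed perfectly; the edges lack full overlap
err = np.max(np.abs(y[512:3584] - x[512:3584]))
```

If the subband spectra are modified before resynthesis, the second windowing suppresses the discontinuities that would otherwise appear at frame boundaries.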
3 Constant-Q transform (CQT)

A time-frequency representation where the frequency bins are uniformly distributed on a log-frequency axis and their Q-factors (ratios of center frequencies to bandwidths) are all equal. In effect, this means that the frequency resolution is better at low frequencies and the time resolution is better at high frequencies. Musically and perceptually motivated:
- the frequency resolution of the inner ear is approximately constant-Q above 500 Hz
- in music (equal temperament), note frequencies obey F_k = 440 Hz * 2^{k/12} (piano notes)
Constant-Q transform (CQT)

Mathematical definition:

X^{CQT}(k, n) = \sum_{m=0}^{N-1} x(m) g_k(m - n) e^{-j 2\pi m f_k / f_s}

where N is the length of the input signal, g_k(m) is a zero-centered window function that picks one time frame of the signal at point n, and f_s is the sample rate.

Compare the CQT with the short-time Fourier transform (STFT):

X(k, n) = \sum_{m=0}^{N-1} x(m) h(m - n) e^{-j 2\pi m k / N}

where now the window function h(m) is the same for all frequency bins. In the CQT, to achieve constant Q-factors, the support of the window g_k (the time span of significant non-zero values) is inversely proportional to f_k.

In the CQT, the center frequencies are geometrically spaced:

f_k = f_0 \cdot 2^{k/B}

where B determines the number of bins per octave and f_0 is the lowest bin. In the DFT, the center frequencies are linearly spaced:

f_k = k f_s / N
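The geometric spacing and the constant Q-factor can be checked directly. In this sketch B = 12 bins per octave and a lowest bin of 55 Hz are assumed example values; the bandwidth is taken as the spacing to the next bin, which makes every Q identical:

```python
import numpy as np

B, f0 = 12, 55.0                 # bins per octave, lowest bin (assumed values)
k = np.arange(48)
fk = f0 * 2.0 ** (k / B)         # geometric spacing: f_k = f0 * 2^(k/B)
bw = fk * (2.0 ** (1 / B) - 1)   # bandwidth proportional to center frequency
Q = fk / bw                      # Q-factor: the same for every bin
```

With B = 12, each bin corresponds to one semitone, so bin 12 lands exactly one octave above f0 (110 Hz here).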
Constant-Q transform

Figure: time-domain window functions and the resulting frequency resolution.
Toolbox for computing CQT

The CQT is essentially a wavelet transform, but with rather high frequency resolution (typically 10-100 bins/octave); conventional wavelet transform techniques cannot be used. A Matlab toolbox for the CQT and inverse CQT [Schörkhuber et al. 2014] is available at http://www.cs.tut.fi/sgn/arg/CQT/

Efficient computation is achieved by:
1. FFT of the entire input signal
2. applying a bandpass, one CQT frequency bin wide, on the (huge) spectrum
3. moving the subband around zero
4. inverse-FFT of the narrowband spectrum to get the CQT coefficients over time for that bin
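The four steps above can be illustrated for a single bin. This is a simplified sketch with a brick-wall bandpass, not the toolbox implementation (which shapes the bin response with a proper window); the sample rate, bin center, and bandwidth are assumed example values:

```python
import numpy as np

def cqt_bin(x, fk, bw, fs):
    """Time series of one CQT bin via the full-signal-FFT route (sketch):
    1. FFT of the entire input signal
    2. one-bin-wide (brick-wall) bandpass on the full spectrum
    3. move the subband around zero (indexing re-centers it for the IFFT)
    4. short inverse FFT -> coefficients over (decimated) time."""
    X = np.fft.fft(x)                                        # step 1
    freqs = np.fft.fftfreq(len(x), 1.0 / fs)
    band = (freqs >= fk - bw / 2) & (freqs <= fk + bw / 2)   # step 2
    sub = X[band]                                            # step 3
    return np.fft.ifft(sub)                                  # step 4

fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 440 * t)
c_on = cqt_bin(x, 440.0, 40.0, fs)    # bin centered on the tone
c_off = cqt_bin(x, 880.0, 80.0, fs)   # bin an octave above: nearly empty
```

Because only a narrow slice of the spectrum is inverse-transformed, the coefficient sequence for each bin is automatically decimated to a rate matching that bin's bandwidth.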
Toolbox for computing CQT

Due to the way the toolbox computes the CQT, the frequency-domain response of an individual frequency bin can be controlled perfectly (no sidelobes), but the effective time-domain window has sidelobes (which extend over the entire signal).

Figure: the response of one time-frequency element as a function of frequency (left) and as a function of time (right).
STFT spectrogram for the same signal (either the high or the low frequencies blur)
Constant-Q transform of the same signal
Drawbacks of CQT

Reasons why the CQT has not more widely replaced the FFT in audio signal processing:
1. the CQT is computationally more intensive than the DFT spectrogram
2. the CQT produces a data structure that is more difficult to handle than the time-frequency matrix (spectrogram)
Example application: Pitch shifting

Pitch shifting is a natural operation in the CQT domain:
1. CQT
2. translate the CQT coefficients up or down in frequency
3. retain phase coherence
4. inverse CQT

The toolbox includes PITCHSHIFT.m to implement this.

Examples: original, -6 semitones, +6 semitones
- transients at high frequencies are retained due to the short frames used there
- time stretching can be done by pitch shifting + resampling
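Because the CQT frequency axis is logarithmic, a pitch shift by s semitones is just a translation of the coefficient matrix by s*B/12 bins. The sketch below shows only that translation step (step 2 above); it ignores phase coherence and the inverse transform, and the bin layout is an assumed toy example:

```python
import numpy as np

def shift_cqt(C, semitones, bins_per_octave=36):
    """Translate CQT coefficients along the log-frequency axis (rows of C).
    Vacated bins are zeroed; phase coherence is ignored in this sketch."""
    shift = int(round(semitones * bins_per_octave / 12))
    out = np.zeros_like(C)
    if shift >= 0:
        out[shift:] = C[:C.shape[0] - shift]
    else:
        out[:shift] = C[-shift:]
    return out

C = np.zeros((48, 4))    # toy CQT matrix: 48 bins, 4 time points
C[10, :] = 1.0           # energy in bin 10
S = shift_cqt(C, 2, bins_per_octave=12)   # +2 semitones -> bin 12
```

At 12 bins per octave, one semitone equals one bin; the toolbox's higher bins-per-octave values allow fractional-semitone shifts with the same translation idea.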
4 Sinusoids plus noise model (recap from SGN-14006)

Signal model: the signal x(t) is represented with N sinusoids (frequency, amplitude, phase) and a noise residual r(t):

x(t) = \sum_{n=1}^{N} a_n(t) \cos(2\pi f_n(t) t + \theta_n(t)) + r(t)

Additive synthesis:
- according to the Fourier theorem, any signal can be represented as a sum of sinusoids
- this makes sense only for periodic signals, for which the number of sinusoids needed is small
- the non-deterministic part would require a large number of sinusoids: use stochastic modeling instead
Sinusoids+noise model: Analysis

Block diagram [Virtanen 2001]:
1. detect sinusoids in the framewise spectra
2. estimate the sinusoid parameters and resynthesize
3. subtract the sinusoids from the original signal
4. model the noise residual

We get:
- the sinusoid parameters
- the noise level at different subbands
Sinusoids+noise model: Detecting and estimating sinusoids

Block diagram: [Virtanen 2001]. Spectral peaks are interpreted as sinusoids:
1. a peak is a local maximum in the magnitude spectrum
2. the peak frequency, amplitude, and phase can be picked from the complex spectrum

Tracking the peaks detected in successive frames gives the parameters of a time-varying sinusoid (a sinusoidal trajectory).
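Steps 1 and 2 can be sketched as a local-maximum search over the magnitude spectrum, with amplitude and phase read from the complex spectrum. The relative threshold is an assumed noise gate, and no interpolation between bins is attempted:

```python
import numpy as np

def pick_peaks(X, rel_thresh=0.1):
    """Interpret local maxima of the magnitude spectrum as sinusoids.
    Returns peak bins, amplitudes, and phases from the complex spectrum."""
    mag = np.abs(X)
    k = np.arange(1, len(X) - 1)
    is_peak = (mag[k] > mag[k - 1]) & (mag[k] > mag[k + 1]) \
              & (mag[k] > rel_thresh * mag.max())
    bins = k[is_peak]
    return bins, mag[bins], np.angle(X[bins])

fs, N = 8000, 1024
n = np.arange(N)
x = 1.0 * np.sin(2 * np.pi * 500 * n / fs) + 0.5 * np.sin(2 * np.pi * 1500 * n / fs)
bins, amps, phases = pick_peaks(np.fft.rfft(x * np.hanning(N)))
# 500 Hz -> bin 64, 1500 Hz -> bin 192 at this frame length
```

In practice the peak frequency would be refined by interpolating around the maximum bin, since real partials rarely fall exactly on a bin.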
Sinusoids+noise model: Tracking the peaks

If needed, spectral peaks in successive frames can be associated and joined into time-varying sinusoids; the frequencies, amplitudes, and phases are joined into curves.

Figure: peak tracking algorithm [Virtanen 2001]
- continuation: based e.g. on the track derivatives, try to form a smooth track
- kill: if no continuation is found, the sinusoid ends
- birth: if a spectral peak is not a continuation of an existing sinusoid, a new one is created
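One step of the continuation/kill/birth logic can be sketched as a nearest-frequency matching rule. This is a hypothetical simplification: the algorithm in [Virtanen 2001] uses track derivatives for smooth continuations, whereas here each live track simply claims the nearest unclaimed peak within an assumed maximum jump:

```python
def continue_tracks(tracks, new_peaks, max_jump=20.0):
    """Advance peak tracks by one frame (simplified sketch).
    tracks: list of frequency curves; new_peaks: peak frequencies in Hz."""
    alive, used = [], set()
    for tr in tracks:
        best = min((p for p in new_peaks if p not in used),
                   key=lambda p: abs(p - tr[-1]), default=None)
        if best is not None and abs(best - tr[-1]) <= max_jump:
            used.add(best)
            alive.append(tr + [best])   # continuation: extend the curve
        # else: kill -- no continuation found, the sinusoid ends here
    for p in new_peaks:
        if p not in used:
            alive.append([p])           # birth: start a new sinusoid
    return alive

tracks = continue_tracks([[440.0], [660.0]], [442.0, 885.0])
# 440 continues to 442; 660 is killed; 885 gives birth to a new track
```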
Sinusoids+noise model: Synthesis of sinusoids

Additive synthesis:

s(t) = \sum_{n=1}^{N} a_n(t) \cos(2\pi f_n(t) t + \theta_n(t))

Often tracking the peaks is not necessary; instead:
- synthesize the sinusoids in each frame separately, keeping the parameters fixed within one frame
- window the obtained signal with a Hann window
- overlap-add
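The track-free recipe above can be sketched directly: generate the sinusoids with fixed parameters inside each frame, Hann-window the result, and overlap-add at 50%. The frame length, hop, and sample rate are assumed example values:

```python
import numpy as np

def synth_sines(frames, frame_len=512, hop=256, fs=8000):
    """Frame-wise additive synthesis: parameters fixed within each frame,
    Hann window, 50% overlap-add."""
    n = np.arange(frame_len)
    w = np.hanning(frame_len)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for m, params in enumerate(frames):   # params: list of (freq, amp, phase)
        s = sum(a * np.cos(2 * np.pi * f * (m * hop + n) / fs + ph)
                for f, a, ph in params)
        out[m * hop : m * hop + frame_len] += s * w
    return out

# ten identical frames of a steady 440 Hz sinusoid: the interior of the
# output should closely match a continuous cosine
y = synth_sines([[(440.0, 1.0, 0.0)]] * 10)
ref = np.cos(2 * np.pi * 440 * np.arange(len(y)) / 8000)
```

Because the overlapped Hann windows sum (almost exactly) to unity, holding the parameters fixed per frame still yields a smooth signal as long as they change slowly between frames.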
Sinusoids+noise model: Synthesis, subtraction from the original

Figure: the synthesized sinusoids vs. the original signal (upper panel); the residual obtained from the subtraction (lower panel).
Sinusoids+noise model: Modeling the noise residual

The residual is obtained by subtracting the synthesized sinusoids from the original signal in the time domain. The residual signal is analyzed frame by frame:
- calculate the spectrum R_t(f) in frame t
- subdivide the spectrum into 25 perceptual subbands (Bark scale)
- calculate the short-time energy within each band b, b = 1, 2, ..., 25:

E_t(b) = \sum_{f \in b} |R_t(f)|^2
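The band energies E_t(b) can be computed by mapping each FFT bin to a Bark band and accumulating. The slides only say "25 perceptual subbands"; Traunmüller's approximation of the Bark scale is an assumed choice here:

```python
import numpy as np

def bark_band(f):
    """Traunmüller's Bark-scale approximation (an assumed choice)."""
    return (26.81 * f / (1960.0 + f)) - 0.53

def band_energies(R, fs):
    """Short-time energy of the residual spectrum R within each Bark band:
    E_t(b) = sum of |R_t(f)|^2 over the bins f belonging to band b."""
    freqs = np.fft.rfftfreq(2 * (len(R) - 1), 1 / fs)
    bands = np.clip(bark_band(freqs).astype(int), 0, 24)
    E = np.zeros(25)
    np.add.at(E, bands, np.abs(R) ** 2)   # accumulate energy per band
    return E

rng = np.random.default_rng(0)
R = np.fft.rfft(rng.standard_normal(512))   # spectrum of one residual frame
E = band_energies(R, 8000)
```

Since every bin lands in exactly one band, the total energy is preserved: the 25 numbers summarize the whole residual spectrum of the frame.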
Sinusoids+noise model: Noise synthesis from the parameters

The noise residual is represented parametrically:
- in each frame, store only the short-time energies within the Bark bands, E_t(b)
- this modeling can be done because the auditory system is not sensitive to energy changes within one Bark band in the case of noise

Synthesis:
1. generate a magnitude spectrum where the energy within each Bark band is shared uniformly within the band: |R_t(f)|^2 = E_t(b) / K_b for each of the K_b bins f in band b
2. generate random phases
3. inverse Fourier transform to the time domain
4. window with a Hann window
5. overlap-add
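Steps 1-4 of the synthesis recipe can be sketched for one frame. The band-to-bin mapping is passed in as an array, and sharing the band energy uniformly over the band's K_b bins is an assumed normalization (the slides do not spell it out):

```python
import numpy as np

def synth_noise_frame(E, bands, frame_len):
    """Synthesize one noise frame from Bark-band energies E (sketch):
    flat magnitude within each band, random phases, inverse FFT, Hann window.
    bands[i] gives the Bark band index of FFT bin i."""
    rng = np.random.default_rng()
    n_bins = frame_len // 2 + 1
    mag = np.zeros(n_bins)
    for b in range(len(E)):
        idx = np.where(bands == b)[0]
        if len(idx):
            mag[idx] = np.sqrt(E[b] / len(idx))   # uniform energy in the band
    phase = rng.uniform(-np.pi, np.pi, n_bins)    # random phases
    frame = np.fft.irfft(mag * np.exp(1j * phase), frame_len)
    return frame * np.hanning(frame_len)          # step 5: overlap-add frames

bands = np.zeros(257, dtype=int)   # toy mapping: every bin in band 0
frame = synth_noise_frame(np.array([100.0]), bands, 512)
```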
Sinusoids+noise model: Properties

Audio examples: http://www.cs.tut.fi/sgn/arg/music/tuomasv/sinusoids.html

The sinusoids+noise model has several nice properties:
- it satisfies the component reduction property (see the desirable properties listed earlier)
- invertibility: the synthesized signal has reasonable quality (transient sounds are problematic)
- the model is generic: any sound can be processed
- it is straightforward to compute (especially if peak tracking is skipped)
- manipulations such as time stretching and pitch shifting are easy

The representation also supports sound source separation to some extent; see the next slides.
Intelligent component grouping

Auditory organization in humans has been found to depend on certain acoustic cues. Two components may be grouped, i.e., associated with a common sound source, by:
1. Spectral proximity (time, frequency)
2. Harmonic concordance
   - component frequencies in integer ratios
   - such components are deduced to be produced by a common source
3. Synchronous changes of the components
   - common onset / offset
   - common AM / FM modulation
   - equidirectional movement in the spectrum
4. Spatial proximity (angle of arrival)

The cues may compete and conflict.
Example: grouping sinusoids

The sinusoidal model reveals the cues better than the time-domain signal.

Figure: the original signal and its sinusoidal components.
Sound separation: grouping implemented

Estimate the perceptual distance between each pair of spectral components; the sinusoids are then classified into groups.
Sound separation: grouping implemented (continued)

The classified sets of sinusoids can be synthesized separately.
5 Perceptually-motivated representations

Peripheral hearing:
1. Frequency selectivity of the inner ear
   - a bank of linear bandpass filters ("auditory channels")
2. Mechanical-to-neural transduction
   - compression, rectification, lowpass filtering
   - detailed models exist, too

In the brain, for pitch processing:
3. Periodicity analysis within channels
4. Combination across channels
   - between-channel phase differences do not affect the result

Figure: the signal in the auditory nerve; the brain (not directly observable, yet).
Perceptually-motivated representations

The signal traveling in the auditory nerve fibers from the auditory periphery to the brain can be viewed as a mid-level representation. The idea of using the same data representation as the human auditory system is very appealing, and the auditory periphery is quite accurately known.
Auditory filterbank

Band-wise processing is an inherent part of hearing.

Figure: frequency responses (top) and impulse response (bottom) of a few auditory (gammatone) filters [Slaney93]. The bandwidths are proportional to the center frequency:

b_c = 0.108 f_c + 24.7 Hz
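The bandwidth formula above can be evaluated at a few center frequencies to see how the filterbank approaches constant Q at high frequencies (where the 24.7 Hz constant term becomes negligible). The center frequencies below are assumed example values:

```python
import numpy as np

fc = np.array([100.0, 500.0, 1000.0, 4000.0])   # example center frequencies
bw = 0.108 * fc + 24.7   # auditory filter bandwidth: proportional + constant
Q = fc / bw              # Q-factor grows with fc, leveling off at high fc
```

At 1000 Hz the bandwidth is 132.7 Hz; above roughly 500 Hz the Q-factor changes slowly, consistent with the "approximately constant-Q above 500 Hz" statement earlier.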
Mechanical-to-neural transduction

Simplified model:
a. compression (and level adaptation)
b. half-wave rectification
c. lowpass filtering

Compression is memoryless: scale the signal within channel c by the factor

a_c = (\sigma_c)^{\nu - 1}

where \sigma_c is the standard deviation of the signal within channel c. This flattens (whitens) the spectrum when 0 < \nu < 1.
Mechanical-to-neural transduction: Half-wave rectification

Half-wave rectification within the subbands produces the input partials plus "beating" partials at the frequency intervals between the input partials.
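The beating partials can be demonstrated numerically: half-wave rectifying two partials in one band creates a spectral component at their frequency difference. The frequencies and sample rate are assumed example values:

```python
import numpy as np

fs, N = 8000, 8000
t = np.arange(N) / fs
x = np.cos(2 * np.pi * 600 * t) + np.cos(2 * np.pi * 800 * t)  # two partials

hwr = np.maximum(x, 0.0)        # half-wave rectification within the subband
S = np.abs(np.fft.rfft(hwr))    # 1 Hz per bin at this signal length

# the input partials (600, 800 Hz) survive, and a beating partial appears
# at the frequency interval between them: 800 - 600 = 200 Hz
```

The original signal has no energy at 200 Hz; the component appears purely through the nonlinearity, which is how rectification moves periodicity information toward the F0 position.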
Mechanical-to-neural transduction: Half-wave rectification (continued)

Rectification maps the contribution of higher-order partials to the position of the F0 and its first few multiples in the spectrum. The extent to which harmonic h is mapped to the position of the fundamental increases as a function of h.

Figure: a harmonic sound (left) and amplitude-modulated noise (right).
Autocorrelation within channels

Autocorrelation within channel c, at time n and lag \tau:

r_c(n, \tau) = \sum_{m=0}^{M-1} x_c(n + m)\, x_c(n + m + \tau)

Summary autocorrelation: combining across channels by summing the r_c over the channels c.
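The within-channel autocorrelation and the cross-channel summation can be sketched as follows; the three toy "channels" below simply carry the harmonics 200, 400, and 600 Hz of a common fundamental (assumed example values), so all channels agree on the period:

```python
import numpy as np

def acf(x, max_lag):
    """Short-time autocorrelation r(tau) = sum_m x(m) x(m + tau), one frame."""
    return np.array([np.dot(x[:len(x) - tau], x[tau:])
                     for tau in range(max_lag)])

fs = 8000
t = np.arange(2048) / fs
channels = [np.cos(2 * np.pi * f * t) for f in (200.0, 400.0, 600.0)]

# periodicity analysis within channels, then combination across channels
summary = sum(acf(c, 200) for c in channels)   # summary autocorrelation

# every channel peaks at the common period: fs / 200 Hz = 40 samples
period = int(summary[1:].argmax()) + 1
```

The per-channel lags where a channel disagrees cancel partially in the sum, while the common period reinforces, which is the basis of summary-ACF pitch estimation.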
Correlogram

Performing periodicity analysis within the critical bands produces a three-dimensional volume r_c(n, \tau), for channel c at time n and lag \tau.

The figure below illustrates the correlogram. The input signal was a trumpet sound with F0 about 260 Hz (period 3.8 ms):
- left: the 3D correlogram volume
- middle: the zero-lag face of the correlogram (= the power spectrogram)
- right: one time slice of the volume, from which the summary ACF can be obtained by summing over frequency