EE E6820: Speech & Audio Processing & Recognition
Lecture 6: Nonspeech and Music
Dan Ellis <[email protected]>, Michael Mandel <[email protected]>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 26, 2009
1 Music & nonspeech
2 Environmental Sounds
3 Music Synthesis Techniques
4 Sinewave Synthesis
E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 1 / 30
Music & nonspeech
What is ‘nonspeech’?
- according to research effort: a little music
- in the world: most everything

[Figure: sound classes arranged by origin (natural to man-made) and information content (low to high): wind & water, animal sounds, speech, music, machines & engines, contact/collision]

Attributes?
Sound attributes
Attributes suggest model parameters
What do we notice about ‘general’ sound?
- psychophysics: pitch, loudness, ‘timbre’
- bright/dull; sharp/soft; grating/soothing
- sound is not ‘abstract’: the tendency is to describe it by source events

Ecological perspective
- what matters about sound is ‘what happened’
  → our percepts express this more-or-less directly
Motivations for modeling
Describe/classify
- cast sound into a model because we want to use the resulting parameters

Store/transmit
- model implicitly exploits limited structure of signal

Resynthesize/modify
- model separates out interesting parameters

[Figure: Sound, Model, parameter space]
Analysis and synthesis
Analysis is the converse of synthesis:
[Figure: Sound and Model / representation, linked by Analysis and Synthesis]

Can exist apart:
- analysis for classification
- synthesis of artificial sounds

Often used together:
- encoding/decoding of compressed formats
- resynthesis based on analyses
- analysis-by-synthesis
Environmental Sounds
Where sound comes from: mechanical interactions
- contact / collisions
- rubbing / scraping
- ringing / vibrating

Interest in environmental sounds
- carry information about events around us
  .. including indirect hints
- need to create them in virtual environments
  .. including soundtracks

Approaches to synthesis
- recording / sampling
- synthesis algorithms
Collision sounds

Factors influencing:
- colliding bodies: size, material, damping
- local properties at contact point (hardness)
- energy of collision

Source-filter model
- "source" = excitation of collision event (energy, local properties at contact)
- "filter" = resonance and radiation of energy (body properties)

Variety of strike/scraping sounds
- resonant freqs ∼ size/shape
- damping ∼ material
- HF content in excitation/strike ∼ mallet, force
- (from Gaver, 1993)
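The source-filter view of a collision can be sketched in a few lines. This is an illustrative sketch only, not Gaver's algorithm: the function name `strike`, the mode list, and the mapping from `hardness` to pulse length are all assumptions made for the example. The "source" is a short pulse whose length shrinks with contact hardness, and the "filter" is a sum of damped resonances.

```python
import numpy as np

def strike(sr=16000, dur=0.5,
           modes=((440.0, 8.0), (1130.0, 15.0), (2250.0, 30.0)),
           hardness=0.5, energy=1.0):
    """Source-filter sketch of a struck object.

    modes: (resonant frequency in Hz, decay rate in 1/s) pairs, the "filter".
    hardness in [0, 1]: harder contact gives a shorter, brighter pulse (assumed mapping).
    """
    n = int(sr * dur)
    t = np.arange(n) / sr
    # Source: short raised-cosine pulse scaled by the collision energy.
    pulse_len = int(sr * 0.004 * (1.0 - hardness)) + 2
    source = np.zeros(n)
    source[:pulse_len] = energy * np.hanning(pulse_len + 2)[1:-1]
    # Filter: impulse response of the body as a sum of damped sinusoids.
    body = sum(np.exp(-d * t) * np.sin(2 * np.pi * f * t) for f, d in modes)
    # Excite the body with the source, truncated to the note length.
    return np.convolve(source, body)[:n]

y = strike()
spectrum = np.abs(np.fft.rfft(y))
peak_hz = np.argmax(spectrum) * 16000 / len(y)       # strongest resonance
head_energy = float(np.sum(y[:2000] ** 2))
tail_energy = float(np.sum(y[-2000:] ** 2))
```

Changing the decay rates imitates a change of material, scaling the mode frequencies imitates a change of size, and raising `hardness` adds high-frequency content to the excitation.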
Sound textures
What do we hear in:
- a city street
- a symphony orchestra

How do we distinguish:
- waterfall
- rainfall
- applause
- static

[Figure: spectrograms of "Applause04" and "Rain01"; time 0-4 s, freq 0-5000 Hz]

Levels of ecological description...
Sound texture modeling (Athineos)
Model broad spectral structure with LPC
- could just resynthesize with noise

Model fine temporal structure in residual with linear prediction in time domain

[Block diagram: Sound y[n] → TD-LP → whitened residual e[n] → DCT → residual spectrum E[k] → FD-LP. TD-LP gives per-frame spectral parameters; FD-LP gives per-frame temporal envelope parameters.]

TD-LP: $y[n] = \sum_i a_i \, y[n-i]$
FD-LP: $E[k] = \sum_i b_i \, E[k-i]$

- precise dual of LPC in frequency
- ‘poles’ model temporal events

[Figure: temporal envelopes (40 poles, 256 ms); amplitude vs. time / sec]

Allows modification / synthesis?
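A minimal sketch of the frequency-domain linear prediction idea, under simplifying assumptions: it applies FD-LP directly to the signal rather than to a TD-LP residual, uses the autocorrelation method with a small ridge term, and builds the DCT explicitly. The names `lpc` and `fdlp_envelope` are made up for this example. The all-pole "spectrum" of the predictor fitted to the DCT coefficients is read out as a function of time.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method linear prediction: x[k] ~ sum_i b[i] * x[k-1-i]."""
    r = np.array([np.dot(x[: len(x) - m], x[m:]) for m in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + 1e-9 * r[0] * np.eye(order)          # tiny ridge for numerical safety
    return np.linalg.solve(R, r[1 : order + 1])

def fdlp_envelope(y, order=8):
    """Temporal envelope of y from linear prediction applied to its DCT."""
    N = len(y)
    n = np.arange(N)
    theta = np.pi * (2 * n + 1) / (2 * N)        # DCT-II angle for each time sample
    E = np.cos(np.outer(n, theta)) @ y           # E[k] = sum_n y[n] cos(k * theta_n)
    b = lpc(E, order)
    # Evaluate 1 / |1 - sum_i b_i e^{-j i theta}|^2 at each sample's angle.
    z = np.exp(-1j * np.outer(theta, np.arange(1, order + 1)))
    return 1.0 / np.abs(1.0 - z @ b) ** 2

rng = np.random.default_rng(0)
y = 0.001 * rng.standard_normal(512)
y[200] += 1.0                                    # one sharp temporal event
env = fdlp_envelope(y)
peak = int(np.argmax(env))                       # should land near sample 200
```

A sharp event in time becomes a sinusoid across DCT index, so a pole pair of the predictor marks the event's position, which is the sense in which poles model temporal events.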
Music synthesis techniques
What is music?
- could be anything → flexible synthesis needed!

Key elements of conventional music
- instruments → note-events (time, pitch, accent level) → melody, harmony, rhythm
- patterns of repetition & variation

Synthesis framework:
- instruments: common framework for many notes
- score: sequence of (time, pitch, level) note events
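One way to picture the score half of this framework is as a list of note events. The sketch below is illustrative: the `Note` class and the use of MIDI note numbers with A4 = 440 Hz are assumptions for the example, not something the lecture prescribes.

```python
from dataclasses import dataclass

@dataclass
class Note:
    time: float    # onset in seconds
    pitch: int     # MIDI note number (69 = A4), an assumed encoding
    level: float   # accent level, 0..1

def midi_to_hz(pitch: int) -> float:
    """Equal-tempered frequency, assuming A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)

# A three-note score: A4, C#5, E5.
score = [Note(0.0, 69, 0.8), Note(0.5, 73, 0.6), Note(1.0, 76, 0.9)]
freqs = [round(midi_to_hz(n.pitch), 1) for n in score]
```

An instrument is then any routine that turns one `Note` into a waveform, so the same score can drive different synthesis techniques.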
[Figure: score excerpt for soprano, alto, tenor, and bass: G. F. Handel, Messiah, no. 44 chorus, "Hallelujah!" (Allegro). © 2000 by Score on Line S.A., http://www.score-on-line.com]
The nature of musical instrument notes

Characterized by instrument (register), note, loudness/emphasis, articulation...

[Figure: spectrograms of Piano, Violin, Clarinet, and Trumpet notes; time 0-4 s, frequency 0-4000 Hz]

Distinguish how?
Development of music synthesis
Goals of music synthesis:
- generate realistic / pleasant new notes
- control / explore timbre (quality)

Earliest computer systems in 1960s (voice synthesis, algorithmic)

Pure synthesis approaches:
- 1970s: Analog synths
- 1980s: FM (Stanford/Yamaha)
- 1990s: Physical modeling, hybrids

Analysis-synthesis methods:
- sampling / wavetables
- sinusoid modeling
- harmonics + noise (+ transients)

others?
Analog synthesis
The minimum to make an ‘interesting’ sound
[Block diagram: Oscillator → Filter → Gain → Sound, with Pitch, Trigger, Vibrato, and Envelope signals controlling oscillator frequency, filter cutoff freq, and gain]

Elements:
- harmonics-rich oscillators
- time-varying filters
- time-varying envelope
- modulation: low frequency + envelope-based

Result:
- time-varying spectrum, independent pitch
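The elements above can be combined into a small digital imitation of an analog voice. This is a rough sketch under assumed parameter values: the function names, the naive (aliasing) sawtooth, the one-pole low-pass, and the envelope shape are all choices made for the example.

```python
import numpy as np

def analog_note(f0=220.0, dur=1.0, sr=16000, vib_hz=5.0, vib_depth=0.01):
    n = int(sr * dur)
    t = np.arange(n) / sr
    # Envelope: fast linear attack, exponential decay, triggered at t = 0.
    env = np.minimum(t / 0.01, 1.0) * np.exp(-3.0 * t)
    # Pitch plus vibrato (low-frequency modulation) sets instantaneous frequency.
    freq = f0 * (1.0 + vib_depth * np.sin(2 * np.pi * vib_hz * t))
    phase = np.cumsum(freq) / sr
    saw = 2.0 * (phase % 1.0) - 1.0              # harmonics-rich oscillator
    # Time-varying one-pole low-pass whose cutoff follows the envelope.
    cutoff = 200.0 + 4000.0 * env
    alpha = 1.0 - np.exp(-2 * np.pi * cutoff / sr)
    out = np.zeros(n)
    state = 0.0
    for i in range(n):
        state += alpha[i] * (saw[i] - state)
        out[i] = state
    return out * env                             # the envelope also drives the gain

def spectral_centroid(seg, sr=16000):
    spec = np.abs(np.fft.rfft(seg * np.hanning(len(seg))))
    freqs = np.fft.rfftfreq(len(seg), 1.0 / sr)
    return float(np.sum(freqs * spec) / np.sum(spec))

note = analog_note()
early = spectral_centroid(note[800:2848])        # bright just after the attack
late = spectral_centroid(note[12000:14048])      # duller as the filter closes
```

The pitch is set by the oscillator alone while the filter and envelope shape the spectrum over time, which is the "time-varying spectrum, independent pitch" result.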
FM synthesis

Fast frequency modulation → sidebands:

$$\cos(\omega_c t + \beta \sin(\omega_m t)) = \sum_{n=-\infty}^{\infty} J_n(\beta) \cos((\omega_c + n\omega_m)t)$$

- a harmonic series if $\omega_c = r\,\omega_m$ (for integer $r$)

$J_n(\beta)$ is a Bessel function:

[Figure: $J_0$ to $J_4$ plotted against modulation index β from 0 to 9; $J_n(\beta) \approx 0$ for $\beta < n - 2$]

→ Complex harmonic spectra by varying β

[Figure: spectrogram of an FM tone; time 0-0.8 s, freq 0-4000 Hz]
- $\omega_c$ = 2000 Hz, $\omega_m$ = 200 Hz
- what use?
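The sideband identity can be checked numerically. The sketch below uses the slide's 2000 Hz carrier and 200 Hz modulator; the modulation index β = 2, the sample rate, and the helper `bessel_j` (which evaluates Bessel's integral rather than calling a library routine) are assumptions for the example.

```python
import numpy as np

def bessel_j(n, beta, steps=4096):
    """J_n(beta) via Bessel's integral (1/pi) * int_0^pi cos(n*tau - beta*sin(tau)) dtau."""
    tau = (np.arange(steps) + 0.5) * np.pi / steps       # midpoint rule
    return float(np.mean(np.cos(n * tau - beta * np.sin(tau))))

sr, fc, fm, beta = 16000, 2000.0, 200.0, 2.0
t = np.arange(sr) / sr                                   # exactly one second
x = np.cos(2 * np.pi * fc * t + beta * np.sin(2 * np.pi * fm * t))

# With a 1 s signal, FFT bins are 1 Hz apart, so each sideband sits on a bin.
spectrum = np.abs(np.fft.rfft(x)) / (sr / 2)             # amplitude per cosine
sidebands = {n: spectrum[int(fc + n * fm)] for n in range(-4, 5)}
predicted = {n: abs(bessel_j(n, beta)) for n in range(-4, 5)}
worst = max(abs(sidebands[n] - predicted[n]) for n in sidebands)
```

Sweeping `beta` over the course of a note moves energy between sidebands, which is how a single modulator/carrier pair produces an evolving harmonic spectrum.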
Sampling synthesis

Resynthesis from real notes → vary pitch, duration, level

Pitch: stretch (resample) waveform
[Figure: the same waveform over 0-0.008 s at 596 Hz and resampled to 894 Hz]

Duration: loop a ‘sustain’ section
[Figure: note waveform over 0-0.3 s, with zoomed views near 0.174-0.176 s and 0.204-0.206 s]

Level: cross-fade different examples
[Figure: ‘Soft’ and ‘Loud’ example waveforms over 0-0.15 s, mixed according to velocity]
- need to ‘line up’ source samples
- good & bad?
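Pitch change by resampling can be sketched with linear interpolation. The 596 Hz to 894 Hz ratio is taken from the figure; the synthetic sine "note", the sample rate, and the function names are assumptions, and a real sampler would use a better interpolator.

```python
import numpy as np

def resample_pitch(x, ratio):
    """Read the waveform `ratio` times faster using linear interpolation.

    ratio > 1 raises the pitch and shortens the note by the same factor.
    """
    positions = np.arange(0, len(x) - 1, ratio)          # fractional read positions
    return np.interp(positions, np.arange(len(x)), x)

sr = 16000

def peak_hz(x):
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    return float(np.argmax(spec) * sr / len(x))

t = np.arange(sr // 2) / sr                              # 0.5 s stand-in for a sampled note
note = np.sin(2 * np.pi * 596.0 * t)
shifted = resample_pitch(note, 894.0 / 596.0)            # ratio 1.5
f_in, f_out = peak_hz(note), peak_hz(shifted)
```

The shifted note is also 1.5 times shorter, which is why sampling synthesis pairs resampling with looping of a sustain section to control duration separately.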
Sinewave synthesis
If patterns of harmonics are what matter, why not generate them all explicitly:

$$s[n] = \sum_k A_k[n] \cos(k \cdot \omega_0[n] \cdot n)$$

- particularly powerful model for pitched signals

Analysis (as with speech):
- find peaks in STFT $|S[\omega, n]|$ & track
- or track fundamental $\omega_0$ (harmonics / autocorrelation) & sample STFT at $k \cdot \omega_0$

→ set of $A_k[n]$ to duplicate tone:

[Figure: spectrogram (time 0-0.2 s, freq 0-8000 Hz) and harmonic magnitudes plotted against time / s and freq / Hz]

Synthesis via bank of oscillators
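A bank of harmonic oscillators might look like the sketch below. One deliberate difference from the literal formula: the phase is accumulated from the instantaneous frequency (a running sum of ω0[n]) rather than computed as ω0[n]·n, an assumption made here so that a time-varying fundamental gives a continuous phase. The function name and test tone are illustrative.

```python
import numpy as np

def harmonic_bank(f0, amps, sr=16000):
    """Sum of harmonics k*f0[n] with per-sample amplitudes amps[k-1, n]."""
    f0 = np.asarray(f0, dtype=float)
    amps = np.asarray(amps, dtype=float)
    phase = 2 * np.pi * np.cumsum(f0) / sr               # fundamental phase, radians
    k = np.arange(1, amps.shape[0] + 1)[:, None]         # harmonic numbers as a column
    return np.sum(amps * np.cos(k * phase[None, :]), axis=0)

sr, n = 16000, 8000
f0 = np.full(n, 200.0)                                   # constant 200 Hz fundamental
decay = np.exp(-4.0 * np.arange(n) / sr)
amps = np.stack([decay / k for k in (1, 2, 3, 4)])       # 1/k rolloff, shared decay
tone = harmonic_bank(f0, amps, sr)
spec = np.abs(np.fft.rfft(tone))                         # 2 Hz per bin
```

For a constant fundamental the two phase definitions differ only by a fixed offset per harmonic.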
Steps to sinewave modeling - 1
The underlying STFT:
$$X[k, n_0] = \sum_{n=0}^{N-1} x[n + n_0] \cdot w[n] \cdot \exp\left(-j \frac{2\pi k n}{N}\right)$$

- what value for N (FFT length & window size)?
- what value for H (hop size: $n_0 = r \cdot H$, r = 0, 1, 2, ...)?

STFT window length determines freq. resolution:

$$X_w(e^{j\omega}) = X(e^{j\omega}) * W(e^{j\omega})$$

Choose N long enough to resolve harmonics
→ 2-3x longest (lowest) fundamental period
- e.g. 30-60 ms = 480-960 samples @ 16 kHz
- choose H ≤ N/2

N too long → lost time resolution
- limits sinusoid amplitude rate of change
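The STFT definition translates almost directly into code. This is a minimal sketch assuming a Hann window and N = 512, H = 256 (so H = N/2); the function name and the 1 kHz test tone are illustrative choices.

```python
import numpy as np

def stft(x, N=512, H=256):
    """X[k, r] = sum_n x[n + r*H] * w[n] * exp(-j*2*pi*k*n/N), with a Hann window w."""
    w = np.hanning(N)
    starts = range(0, len(x) - N + 1, H)                 # n0 = r * H
    frames = np.stack([x[s : s + N] * w for s in starts], axis=1)
    return np.fft.rfft(frames, axis=0)                   # shape (N//2 + 1, frames)

sr = 16000
# Window-length arithmetic from the slide: 30-60 ms at 16 kHz.
window_samples = [ms * sr // 1000 for ms in (30, 60)]

x = np.sin(2 * np.pi * 1000.0 * np.arange(sr) / sr)      # 1 s test tone at 1 kHz
X = stft(x, N=512, H=256)
peak_bin = int(np.argmax(np.abs(X[:, 0])))               # bin spacing is sr/N = 31.25 Hz
```

Doubling N halves the bin spacing but also doubles the span of signal that each frame averages over, which is the time/frequency trade-off described above.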
Steps to sinewave modeling - 2

Choose candidate sinusoids at each time by picking peaks in each STFT frame:

[Figure: spectrogram (time 0-0.18 s, freq 0-8000 Hz) and the spectrum of one frame (level / dB, -60 to 20, vs. freq 0-7000 Hz)]

Quadratic fit for peak:

[Figure: level / dB and phase / rad vs. freq (400-800 Hz) around one peak, with the parabola $y = a x (x - b)$: vertex at $x = b/2$, height $-a b^2 / 4$ (positive, since $a < 0$ at a peak)]

+ linear interpolation of unwrapped phase
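The quadratic fit can be sketched with the usual three-point parabola through a local maximum of the dB spectrum. The test frequency, window, and function name are assumptions for the example.

```python
import numpy as np

def parabolic_peak(mag_db, k):
    """Refine a local maximum at bin k using its two neighbours.

    Returns (fractional bin position, interpolated level in dB).
    """
    lo, mid, hi = mag_db[k - 1], mag_db[k], mag_db[k + 1]
    offset = 0.5 * (lo - hi) / (lo - 2.0 * mid + hi)     # vertex of the fitted parabola
    return k + offset, mid - 0.25 * (lo - hi) * offset

sr, N, f_true = 16000, 1024, 1010.0                      # 1010 Hz falls between bins
x = np.cos(2 * np.pi * f_true * np.arange(N) / sr)
mag_db = 20 * np.log10(np.abs(np.fft.rfft(x * np.hanning(N))) + 1e-12)
k = int(np.argmax(mag_db))
pos, level = parabolic_peak(mag_db, k)
f_raw = k * sr / N                                       # nearest-bin estimate
f_est = pos * sr / N                                     # interpolated estimate
```

Fitting in dB rather than linear magnitude is a common choice because the main lobe of smooth windows is closer to a parabola on a log scale.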
Steps to sinewave modeling - 3
Which peaks to pick?

Want ‘true’ sinusoids, not noise fluctuations
- ‘prominence’ threshold above smoothed spectrum

[Figure: spectrum, level / dB (-60 to 20) vs. freq (0-7000 Hz)]

Sinusoids exhibit stability...
- of amplitude in time
- of phase derivative in time
→ compare with adjacent time frames to test?
Steps to sinewave modeling - 4
‘Grow’ tracks by appending newly-found peaks to existing tracks:

[Figure: freq vs. time, showing existing tracks, new peaks, and the ‘birth’ and ‘death’ of tracks]
- ambiguous assignments possible

Unclaimed new peak
- ‘birth’ of new track
- backtrack to find earliest trace?

No continuation peak for existing track
- ‘death’ of track
- or: reduce peak threshold for hysteresis
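One simple way to grow tracks is greedy nearest-frequency matching, sketched below. The greedy rule, the `max_jump_hz` limit, and the function name are assumptions for illustration; the lecture does not fix a particular assignment strategy, and this sketch omits backtracking and hysteresis.

```python
def continue_tracks(tracks, peaks, max_jump_hz=50.0):
    """Greedy peak-to-track assignment for one new frame.

    tracks: list of lists of frequencies (Hz); the last entry is the current end.
    peaks:  frequencies found in the new frame.
    Returns (alive, dead): extended or newborn tracks, and tracks that ended.
    """
    candidates = sorted(
        (abs(track[-1] - f), ti, pi)
        for ti, track in enumerate(tracks)
        for pi, f in enumerate(peaks)
        if abs(track[-1] - f) <= max_jump_hz
    )
    used_tracks, used_peaks = set(), set()
    alive = []
    for _, ti, pi in candidates:                 # closest pairs claim each other first
        if ti not in used_tracks and pi not in used_peaks:
            used_tracks.add(ti)
            used_peaks.add(pi)
            alive.append(tracks[ti] + [peaks[pi]])
    dead = [t for ti, t in enumerate(tracks) if ti not in used_tracks]     # 'death'
    alive += [[f] for pi, f in enumerate(peaks) if pi not in used_peaks]   # 'birth'
    return alive, dead

tracks = [[440.0, 442.0], [880.0, 878.0], [1500.0]]
peaks = [445.0, 881.0, 2200.0]
alive, dead = continue_tracks(tracks, peaks)
```

Here the 1500 Hz track finds no peak within range and dies, while the unclaimed 2200 Hz peak starts a new track.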
Resynthesis of sinewave models

After analysis, each track defines contours in frequency and amplitude, $f_k[n]$, $A_k[n]$ (+ phase?)
- use to drive sinewave oscillators & sum up

[Figure: one track's $f_k[n]$ (500-700 Hz) and $A_k[n]$ contours over 0-0.2 s, and the oscillator output $A_k[n] \cdot \cos(2\pi f_k[n] \cdot t)$]

‘Regularize’ to exactly harmonic $f_k[n] = k \cdot f_0[n]$

[Figure: track frequencies over 0-0.2 s (freq 0-6000 Hz, with detail 550-700 Hz)]
Modification in sinewave resynthesis
Change duration by warping timebase
- may want to keep onset unwarped

[Figure: spectrogram of a time-warped note; time 0-0.5 s, freq 0-5000 Hz]

Change pitch by scaling frequencies
- either stretching or resampling envelope

[Figure: two spectra, level / dB (0-40) vs. freq (0-4000 Hz)]

Change timbre by interpolating parameters
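Because the model's parameters are explicit contours, duration and pitch can be changed independently. The sketch below uses a single hypothetical track (a 600 to 650 Hz glide with a decaying amplitude); the function names are assumptions, and the whole contour is warped uniformly, so the onset is not protected.

```python
import numpy as np

def synth_track(freq_hz, amp, sr):
    """One oscillator driven by per-sample frequency and amplitude contours."""
    phase = 2 * np.pi * np.cumsum(freq_hz) / sr
    return amp * np.cos(phase)

def warp_contour(c, stretch):
    """Resample a parameter contour to `stretch` times its original length."""
    n_out = int(round(len(c) * stretch))
    src = np.linspace(0, len(c) - 1, n_out)
    return np.interp(src, np.arange(len(c)), c)

sr, n = 16000, 4000

def peak_hz(x):
    return float(np.argmax(np.abs(np.fft.rfft(x))) * sr / len(x))

freq = np.linspace(600.0, 650.0, n)                      # slow upward glide
amp = np.exp(-6.0 * np.arange(n) / sr)

longer = synth_track(warp_contour(freq, 2.0), warp_contour(amp, 2.0), sr)  # 2x duration
higher = synth_track(1.5 * freq, amp, sr)                                  # frequencies x 1.5
f_longer, f_higher = peak_hz(longer), peak_hz(higher)
```

Unlike resampling a waveform, stretching the contours leaves the pitch where it was, and scaling the frequencies leaves the duration where it was.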
Sinusoids + residual
Only ‘prominent peaks’ became tracks
- remainder of spectral energy was noisy?
→ model residual energy with noise

How to obtain ‘non-harmonic’ spectrum?
- zero-out spectrum near extracted peaks?
- or: resynthesize (exactly) & subtract waveforms

$$e_s[n] = s[n] - \sum_k A_k[n] \cos(2\pi n \cdot f_k[n])$$

.. must preserve phase!

[Figure: spectra (mag / dB, -80 to 20, vs. freq 0-7000 Hz) of the original, the sinusoids, the residual, and its LPC fit]

Can model residual signal with LPC
→ flexible representation of noisy residual
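The need to preserve phase when subtracting waveforms is easy to demonstrate. In this sketch the signal is one 500 Hz sinusoid plus white noise, and the track frequency is assumed known; amplitude and phase come from a least-squares fit, which is a stand-in for whatever the analysis stage actually provides.

```python
import numpy as np

sr, n = 16000, 2048
t = np.arange(n) / sr
rng = np.random.default_rng(0)
noise = 0.05 * rng.standard_normal(n)
s = 0.8 * np.cos(2 * np.pi * 500.0 * t + 1.0) + noise    # sinusoid (phase 1 rad) + noise

# Least-squares fit of amplitude and phase at the known track frequency.
basis = np.stack([np.cos(2 * np.pi * 500.0 * t),
                  np.sin(2 * np.pi * 500.0 * t)], axis=1)
(c, d), *_ = np.linalg.lstsq(basis, s, rcond=None)
amp, phase = float(np.hypot(c, d)), float(np.arctan2(-d, c))

def rms(v):
    return float(np.sqrt(np.mean(v ** 2)))

with_phase = rms(s - amp * np.cos(2 * np.pi * 500.0 * t + phase))
without_phase = rms(s - amp * np.cos(2 * np.pi * 500.0 * t))   # phase discarded
```

With the phase kept, the residual is essentially the noise alone; with the phase dropped, most of the sinusoid's energy stays in the "residual".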
Sinusoids + noise + transients
Sound represented as sinusoids and noise:
$$s[n] = \underbrace{\sum_k A_k[n] \cos(2\pi n \cdot f_k[n])}_{\text{Sinusoids}} + \underbrace{h_n[n] * b[n]}_{\text{Residual } e_s[n]}$$

Parameters are $A_k[n]$, $f_k[n]$, $h_n[n]$

[Figure: three spectrogram panels; time 0-0.6 s, freq 0-8000 Hz]

Separate out abrupt transients in residual?

$$e_s[n] = \sum_k t_k[n] + h_n[n] * b'[n]$$

- more specific → more flexible
Summary
‘Nonspeech audio’
- i.e. sound in general
- characteristics: ecological

Music synthesis
- control of pitch, duration, loudness, articulation
- evolution of techniques
- sinusoids + noise + transients

Music analysis... and beyond?
References
W. W. Gaver. Synthesizing auditory icons. In Proc. Conference on Human Factors in Computing Systems INTERCHI-93, pages 228–235. Addison-Wesley, 1993.
M. Athineos and D. P. W. Ellis. Autoregressive modeling of temporal envelopes. IEEE Tr. Signal Proc., 55(11):5237–5245, 2007. URL http://www.ee.columbia.edu/~dpwe/pubs/AthinE07-fdlp.pdf.
X. Serra and J. Smith III. Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic Plus Stochastic Decomposition. Computer Music Journal, 14(4):12–24, 1990.
T. S. Verma and T. H. Y. Meng. An analysis/synthesis tool for transient signals that allows a flexible sines + transients + noise model for audio. In Proc. ICASSP, pages VI–3573–3576, Seattle, 1998.