E6820 SAPR - Dan Ellis L06 - Nonspeech & Music 2006-02-23 - 1
EE E6820: Speech & Audio Processing & Recognition
Lecture 6: Nonspeech and Music
1. Music and nonspeech
2. Environmental sounds
3. Music synthesis techniques
4. Sinewave synthesis
5. Music analysis
Dan Ellis <[email protected]>http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical EngineeringSpring 2006
Music & nonspeech
• What is ‘nonspeech’?
- according to research effort: a little music
- in the world: most everything
attributes?
[Figure: sound sources arranged by origin (natural vs. man-made) and information content (low vs. high): wind & water, animal sounds, contact/collision, machines & engines, speech, music]
Sound attributes
• Attributes suggest model parameters
• What do we notice about ‘general’ sound?
- psychophysics: pitch, loudness, ‘timbre’
- bright/dull; sharp/soft; grating/soothing
- sound is not ‘abstract’: tendency is to describe by source-events
• Ecological perspective
- what matters about sound is ‘what happened’
→ our percepts express this more-or-less directly
Motivations for modeling
• Describe/classify
- cast sound into a model because we want to use the resulting parameters
• Store/transmit
- model implicitly exploits limited structure of signal
• Resynthesize/modify
- model separates out interesting parameters
[Figure: Sound mapped by a Model into a parameter space]
Analysis and synthesis
• Analysis is the converse of synthesis:
• Can exist apart:
- analysis for classification
- synthesis of artificial sounds
• Often used together:
- encoding/decoding of compressed formats
- resynthesis based on analyses
- analysis-by-synthesis
[Figure: Sound ↔ Analysis / Synthesis ↔ Model / representation]
Outline
1. Music and nonspeech
2. Environmental sounds
   - Collision sounds
   - Sound textures
3. Music synthesis techniques
4. Sinewave synthesis
5. Music analysis
Environmental Sounds
• Where sound comes from: mechanical interactions
- contact / collisions
- rubbing / scraping
- ringing / vibrating
• Interest in environmental sounds
- carry information about events around us
  .. including indirect hints
- need to create them in virtual environments
  .. including soundtracks
• Approaches to synthesis
- recording / sampling
- synthesis algorithms
Collision sounds
• Factors influencing:
- colliding bodies: size, material, damping
- local properties at contact point (hardness)
- energy of collision
• Source-filter model
- “source” = excitation of collision event (energy, local properties at contact)
- “filter” = resonance and radiation of energy (body properties)
• Variety of strike/scraping sounds
- resonant freqs ~ size/shape
- damping ~ material
- HF content in excitation/strike ~ mallet, force
[Figure: spectrograms (time vs. frequency) of strike/scrape sounds, from Gaver 1993]
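The source-filter picture above can be sketched in a few lines of Python. All function names and numeric values here are illustrative (not from Gaver): a short noise burst stands in for the contact excitation, and a sum of exponentially damped sinusoids for the body resonances.

```python
import numpy as np

def collision_sound(freqs, decays, amps, dur=0.5, sr=16000, strike_ms=2.0):
    """Source-filter collision sketch: a brief noise burst ('source', the
    contact) excites exponentially damped resonant modes ('filter', the body)."""
    n = int(dur * sr)
    t = np.arange(n) / sr
    # body resonances: frequencies ~ size/shape, decay rates ~ material
    body = sum(a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
               for f, d, a in zip(freqs, decays, amps))
    # excitation: short burst; its length and energy set the HF content
    strike = np.random.randn(int(strike_ms * sr / 1000))
    y = np.convolve(strike, body)[:n]
    return y / np.max(np.abs(y))

# illustrative values for a small, hard object: high modes, moderate damping
y = collision_sound(freqs=[800.0, 1900.0, 3100.0],
                    decays=[6.0, 9.0, 14.0], amps=[1.0, 0.6, 0.3])
```

Varying `decays` (material) and `freqs` (size/shape) gives the family of strike sounds described above.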
Sound textures
• What do we hear in:
- a city street
- a symphony orchestra
• How do we distinguish:
- waterfall
- rainfall
- applause
- static
• Levels of ecological description...
[Figure: spectrograms (0–5000 Hz, 0–4 s) of applause and rain]
Sound texture modeling
(Athineos)
• Model broad spectral structure with LPC
- could just resynthesize with noise
• Model fine temporal structure in residual with linear prediction in time domain
- precise dual of LPC in frequency
- ‘poles’ model temporal events
• Allows modification / synthesis?
TD-LP: y[n] = Σi ai·y[n−i]        FD-LP: E[k] = Σi bi·E[k−i]

[Figure: block diagram: sound y[n] → TD-LP (per-frame spectral parameters; whitened residual e[n]) → DCT → E[k] → FD-LP (per-frame temporal envelope parameters; residual spectrum); example temporal envelopes (40 poles, 256 ms)]
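The FD-LP idea can be sketched compactly, assuming NumPy/SciPy (the helper names `lpc` and `fdlp_envelope` are mine, not from Athineos): ordinary LPC applied to the DCT of the signal yields an all-pole fit whose magnitude traces the temporal envelope.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """LPC coefficients via the autocorrelation method (Toeplitz solve)."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    return solve_toeplitz(r[:order], r[1:order + 1])

def fdlp_envelope(x, order=20):
    """Frequency-domain linear prediction: LPC applied to the DCT of the
    signal; the all-pole fit traces the *temporal* envelope, the exact
    dual of LPC's spectral envelope."""
    X = dct(x, type=2, norm='ortho')
    a = lpc(X, order)
    # evaluate 1/|A| on a grid that maps back onto the time axis
    n = len(x)
    w = np.exp(-1j * np.pi * np.arange(n) / n)
    A = 1 - sum(a[i - 1] * w ** i for i in range(1, order + 1))
    return 1.0 / np.abs(A)

env = fdlp_envelope(np.random.randn(512), order=20)
```

The ‘poles’ of this fit land on temporal events (bursts, onsets), just as ordinary LPC poles land on spectral resonances.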
Outline
1. Music and nonspeech
2. Environmental sounds
3. Music synthesis techniques
   - Framework
   - Historical development
4. Sinewave synthesis
5. Music analysis
elements?
Music synthesis techniques
• What is music?
- could be anything → flexible synthesis needed!
• Key elements of conventional music
- instruments → note-events (time, pitch, accent level) → melody, harmony, rhythm
- patterns of repetition & variation
• Synthesis framework:
- instruments: common framework for many notes
- score: sequence of (time, pitch, level) note events
The nature of musical instrument notes
• Characterized by instrument (register), note, loudness/emphasis, articulation...
distinguish how?
[Figure: spectrograms (0–4000 Hz, 0–4 s) of piano, violin, clarinet, and trumpet notes]
Development of music synthesis
• Goals of music synthesis:
- generate realistic / pleasant new notes
- control / explore timbre (quality)
• Earliest computer systems in 1960s (voice synthesis, algorithmic)
• Pure synthesis approaches:
- 1970s: Analog synths
- 1980s: FM (Stanford/Yamaha)
- 1990s: Physical modeling, hybrids
• Analysis-synthesis methods:
- sampling / wavetables
- sinusoid modeling
- harmonics + noise (+ transients)
others?
Analog synthesis
• The minimum to make an ‘interesting’ sound
• Elements:
- harmonics-rich oscillators
- time-varying filters
- time-varying envelope
- modulation: low frequency + envelope-based
• Result:
- time-varying spectrum, independent pitch
[Figure: block diagram: pitch & vibrato → oscillator → time-varying filter (cutoff freq) → gain envelope → sound; trigger fires the envelopes]
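The oscillator → filter → envelope chain can be sketched as a toy subtractive synth (all parameter values and the function name are illustrative, not from the slides):

```python
import numpy as np

def analog_note(pitch_hz, dur=1.0, sr=16000):
    """Subtractive-synth recipe: harmonics-rich oscillator ->
    time-varying low-pass filter -> amplitude envelope."""
    t = np.arange(int(dur * sr)) / sr
    # harmonics-rich source: sawtooth-like tone from summed harmonics
    saw = sum(np.sin(2 * np.pi * k * pitch_hz * t) / k for k in range(1, 30))
    # simple envelope: 20 ms linear attack, exponential decay
    env = np.minimum(t / 0.02, 1.0) * np.exp(-3 * t)
    # time-varying one-pole low-pass: cutoff falls as the note decays,
    # so the spectrum dulls over time (the classic filter sweep)
    y = np.zeros_like(saw)
    state = 0.0
    for i in range(len(saw)):
        cutoff = 200 + 4000 * env[i]                 # Hz
        alpha = 1 - np.exp(-2 * np.pi * cutoff / sr) # one-pole coefficient
        state += alpha * (saw[i] - state)
        y[i] = state * env[i]
    return y / np.max(np.abs(y))

note = analog_note(220.0)
```

The result is exactly the combination listed above: a time-varying spectrum under independent pitch control.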
FM synthesis
• Fast frequency modulation → sidebands:

cos(ωc·t + β·sin(ωm·t)) = Σn=−∞..∞ Jn(β)·cos((ωc + n·ωm)·t)   (strictly, phase modulation)

- a harmonic series if ωc = r·ωm
• Jn(β) is a Bessel function: Jn(β) ≈ 0 for β < n − 2
→ complex harmonic spectra by varying β
[Figure: Bessel functions J0–J4 vs. modulation index β; spectrogram of an FM tone with ωc = 2000 Hz, ωm = 200 Hz]
what use?
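The FM equation is a one-liner in practice. This sketch (function name mine) generates the figure's ωc = 2000 Hz, ωm = 200 Hz example, whose sidebands at 2000 + n·200 Hz form a harmonic series on 200 Hz since ωc = 10·ωm:

```python
import numpy as np

def fm_tone(fc, fm, beta, dur=1.0, sr=16000):
    """Simple FM (phase-modulation) operator:
    y[n] = cos(2*pi*fc*t + beta*sin(2*pi*fm*t))."""
    t = np.arange(int(dur * sr)) / sr
    return np.cos(2 * np.pi * fc * t + beta * np.sin(2 * np.pi * fm * t))

# sidebands at 2000 + n*200 Hz, with amplitudes J_n(beta)
y = fm_tone(2000, 200, beta=3.0)
```

Sweeping `beta` over the note redistributes energy among the Jn(β) sidebands, which is how a single FM operator produces a time-varying harmonic spectrum.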
Sampling synthesis
• Resynthesis from real notes → vary pitch, duration, level
• Pitch: stretch (resample) waveform
• Duration: loop a ‘sustain’ section
• Level: cross-fade different examples
- need to ‘line up’ source samples
[Figure: waveforms showing resampling for pitch (596 Hz vs. 894 Hz), matched splice points for a sustain loop, and soft/loud recordings cross-faded as a function of velocity]
good & bad?
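The pitch step can be sketched with simple linear-interpolation resampling (function name mine; the 596 → 894 Hz frequencies follow the figure's example):

```python
import numpy as np

def resample_pitch(x, ratio):
    """Shift pitch by resampling: read the waveform at a new step size via
    linear interpolation. ratio > 1 raises pitch (and shortens the note)
    by the same factor -- the basic trade-off of sampling synthesis."""
    idx = np.arange(0, len(x) - 1, ratio)
    lo = idx.astype(int)
    frac = idx - lo
    return x[lo] * (1 - frac) + x[lo + 1] * frac

# raise a 596 Hz sample to ~894 Hz: ratio = 894/596 = 1.5
sr = 16000
t = np.arange(sr) / sr
note = np.sin(2 * np.pi * 596 * t)
shifted = resample_pitch(note, 894 / 596)
```

Because pitch and duration are locked together here, samplers combine this with sustain-loops (duration) and velocity cross-fades (level), as listed above.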
Outline
1. Music and nonspeech
2. Environmental sounds
3. Music synthesis techniques
4. Sinewave synthesis (detail)
   - Sinewave modeling
   - Sines + residual ...
5. Music analysis
Sinewave synthesis
• If patterns of harmonics are what matter, why not generate them all explicitly:
- particularly powerful model for pitched signals
• Analysis (as with speech):
- find peaks in STFT |S[ω,n]| & track
- or track fundamental ω0 (harmonics / autocorrelation) & sample STFT at k·ω0
→ set of Ak[n] to duplicate tone:
• Synthesis via bank of oscillators
s[n] = Σk Ak[n]·cos(k·ω0[n]·n)

[Figure: spectrogram (0–8000 Hz) and per-frame magnitude spectra of a tone modeled as a set of harmonic sinusoids]
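A bank-of-oscillators resynthesis following the equation above might look like this (a sketch; names mine). Phase is accumulated from the instantaneous frequency so the output stays continuous when f0 moves:

```python
import numpy as np

def sinewave_synth(A, f0, sr=16000):
    """Oscillator-bank resynthesis: s[n] = sum_k A_k[n] * cos(phi_k[n]),
    with each harmonic's phase integrated from k * f0[n].
    A: (K, N) per-harmonic amplitude envelopes; f0: (N,) fundamental in Hz."""
    K, N = A.shape
    s = np.zeros(N)
    for k in range(1, K + 1):
        # integrate instantaneous frequency -> phase (avoids clicks)
        phase = 2 * np.pi * np.cumsum(k * f0) / sr
        s += A[k - 1] * np.cos(phase)
    return s

# e.g. three harmonics of a steady 440 Hz tone with decaying amplitudes
N = 8000
f0 = np.full(N, 440.0)
A = np.outer([1.0, 0.5, 0.25], np.exp(-np.linspace(0, 3, N)))
s = sinewave_synth(A, f0)
```

In practice the Ak[n] and f0[n] envelopes come from the analysis steps described next.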
Steps to sinewave modeling - 1
• The underlying STFT:
- What value for N (FFT length & window size)?
- What value for H (hop size: n0 = r·H, r = 0, 1, 2, ...)?
• STFT window length determines freq. resol’n:
• Choose N long enough to resolve harmonics
→ 2–3× the longest (lowest) fundamental period
- e.g. 30–60 ms = 480–960 samples @ 16 kHz
- choose H ≤ N/2
• N too long → lost time resolution
- limits sinusoid amplitude rate of change
X[k, n0] = Σn=0..N−1 x[n + n0]·w[n]·exp(−j·2πkn/N)

Xw(e^jω) = X(e^jω) * W(e^jω)
Steps to sinewave modeling - 2
• Choose candidate sinusoids at each time by picking peaks in each STFT frame:
• Quadratic fit for peak:
+ linear interpolation of unwrapped phase
[Figure: spectrogram with per-frame peaks; example frame spectra (level/dB, phase/rad vs. freq); quadratic fit y = ax(x − b) through bins near a peak, giving peak location x = b/2 and height −ab²/4]
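The quadratic fit can be sketched as the standard three-point parabolic interpolation on log magnitudes (helper name mine; this is a common refinement, equivalent in spirit to the y = ax(x − b) fit above):

```python
import numpy as np

def quadratic_peak(mags, i):
    """Refine a spectral peak at bin i by fitting a parabola through the
    log magnitudes at bins (i-1, i, i+1); returns (offset_in_bins, height)."""
    a, b, c = np.log(mags[i - 1]), np.log(mags[i]), np.log(mags[i + 1])
    p = 0.5 * (a - c) / (a - 2 * b + c)   # offset in bins, in (-0.5, 0.5)
    height = b - 0.25 * (a - c) * p
    return p, np.exp(height)

# a sinusoid deliberately placed between bins: 100.3 cycles over N samples
N = 1024
x = np.cos(2 * np.pi * 100.3 * np.arange(N) / N) * np.hanning(N)
mags = np.abs(np.fft.rfft(x))
i = int(np.argmax(mags))
p, h = quadratic_peak(mags, i)
est_freq_bins = i + p
```

Without this refinement, frequency estimates are quantized to the bin spacing sr/N, which is far too coarse for tracking.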
Steps to sinewave modeling - 3
• Which peaks to pick? Want ‘true’ sinusoids, not noise fluctuations
- ‘prominence’ threshold above smoothed spectrum
• Sinusoids exhibit stability...
- of amplitude in time
- of phase derivative in time
→ compare with adjacent time frames to test?
[Figure: frame spectrum (level/dB vs. freq) with a smoothed-spectrum prominence threshold]
Steps to sinewave modeling - 4
• ‘Grow’ tracks by appending newly-found peaks to existing tracks:
- ambiguous assignments possible
• Unclaimed new peak
- ‘birth’ of new track
- backtrack to find earliest trace?
• No continuation peak for existing track
- ‘death’ of track
- or: reduce peak threshold for hysteresis
[Figure: time–frequency plot of existing tracks extended by new peaks, showing track births and deaths]
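A greedy version of the track-growing step can be sketched as follows (a simplification: real systems resolve ambiguous assignments more carefully, e.g. by global matching):

```python
def grow_tracks(tracks, new_peaks, max_jump=50.0):
    """Greedy continuation: append each new peak (freq in Hz) to the
    nearest live track within max_jump; unmatched peaks are 'born' as new
    tracks, unmatched tracks 'die'. tracks: list of lists of frequencies."""
    claimed = set()
    for t in list(tracks):
        # find the nearest unclaimed peak to this track's last frequency
        best = None
        for j, f in enumerate(new_peaks):
            if j in claimed:
                continue
            if best is None or abs(f - t[-1]) < abs(new_peaks[best] - t[-1]):
                best = j
        if best is not None and abs(new_peaks[best] - t[-1]) <= max_jump:
            t.append(new_peaks[best])
            claimed.add(best)
        # else: no continuation this frame -> the track 'dies'
    for j, f in enumerate(new_peaks):
        if j not in claimed:
            tracks.append([f])   # 'birth' of a new track
    return tracks

tracks = [[440.0, 442.0], [880.0]]
tracks = grow_tracks(tracks, [445.0, 1320.0])
```

Here the first track continues to 445 Hz, the 880 Hz track dies (no peak within 50 Hz), and the 1320 Hz peak is born as a new track.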
Resynthesis of sinewave models
• After analysis, each track defines contours in frequency and amplitude, fk[n], Ak[n] (+ phase?)
- use these to drive a bank of sinewave oscillators & sum up
• ‘Regularize’ to exactly harmonic fk[n] = k·f0[n]
[Figure: extracted fk[n] and Ak[n] contours for one track; oscillator output Ak[n]·cos(2πfk[n]·t); full set of tracks (0–6000 Hz) and the same tracks regularized to harmonics of f0]
what to do?
Modification in sinewave resynthesis
• Change duration by warping timebase
- may want to keep onset unwarped
• Change pitch by scaling frequencies
- either stretching or resampling envelope
• Change timbre by interpolating params
[Figure: spectrogram (0–5000 Hz) of a modified resynthesis; spectral envelopes (level/dB vs. freq) before and after scaling]
Sinusoids + residual
• Only ‘prominent peaks’ became tracks
- remainder of spectral energy was noisy?
→ model residual energy with noise
• How to obtain a ‘non-harmonic’ spectrum?
- zero-out spectrum near extracted peaks?
- or: resynthesize (exactly) & subtract waveforms
  .. must preserve phase!
• Can model residual signal with LPC
→ flexible representation of noisy residual
es[n] = s[n] − Σk Ak[n]·cos(2πn·fk[n])
[Figure: original spectrum, sinusoid model, and LPC fit to the residual (mag/dB vs. freq)]
Sinusoids + noise + transients
• Sound represented as sinusoids and noise:
Parameters are {Ak[n], fk[n]}, hn[n]
• Separate out abrupt transients in residual?
- more specific → more flexible
s[n] = Σk Ak[n]·cos(2πn·fk[n]) + hn[n] * b[n]
[Figure: spectrograms (0–8000 Hz) of the sinusoid part {Ak[n], fk[n]} and the residual es[n] with its noise model hn[n]]

es[n] = Σk tk[n] + hn[n] * b′[n]
Outline
1. Music and nonspeech
2. Environmental sounds
3. Music synthesis techniques
4. Sinewave synthesis
5. Music analysis
   - Instrument identification
   - Pitch tracking
Music analysis
• What might we want to get out of music?
• Instrument identification
- different levels of specificity
- ‘registers’ within instruments
• Score recovery
- transcribe the note sequence
- extract the ‘performance’
• Ensemble performance
- ‘gestalts’: chords, tone colors
• Broader timescales
- phrasing & musical structure
- artist / genre clustering and classification
Instrument identification
• Research looks for perceptual ‘timbre space’
• Cues to instrument identification
- onset (rise time), sustain (brightness)
• Hierarchy of instrument families
- strings / reeds / brass
- optimize features at each level
[Figure: instruments arranged in a timbre space with axes bright/dull, low/high spectral flux, and low/high attack]
procedure?
Pitch tracking
• Fundamental frequency (→ pitch) is a key attribute of musical sounds
→ pitch tracking as a key technology
• Pitch tracking for speech
- voice pitch & spectrum highly dynamic
- speech is voiced and unvoiced
ground truth?
• Applications
- voice coders (excitation description)
- harmonic modeling
Pitch tracking for music
• Pitch in music
- pitch is more stable (although vibrato)
- but: multiple pitches
• Applications
- harmonic modeling
- music transcription (→ storage, resynthesis)
- source separation
• Approaches: “place” & “time”
[Figure: spectrogram (0–4000 Hz, 0–5 s) of polyphonic music: which pitches?]
Meddis & Hewitt pitch model
• Autocorrelation (time) based pitch extraction
- fundamental period → peak(s) in autocorrelation
• Compute separately in each frequency band& ‘summarize’ across (perceptual) channels
x(t) ≈ x(t + T)  →  rxx(T) = ∫ x(t)·x(t + T) dt → max
[Figure: waveform x[n] and its autocorrelation rxx[l]; autocorrelogram across cochlear channels (CF 80–4000 Hz) vs. lag (0–12.5 ms); block diagram: sound → bandpass filters → rectification & low-pass filter → per-channel periodicity detection → cross-channel sum → summary ACG]
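The core autocorrelation step can be sketched without the per-channel filterbank (a single-channel simplification of the idea, not the Meddis & Hewitt implementation):

```python
import numpy as np

def autocorr_pitch(x, sr, fmin=50.0, fmax=500.0):
    """Pitch by autocorrelation: the fundamental period T maximizes
    r_xx(T) = sum_n x[n]*x[n+T] over the allowed lag range."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(sr // 4) / sr
# a 200 Hz tone built from several harmonics with a weak fundamental --
# autocorrelation still finds the 80-sample period
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 8))
f0 = autocorr_pitch(x, sr)
```

The full model runs this periodicity detection separately in each cochlear band and then sums across channels, which is what makes it robust to missing fundamentals and band-limited interference.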
Tolonen & Karjalainen simplification
• Multiple frequency channels can have different dominant pitches ...
• But equalizing (flattening) the spectrum works:
→ Summary AC as a function of time:
- ‘Enhancement’ = cancel subharmonics
[Figure: block diagram: sound → pre-whitening → split at 1 kHz (lowpass branch; highpass branch then rectify & low-pass) → periodicity detection in each branch → sum → SACF → enhancement → ESACF]
[Figure: periodogram (50–1000 Hz) of a male/female voice mix; summary autocorrelation at t = 0.775 s with peaks at 200 Hz (0.005 s) and 125 Hz (0.008 s)]
lag vs. freq?
Post-processing of pitch tracks
• Remove outliers with median filtering
• Octave errors are common:
- if x(t) ≈ x(t + T) then x(t) ≈ x(t + 2T) etc.
→ dynamic programming/HMM
• Validity
- “is there a pitch at this time?”
- voiced/unvoiced decision for speech
• Event detection
- when does a pitch slide indicate a new note?
[Figure: pitch track cleaned with a 5-point median filter]
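The 5-point median filter of the figure is straightforward to sketch (helper name mine):

```python
import numpy as np

def median_smooth(pitch, width=5):
    """Remove single-frame outliers (e.g. octave glitches) from a pitch
    track with a running median; edges are handled by edge-padding."""
    half = width // 2
    padded = np.pad(pitch, half, mode='edge')
    return np.array([np.median(padded[i:i + width])
                     for i in range(len(pitch))])

# a 200 Hz track with one octave error (400) and one sub-octave dip (101)
track = np.array([200, 201, 400, 202, 203, 204, 101, 205], dtype=float)
clean = median_smooth(track)
```

The median rejects isolated outliers without smearing genuine pitch changes, which is why it is preferred here over linear smoothing; longer error runs need the dynamic-programming/HMM treatment mentioned above.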
Summary
• ‘Nonspeech audio’
- i.e. sound in general
- characteristics: ecological
• Music synthesis
- control of pitch, duration, loudness, articulation
- evolution of techniques
- sinusoids + noise + transients
• Music analysis
- different aspects: instruments, pitches, performance
and beyond?