E6820 SAPR - Dan Ellis, L01, 2002-01-28
EE E6820: Speech & Audio Processing & Recognition
Lecture 1: Introduction & DSP
1. Sound and information
2. Course structure
3. DSP review: Timescale modification
Dan Ellis <[email protected]>
http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering
Spring 2002

Sound and information
• Sound is air pressure variation
[Figure: Mechanical vibration → Pressure waves in air → Motion of sensor → Time-varying voltage v(t)]
• Transducers convert air pressure ↔ voltage

What use is sound?
• Footsteps examples:
[Figure: two footstep waveforms; x-axis: time / s (0 to 5), y-axis: -0.5 to 0.5]
• Hearing confers an evolutionary advantage
- useful information, complements vision
- ...at a distance, in the dark, around corners
- listeners are highly adapted to ‘natural sounds’ (including speech)

The scope of audio processing
[Figure: diagram centered on "AUDIO PROCESSING", with labels: natural, man-made, simple, abstract]

The acoustic communication chain
• Sound is an information bearer
• Received sound reflects source(s) plus effect of environment (channel)
[Figure: chain of message → signal → channel → receiver → decoder, annotated with synthesis, audio processing, and recognition]

Levels of abstraction
• Much processing concerns shifting between levels of abstraction
• Different representations serve different tasks
- separating aspects, making things explicit ...
[Figure: levels running from concrete to abstract: sound p(t), representation (e.g. t-f energy), ‘information’; arrows labeled Analysis and Synthesis]

Course structure
• Goals:
- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies (esp. ASR)
• Course structure:
- weekly assignments (25%)
- midterm exam (25%)
- final project (50%)
• Text:
Speech and Audio Signal Processing
Ben Gold & Nelson Morgan, Wiley, 2000
ISBN: 0-471-35154-7

Web-based
• Course website: http://www.ee.columbia.edu/~dpwe/e6820/
for lecture notes, problem sets, examples, ...
• + student web pages for homework etc.

Course outline
Fundamentals
- L1: DSP
- L2: Acoustics
- L3: Pattern recognition
- L4: Auditory perception
Audio processing
- L5: Signal models
- L6: Music analysis/synthesis
- L7: Audio compression
- L8: Spatial sound & rendering
Speech recognition
- L9: Speech features
- L10: Sequence recognition
- L11: Recognizer training
- L12: Systems & applications

Weekly Assignments
• Research papers
- journal & conference publications
- summarize & discuss in class
- written summaries on web page
• Practical experiments
- MATLAB-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for project
• Book sections
+ questions from book

Final Project
• Most significant part of course (50% of grade)
• Oral proposals mid-semester; presentations in final class + website
• Scope
- practical (Matlab recommended)
- identify a problem; try some solutions
- evaluation
• Topic
- few restrictions within world of audio
- investigate other resources
- develop in discussion with me

Examples of past projects
• Detecting sound events in basketball video
- classifying ‘cheers’ in sport video soundtrack
• The S-Matrix: A novel approach to music segment detection
- finding breaks between verse, chorus etc.

DSP review
• Digital signals: $x_d[n] = Q(x_c(nT))$
- sampling interval $T$, sampling frequency $\Omega_s = 2\pi / T$
- quantizer $Q(y) = \varepsilon \cdot \lfloor y / \varepsilon \rfloor$
[Figure: sampled and quantized waveform; x-axis: time, quantization step ε]
• Discrete-time sampling limits bandwidth
• Discrete-level quantization limits dynamic range
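
The sampling and quantization steps can be sketched in a few lines of Python. This is only an illustration: it assumes the quantizer rounds down to a multiple of ε (a floor), and the function names, the 1 kHz test tone, the 8 kHz sampling rate and the step size are all made up for the example.

```python
import math

def sample(xc, T, num_samples):
    """Return xd[n] = xc(n*T) for n = 0 .. num_samples-1."""
    return [xc(n * T) for n in range(num_samples)]

def quantize(y, eps):
    """Q(y) = eps * floor(y / eps): snap y down to a multiple of eps (assumed floor)."""
    return eps * math.floor(y / eps)

def xc(t):
    """Hypothetical continuous-time signal: a 1 kHz tone of amplitude 0.9."""
    return 0.9 * math.sin(2 * math.pi * 1000.0 * t)

fs = 8000.0        # assumed sampling rate in Hz
T = 1.0 / fs       # sampling interval
eps = 2.0 ** -3    # assumed quantizer step

raw = sample(xc, T, 8)
xd = [quantize(v, eps) for v in raw]
```

With these values the sample at the peak of the tone (0.9) becomes 0.875, and every quantized value sits at most one step ε below the unquantized sample. A smaller ε gives a wider dynamic range.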

The speech signal: time domain
• Speech is a sequence of different sound types:
- Vowel: periodic (“has”)
- Fricative: aperiodic (“watch”)
- Glide: smooth transition (“watch”)
- Stop burst: transient (“dime”)
[Figure: speech waveform from 1.4 to 2.6 s with word labels (has, a, watch, thin, as, a, dime) and four zoomed panels, one per sound type; x-axis: time/s]

Timescale modification (TSM)
• Can we modify a sound to make it ‘slower’?
i.e. speech pronounced more slowly
- e.g. to help comprehension, analysis
- or more quickly for ‘speed listening’?
• Why not just slow it down?
- $x_s(t) = x_o(t / r)$, r = slowdown factor
- equiv. to playback at a different sampling rate
[Figure: waveforms "Original" and "2x slower", 2.35 to 2.6 s; x-axis: time/s]
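
To see why simply slowing the signal down is not enough, here is a small Python sketch of x_s(t) = x_o(t/r) applied to a pure tone. The 200 Hz tone, the helper names and the cycle-counting method are assumptions chosen for illustration. The slowed tone completes about half as many cycles per second, so its pitch drops along with its speed.

```python
import math

def slow_down(xo, r):
    """Return xs with xs(t) = xo(t / r); r is the slowdown factor."""
    return lambda t: xo(t / r)

def count_upward_zero_crossings(x, duration, fs=8000):
    """Count negative-to-nonnegative transitions of x over `duration` seconds."""
    n = int(duration * fs)
    samples = [x((i + 0.5) / fs) for i in range(n)]
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)

f0 = 200.0  # hypothetical tone frequency in Hz

def xo(t):
    return math.sin(2 * math.pi * f0 * t)

xs = slow_down(xo, 2.0)  # 2x slower

cycles_original = count_upward_zero_crossings(xo, 1.0)
cycles_slowed = count_upward_zero_crossings(xs, 1.0)
```

Here `cycles_slowed` comes out at roughly half of `cycles_original`: the sound is longer, but it is also an octave lower.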

Time-domain TSM
• Problem: want to preserve local time structure but alter global time structure
• Repeat segments
- but: artefacts from abrupt edges
• Cross-fade & overlap
$$y^m[mL + n] = y^{m-1}[mL + n] + w[n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n\right]$$
[Figure: original waveform (2.35 to 2.6 s) cut into segments 1 to 6, and the 2x slower output (4.7 to 4.95 s) built from the repeated segments 1 1 2 2 3 3 4 4 5 5 6; x-axis: time / s]
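
The cross-fade and overlap recursion can be sketched as a plain overlap-add loop: output frame m is placed at mL and is read from the input at floor(m/r)·L. The code below is a sketch under assumptions, not the lecture's implementation: it uses a Hann window of length 2L so that overlapping windows sum to one, and the function names are invented for the example.

```python
import math

def hann(N):
    """Periodic Hann window; copies shifted by N/2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def ola_tsm(x, r, L):
    """Overlap-add TSM: y[mL + n] += w[n] * x[floor(m/r)*L + n]."""
    N = 2 * L                           # assumed window length: two hops
    w = hann(N)
    frames_in = (len(x) - N) // L + 1   # input frames that fit inside x
    src = []                            # input frame index for each output frame
    m = 0
    while math.floor(m / r) < frames_in:
        src.append(math.floor(m / r))
        m += 1
    if not src:
        return []
    y = [0.0] * ((len(src) - 1) * L + N)
    for m, i in enumerate(src):
        for n in range(N):
            y[m * L + n] += w[n] * x[i * L + n]
    return y

# A constant input should come out constant (away from the edges), about r times longer.
x = [1.0] * 1000
y = ola_tsm(x, r=2.0, L=50)
```

With r = 2 each input frame is used twice (1 1 2 2 3 3 ...), matching the segment pattern in the figure. Nothing here aligns the waveforms in the overlap region, which is the weakness SOLA addresses.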

Synchronous Overlap-Add (SOLA)
• Idea: Allow some leeway in placing window to optimize alignment of waveforms
[Figure: two overlapping waveform segments, 1 and 2; $K_m$ maximizes alignment of 1 and 2]
• Hence,
$$y^m[mL + n] = y^{m-1}[mL + n] + w[n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K_m\right]$$
where $K_m$ chosen by cross-correlation:
$$K_m = \underset{0 \le K \le K_U}{\operatorname{argmax}} \; \frac{\sum_{n=0}^{N_{ov}} y^{m-1}[mL + n] \cdot x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K\right]}{\sqrt{\sum \left(y^{m-1}[mL + n]\right)^2 \, \sum \left(x\left[\left\lfloor \frac{m}{r} \right\rfloor L + n + K\right]\right)^2}}$$
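
The offset search can be sketched as a normalized cross-correlation between the tail of the output so far and candidate input positions. This is a hedged illustration: the function name, its arguments, and the square-root normalization are assumptions, and the test signal (a sinusoid with a 40-sample period) is made up.

```python
import math

def best_offset(y_tail, x, start, k_max):
    """Return K in 0..k_max maximizing the normalized correlation
    between y_tail and x[start + K : start + K + len(y_tail)]."""
    n_ov = len(y_tail)
    e_y = sum(v * v for v in y_tail)
    best_k, best_score = 0, -math.inf
    for k in range(k_max + 1):
        seg = x[start + k : start + k + n_ov]
        if len(seg) < n_ov:      # ran off the end of the input
            break
        num = sum(a * b for a, b in zip(y_tail, seg))
        den = math.sqrt(e_y * sum(b * b for b in seg))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# The overlap region is a copy of the input starting 7 samples in,
# so the search should recover K = 7.
x = [math.sin(2 * math.pi * n / 40) for n in range(200)]
y_tail = x[7:37]
k_found = best_offset(y_tail, x, start=0, k_max=20)
```

Keeping k_max below one period avoids ties between offsets that differ by a whole cycle.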

The Fourier domain
Fourier Series (periodic continuous x), with $\Omega_0 = 2\pi / T$:
$$x(t) = \sum_k c_k \cdot e^{jk\Omega_0 t}$$
$$c_k = \frac{1}{T} \int_{-T/2}^{T/2} x(t) \cdot e^{-jk\Omega_0 t} \, dt$$
Fourier Transform (aperiodic continuous x):
$$x(t) = \frac{1}{2\pi} \int X(j\Omega) \cdot e^{j\Omega t} \, d\Omega$$
$$X(j\Omega) = \int x(t) \cdot e^{-j\Omega t} \, dt$$

Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x):
$$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega}) \cdot e^{j\omega n} \, d\omega$$
$$X(e^{j\omega}) = \sum_n x[n] \cdot e^{-j\omega n}$$
Discrete Fourier Transform (N-point x):
$$x[n] = \frac{1}{N} \sum_k X[k] \cdot e^{j 2\pi k n / N}$$
$$X[k] = \sum_n x[n] \cdot e^{-j 2\pi k n / N}$$
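
The N-point DFT pair can be written directly from its definition. The sketch below is a slow O(N²) version for illustration only (practical code would use an FFT); it assumes the convention with no scaling on the forward transform and a factor 1/N on the inverse, and the function names are chosen for the example.

```python
import cmath
import math

def dft(x):
    """X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """x[n] = (1/N) * sum_k X[k] * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

# A cosine with exactly 2 cycles in 8 samples puts its energy in bins 2 and 6.
x = [math.cos(2 * math.pi * 2 * n / 8) for n in range(8)]
X = dft(x)
x_back = idft(X)
```

The two non-zero bins (k = 2 and k = N - 2 = 6) are the positive and negative frequency halves of the real cosine, and `x_back` recovers `x` up to rounding.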

Sampling and aliasing
• Discrete-time signals equal the continuous time signal at discrete sampling instants: $x_d[n] = x_c(nT)$
• Infrequent sampling cannot represent rapid fluctuations
• Nyquist limit (Fsamp/2) emerges from periodic spectrum
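
A quick numerical check of aliasing: a cosine above the Nyquist limit produces exactly the same samples as one mirrored below it. The 8 kHz sampling rate and the 1 kHz / 7 kHz pair are assumed values for illustration.

```python
import math

def sample_cos(f, fs, num_samples):
    """Samples xd[n] = cos(2*pi*f*n/fs) of a cosine at f Hz, sampled at fs Hz."""
    return [math.cos(2 * math.pi * f * n / fs) for n in range(num_samples)]

fs = 8000.0                          # assumed sampling rate; Nyquist limit is 4000 Hz
low = sample_cos(1000.0, fs, 16)     # below the Nyquist limit
high = sample_cos(7000.0, fs, 16)    # fs - 1000 Hz, above the Nyquist limit

max_difference = max(abs(a - b) for a, b in zip(low, high))
```

`max_difference` is zero up to rounding error, so once sampled the 7 kHz tone cannot be told apart from the 1 kHz tone.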

Speech sounds in the Fourier domain
- dB = 20·log10(amplitude) = 10·log10(power)
• Voiced spectrum has pitch + formants
[Figure: four rows, each pairing a time-domain waveform (x-axis: time / s) with its frequency-domain spectrum (x-axis: freq / Hz, 0 to 4000; y-axis: energy / dB): Vowel: periodic (“has”); Fricative: aperiodic (“watch”); Glide: transition (“watch”); Stop: transient (“dime”)]
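
The two dB formulas agree because power is proportional to amplitude squared. A minimal check in Python, with function names chosen for the example:

```python
import math

def amplitude_to_db(amplitude):
    """dB = 20 * log10(amplitude)."""
    return 20.0 * math.log10(amplitude)

def power_to_db(power):
    """dB = 10 * log10(power)."""
    return 10.0 * math.log10(power)

a = 0.1
db_from_amplitude = amplitude_to_db(a)      # about -20 dB
db_from_power = power_to_db(a * a)          # same value, computed from power
```

An amplitude ratio of 0.1 and a power ratio of 0.01 both correspond to about -20 dB.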

Short-time Fourier Transform
• Want to localize energy in both time and freq
→ break sound into short-time pieces, calculate DFT of each one
[Figure: waveform (2.35 to 2.6 s) with short-time windows at hops L, 2L, 3L, each passed through a DFT to give one column of a time-frequency plot; axes: time / s, freq / Hz (0 to 4000)]
• Mathematically, with $w[n]$ nonzero only for $0 \le n < N$:
$$X[k, m] = \sum_{n=mL}^{mL+N-1} x[n] \cdot w[n - mL] \cdot \exp\left(-j \frac{2\pi k (n - mL)}{N}\right)$$
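
The STFT is a windowed DFT taken every L samples. The sketch below computes it straight from that description in pure Python; it is slow and purely illustrative. The Hann window, the hop of half a window and the test tone are assumptions, and the result is indexed as `frames[m][k]`, corresponding to X[k, m].

```python
import cmath
import math

def hann(N):
    """Periodic Hann window of length N."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def stft(x, w, L):
    """Return frames[m][k]: N-point DFT of x[mL : mL+N] times the window w."""
    N = len(w)
    frames = []
    for start in range(0, len(x) - N + 1, L):
        frame = [x[start + n] * w[n] for n in range(N)]
        frames.append([sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                           for n in range(N))
                       for k in range(N)])
    return frames

# A tone with 4 cycles per 32-sample window should peak at bin k = 4 in every frame.
N, L = 32, 16
x = [math.cos(2 * math.pi * 4 * n / N) for n in range(128)]
X = stft(x, hann(N), L)
peak_bins = [max(range(N // 2 + 1), key=lambda k: abs(frame[k])) for frame in X]
```

Each entry of `peak_bins` is the strongest bin in one time slice; plotting the magnitudes of all bins against m gives the spectrogram.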

The Spectrogram
• Plot STFT $|X[k, m]|$ as a grayscale image:
[Figure: waveform and spectrograms; axes: time / s, freq / Hz (0 to 4000); grayscale: intensity / dB (-50 to 10)]

Time-frequency tradeoff
• Longer window w[n] gains frequency resolution at cost of time resolution
[Figure: waveform and two spectrograms of the same speech, 1.4 to 2.6 s; axes: time / s, freq / Hz (0 to 4000). Window = 256 pt: “Narrowband”; Window = 32 pt: “Wideband”]
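
The tradeoff can be put into numbers for the two window lengths in the figure. The sketch assumes an 8 kHz sampling rate (suggested by the 4000 Hz top of the frequency axis, but not stated on the slide) and uses DFT bin spacing as a rough stand-in for frequency resolution; the function name is invented for the example.

```python
def window_resolution(num_points, fs):
    """Return (window duration in seconds, DFT bin spacing in Hz)."""
    duration_s = num_points / fs
    bin_spacing_hz = fs / num_points
    return duration_s, bin_spacing_hz

FS = 8000.0  # assumed sampling rate in Hz

narrowband = window_resolution(256, FS)  # long window: 32 ms, 31.25 Hz bins
wideband = window_resolution(32, FS)     # short window: 4 ms, 250 Hz bins
```

The product of duration and bin spacing is 1 for any window length, which is the tradeoff in its simplest form: finer frequency bins require a proportionally longer window.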

Speech sounds on the Spectrogram
• Most popular speech visualization
• Wideband (short window) better than narrowband (long window) to see formants
[Figure: spectrogram, 1.4 to 2.6 s, with word labels (has, a, watch, thin, as, a, dime); axes: time/s, freq / Hz (0 to 4000); marked regions: Vowel: periodic (“has”); Fricative: aperiodic (“watch”); Glide: transition (“watch”); Stop: transient (“dime”)]

TSM with the Spectrogram
• Just stretch out the spectrogram?
[Figure: original and time-stretched spectrograms; axes: Time, Frequency (0 to 4000)]
- how to resynthesize? spectrogram is only $|Y[k, m]|$

The Phase Vocoder
• Timescale modification in the STFT domain
• Magnitude from ‘stretched’ spectrogram: $|Y[k, m]| = \left|X\left[k, \frac{m}{r}\right]\right|$
- e.g. by linear interpolation
• But preserve phase increment between slices: $\dot{\theta}_Y[k, m] = \dot{\theta}_X\left[k, \frac{m}{r}\right]$
- e.g. by discrete differentiator
• Does right thing for single sinusoid
- keeps overlapped parts of sinusoid aligned
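
The frame-level part of the phase vocoder can be sketched as follows: magnitudes are linearly interpolated at the fractional frame position m/r, while the phase of each bin is accumulated using the phase step between adjacent input frames. This is a simplified illustration, not a full implementation: it assumes the STFT is given as `X[m][k]`, takes a first difference as the "discrete differentiator", and leaves out resynthesis (inverse DFT and overlap-add). All names are invented for the example.

```python
import cmath

def pvoc_stretch(X, r):
    """Stretch STFT frames X[m][k] by factor r; returns the new frames Y[m][k]."""
    num_in = len(X)
    num_bins = len(X[0])
    phase = [cmath.phase(c) for c in X[0]]   # start from the input's first frame
    Y = []
    m = 0
    while m / r < num_in - 1:
        t = m / r                 # fractional position in the input frames
        i = int(t)
        frac = t - i
        out = []
        for k in range(num_bins):
            # magnitude: linear interpolation between neighbouring input frames
            mag = (1 - frac) * abs(X[i][k]) + frac * abs(X[i + 1][k])
            out.append(cmath.rect(mag, phase[k]))
            # phase: advance by the input's frame-to-frame phase increment
            phase[k] += cmath.phase(X[i + 1][k]) - cmath.phase(X[i][k])
        Y.append(out)
        m += 1
    return Y

# One bin holding a steady sinusoid: magnitude 2, phase advancing 0.3 rad per frame.
X = [[cmath.rect(2.0, 0.3 * m)] for m in range(5)]
Y = pvoc_stretch(X, r=2.0)
```

For this single-sinusoid input, `Y` has about twice as many frames, each still of magnitude 2 and still advancing 0.3 rad per frame, so overlapped copies of the sinusoid stay aligned when the frames are resynthesized at the original hop.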

General issues in TSM
• Time window
- stretching a narrowband spectrogram
• Malleability of different sounds
- vowels stretch well, stops lose nature
• Not a well-formed problem?
- want to alter time without frequency ... but time and frequency are not separate!
- ‘satisfying’ result is a subjective judgement
→ solution depends on auditory perception...

Summary
• Information in sound
- lots of it, multiple levels of abstraction
• Course overview
- survey of audio processing topics
- practicals, readings, project
• DSP review
- digital signals, time domain
- Fourier domain, STFT
• Timescale modification
- properties of the speech signal
- time-domain
- phase vocoder