GCT634: Musical Applications of Machine Learning
Rhythm Transcription & Dynamic Programming
Juhan Nam, Graduate School of Culture Technology, KAIST
Outline
• Overview of Automatic Music Transcription (AMT)
  - Types of AMT tasks
• Rhythm Transcription
  - Introduction
  - Onset detection
  - Tempo estimation
• Dynamic Programming
  - Beat tracking
Overview of Automatic Music Transcription (AMT)
• Predicting musical score information from audio
  - The primary score information is notes, but they are arranged based on rhythm, harmony, and structure
  - Equivalent to automatic speech recognition (ASR) for speech signals
[Diagram: audio → model → beat, tempo, onsets, key, chords, structure]
Types of AMT Tasks
• Rhythm transcription
  - Onset detection
  - Tempo estimation
  - Beat tracking
• Tonal analysis
  - Key estimation
  - Chord recognition
• Timbre analysis
  - Instrument identification
• Note transcription
  - Monophonic notes
  - Polyphonic notes
  - Expression detection (e.g., vibrato, pedal)
• Structure analysis
  - Musical structure
  - Musical boundary / repetition detection
  - Highlight detection
We will mainly focus on these topics!
Overview of AMT Systems
• Acoustic model
  - Estimates the target information from the input audio (usually a short segment)
• Musical knowledge
  - Music theory (e.g., rhythm, harmony) and performance practice (e.g., playability)
• Prior/lexical model
  - Statistical distribution of score-level music information (e.g., chord progressions)
[Diagram: audio-level input → acoustic model → transcription model → score-level output (beat/tempo, key/chords, notes), informed by musical knowledge and a prior/lexical model]
Introduction to Rhythm
• Rhythm
  - A strong, regular, and repeated pattern of sound
  - Distinguishes music from speech
• The most primitive and foundational element of music
  - Melody, harmony, and other musical elements are arranged on the basis of rhythm
• Humans and rhythm
  - Humans have an innate ability to perceive rhythm: heartbeat, walking
  - Associated with motor control: dance, labor songs
Introduction to Rhythm
• Hierarchical structure of rhythm
  - Beat (tactus): the most prominent level; the foot-tapping rate
  - Division (tatum): the temporal atom, e.g., eighth or sixteenth notes
  - Measure (bar): the unit of rhythmic patterns (and also of harmonic changes)
• Notations
  - Tempo: beats per minute, e.g., 90 BPM
  - Time signature: e.g., 4/4, 3/4, 6/8
[Wikipedia]
Human Perception of Tempo
• McKinney and Moelants (2006)
  - Collected tapping data from 40 human subjects
  - Observed an initial synchronization delay followed by anticipation (via tempo estimation)
  - Found ambiguity in tempo: the beat or its division?
[D. Ellis’ e4896 slides]
Overview of Rhythm Transcription Systems
• Consists of several cascaded tasks that detect moments of musical stress (accents) and their regularity
[Pipeline: Onset Detection → Tempo Estimation → Beat Tracking, guided by musical knowledge]
Onset Detection
• Identify the starting times of musical events
  - Notes, drum sounds
• Types of onsets
  - Hard onsets: percussive sounds
  - Soft onsets: source-driven sounds (e.g., singing voice, woodwinds, bowed strings)
[M.Muller]
Example: Onset Detection
[Waveform plot (amplitude vs. time, 0-6 sec): where are the onsets?]
"Eat (꺼내먹어요)" by Zion.T
Onset Detection Systems
• Onset detection function (ODF)
  - An instantaneous measure of temporal change, often called a "novelty" function
  - Types: time-domain energy, spectral or sub-band energy, phase difference
• Decision algorithm
  - Rule-based approach
  - Learning-based approach
[Diagram: audio representations → onset detection function (feature extraction) → decision algorithm (classifier)]
Onset Detection Function (ODF)
• Types of ODFs
  - Time-domain energy
  - Spectral or sub-band energy
  - Phase difference
Time-Domain Onset Detection
• Local energy
  - Usually high at onsets
  - Effective for percussive sounds
• Various versions (a code sketch follows the plots below)
  - Frame-level energy:

$$\mathrm{ODF}(n) = E(n) = \sum_{m=-M}^{M} |x(n+m)|^2 \, w(m)$$

  - Half-wave rectified difference:

$$\mathrm{ODF}(n) = H\big(E(n+1) - E(n)\big), \qquad H(r) = \frac{r + |r|}{2} = \begin{cases} r, & r \ge 0 \\ 0, & r < 0 \end{cases}$$
[Plots: waveform; frame-level energy ODF; half-wave rectified ODF (0-6 sec)]
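A minimal numpy sketch of the two time-domain ODFs above; the function name and the frame/hop sizes are illustrative choices, not from the slides:

```python
import numpy as np

def energy_odf(x, frame_size=1024, hop_size=512):
    """Frame-level energy E(n), then a half-wave rectified difference."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(x) - frame_size) // hop_size
    energy = np.empty(n_frames)
    for n in range(n_frames):
        frame = x[n * hop_size : n * hop_size + frame_size]
        energy[n] = np.sum((frame ** 2) * window)  # E(n) = sum |x(n+m)|^2 w(m)
    diff = np.diff(energy)                         # E(n+1) - E(n)
    return np.maximum(diff, 0.0)                   # H(r) = (r + |r|) / 2
```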
Spectral-Based Onset Detection
• Spectral flux
  - Sum of the positive bin-wise differences of the log spectrogram
  - The ODF changes depending on the amount of compression ρ
[Plots: spectrogram (frequency vs. time) and the resulting spectral-flux ODF]

$$\mathrm{ODF}(n) = \sum_{k} H\big(Y(n+1, k) - Y(n, k)\big), \qquad Y(n, k) = \log\big(1 + \rho\,|X(n, k)|\big)$$

where $X(n, k)$ is the STFT and the sum runs over all frequency bins $k$.
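A minimal numpy sketch of this log-compressed spectral flux; the frame/hop sizes and the compression factor `rho` are illustrative:

```python
import numpy as np

def spectral_flux(x, frame_size=2048, hop_size=512, rho=1000.0):
    """Sum of positive bin-wise differences of the log spectrogram."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(x) - frame_size) // hop_size
    frames = np.stack([x[n * hop_size : n * hop_size + frame_size] * window
                       for n in range(n_frames)])
    X = np.abs(np.fft.rfft(frames, axis=1))       # magnitude STFT |X(n, k)|
    Y = np.log1p(rho * X)                         # Y(n, k) = log(1 + rho |X|)
    diff = np.diff(Y, axis=0)                     # Y(n+1, k) - Y(n, k)
    return np.sum(np.maximum(diff, 0.0), axis=1)  # half-wave rectify, sum over k
```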
Phase Deviation
• The sinusoidal components of a note are continuous while the note is sustained
  - An abrupt change in phase suggests that there may be a new event
[D. Ellis’ e4896 slides]
Deviation from the steady state over all frequency bins:

$$\phi_k(n) - \phi_k(n-1) \approx \phi_k(n-1) - \phi_k(n-2) \quad \text{(phase continuation, e.g., during the sustain of a single note)}$$

$$\Delta\phi_k(n) = \phi_k(n) - 2\phi_k(n-1) + \phi_k(n-2) \approx 0$$

$$\zeta_p(n) = \frac{1}{N} \sum_{k=1}^{N} \left|\Delta\phi_k(n)\right|$$
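A minimal numpy sketch of the phase-deviation ODF; wrapping the second difference to the principal value (-π, π] is an implementation detail assumed here, not stated on the slide:

```python
import numpy as np

def phase_deviation(x, frame_size=2048, hop_size=512):
    """Mean absolute second difference of the STFT phase over all bins."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(x) - frame_size) // hop_size
    frames = np.stack([x[n * hop_size : n * hop_size + frame_size] * window
                       for n in range(n_frames)])
    phi = np.angle(np.fft.rfft(frames, axis=1))     # phase spectrum phi_k(n)
    dphi = phi[2:] - 2 * phi[1:-1] + phi[:-2]       # second difference
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi  # wrap to (-pi, pi]
    return np.mean(np.abs(dphi), axis=1)            # zeta_p(n)
```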
Post-Processing
• DC removal
  - Subtract the mean of the ODF
• Normalization
  - Scale the level of the ODF
• Low-pass filtering
  - Remove small peaks
• Down-sampling
  - For data reduction (a sketch of all four steps follows below)
[Plot: low-pass filtered ODF shown as the solid line]
(Tzanetakis, 2010)
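A minimal sketch of this post-processing chain; the smoothing length and decimation factor are illustrative:

```python
import numpy as np

def postprocess_odf(odf, smooth_len=11, decimate=2):
    """DC removal, normalization, moving-average low-pass, down-sampling."""
    odf = odf - np.mean(odf)                     # DC removal
    odf = odf / (np.max(np.abs(odf)) + 1e-12)    # normalize the level
    kernel = np.ones(smooth_len) / smooth_len    # simple moving-average LPF
    odf = np.convolve(odf, kernel, mode='same')  # remove small peaks
    return odf[::decimate]                       # down-sample for data reduction
```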
Onset Decision Algorithm
• Rule-based approach: peak detection rules (sketched after the plot below)
  - Peaks above a threshold are taken as onsets
  - The threshold is often computed adaptively from the ODF
  - The mean and median are popular choices for computing the threshold

$$\text{threshold} = \alpha + \beta \cdot \mathrm{median}(\mathrm{ODF}), \qquad \alpha: \text{offset}, \ \beta: \text{scaling}$$
[Plot: ODF with an adaptive threshold computed as the median over a window of size 5]
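A minimal sketch of median-threshold peak picking, assuming `scipy` is available; the values of α, β, and the window size are illustrative:

```python
import numpy as np
from scipy.ndimage import median_filter

def pick_onsets(odf, alpha=0.1, beta=1.0, window=5):
    """Local maxima of the ODF above an adaptive median threshold."""
    threshold = alpha + beta * median_filter(odf, size=window)
    peaks = []
    for n in range(1, len(odf) - 1):
        is_peak = odf[n] > odf[n - 1] and odf[n] >= odf[n + 1]
        if is_peak and odf[n] > threshold[n]:
            peaks.append(n)
    return np.array(peaks)
```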
Challenging Issue in Onset Detection: Vibrato
[Plot: spectral-flux onset detection on a note with vibrato, producing spurious peaks]
SuperFlux
• A state-of-the-art rule-based onset detection function
  - S. Böck and G. Widmer, "Maximum Filter Vibrato Suppression for Onset Detection", DAFx, 2013
• Step 1: log spectrogram
  - Makes the harmonic partials have the same depth of vibrato contour

$$Y(n, m) = \log\big(1 + |X(n, k)| \cdot F(k, m)\big), \qquad X(n, k): \text{STFT}, \ F(k, m): \text{filterbank}$$

• Step 2: max filtering
  - Take the maximum within a window along the frequency axis
  - The vibrato contours become thicker

$$Y_{\max}(n, m) = \max\big(Y(n, m - l : m + l)\big)$$
SuperFlux
[Spectrograms: log spectrogram vs. max-filtered log spectrogram]
SuperFlux
• Step 3: SuperFlux
  - Take the difference over a temporal distance of μ frames
  - Assumption: the frame rate is high in onset detection (i.e., a small hop size)

$$SF^*(n) = \sum_{k} H\big(Y(n + \mu, k) - Y_{\max}(n, k)\big)$$

$$\mu = \max\left(1, \left\lfloor \frac{N/2 - \min\{n \mid w(n) > r\}}{h} + 0.5 \right\rfloor\right), \qquad 0 \le r \le 1$$

• Step 4: peak picking
  1. $SF^*(n) = \max\big(SF^*(n - pre_{\max} : n + post_{\max})\big)$
  2. $SF^*(n) \ge \mathrm{mean}\big(SF^*(n - pre_{avg} : n + post_{avg})\big) + \delta$
  3. $n - n_{\text{last onset}} > \text{combination width}$
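A minimal sketch of the SuperFlux ODF given a precomputed magnitude spectrogram (frames × bins); it omits the filterbank mapping F(k, m), and the max-filter width and μ are illustrative:

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

def superflux(X_mag, mu=2, max_width=3):
    """Max-filter along frequency, then positive difference over mu frames."""
    Y = np.log1p(X_mag)                                  # log compression
    Y_max = maximum_filter1d(Y, size=max_width, axis=1)  # thicken vibrato contours
    diff = Y[mu:] - Y_max[:-mu]                          # Y(n + mu) - Y_max(n)
    return np.sum(np.maximum(diff, 0.0), axis=1)         # half-wave rectify, sum
```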
SuperFlux
[Plots: max-filtered log spectrogram and the peak-picking result]
Tempo Estimation
• Estimate the regular time interval between beats
  - Tempo is often treated as a global attribute of a song, e.g., its BPM or a label such as "mid-tempo"
• Tempo often changes within a song
  - Intentionally, e.g., for dramatic effect: Top 10 tempo changes
  - Unintentionally, e.g., re-mastering, live performance
• There are also local tempo changes, e.g., rubato
Tempo Estimation Methods
• Auto-correlation
  - Find the periodicity, as in pitch detection
• Discrete Fourier transform
  - Apply the DFT to the ODF and find the periodicity
• Comb-filter banks
  - Leverage the "oscillating nature" of musical beats
Auto-Correlation
• The ACF is a generic method for detecting the periodicity of a signal
  - It can thus be applied to the ODF to find a dominant period that may correspond to the tempo (see the sketch after the plots below)
  - The ACF shows dominant peaks that indicate candidate tempi
[Plots: onset detection function (spectral flux) and its auto-correlation]
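A minimal sketch of ACF-based tempo estimation; `fps` is the ODF frame rate, and the BPM search range is illustrative:

```python
import numpy as np

def tempo_from_acf(odf, fps, bpm_range=(60, 240)):
    """Pick the ACF lag with the largest peak inside a plausible BPM range."""
    odf = odf - np.mean(odf)
    acf = np.correlate(odf, odf, mode='full')[len(odf) - 1:]  # lags 0..N-1
    lag_min = int(round(fps * 60.0 / bpm_range[1]))  # fastest tempo, shortest lag
    lag_max = int(round(fps * 60.0 / bpm_range[0]))  # slowest tempo, longest lag
    best_lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return 60.0 * fps / best_lag                     # lag -> BPM
```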
Tempo Estimation Using Tempo Prior
• The tempo is estimated by multiplying the prior with the auto-correlation (the observation)
  - The auto-correlation corresponds to a likelihood function
  - The tempo prior can be calculated from the beat annotations of a dataset
  - The distribution fits a log-normal distribution well (see the sketch below)
[Histogram: beat periods from an annotated dataset, with a fitted log-normal prior]
[D. Ellis' e4896 slides] (Klapuri, 2003)
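A minimal sketch of weighting the ACF by a log-normal tempo prior; the prior center and spread are illustrative values that would, per the slide, be fit to a dataset's beat annotations:

```python
import numpy as np

def tempo_with_prior(acf, fps, center_bpm=120.0, sigma=0.4):
    """Multiply the ACF 'likelihood' by a log-normal tempo prior over lags."""
    lags = np.arange(1, len(acf))
    bpm = 60.0 * fps / lags                       # lag -> tempo in BPM
    prior = np.exp(-0.5 * (np.log(bpm / center_bpm) / sigma) ** 2)
    weighted = acf[1:] * prior                    # posterior-like score per lag
    best_lag = 1 + np.argmax(weighted)
    return 60.0 * fps / best_lag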
Beat Spectrum
• Leverage the repetitive nature of music
• Algorithm (Foote, 2001)
  - Step 1: compute the cosine similarity between pairs of magnitude-spectrogram frames

$$S(i, j) = \frac{V_i \cdot V_j}{\|V_i\| \, \|V_j\|}$$

  - Step 2: sum the elements along the diagonals

$$B(l) = \sum_{k} S(k, k + l)$$
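A minimal sketch of the beat spectrum; normalizing each diagonal sum by its length is an assumption added here for comparability across lags:

```python
import numpy as np

def beat_spectrum(V, max_lag=200):
    """Cosine self-similarity of spectrogram frames, summed along diagonals.

    V: magnitude spectrogram of shape (n_frames, n_bins).
    """
    U = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    S = U @ U.T                               # S(i, j) = cosine similarity
    n = S.shape[0]
    lags = range(min(max_lag, n - 1))
    return np.array([np.trace(S, offset=l) / (n - l) for l in lags])
```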
Beat Spectrum
• A more robust version can be obtained from the 2D auto-correlation of the similarity matrix

$$B(k, l) = \sum_{i,j} S(i, j) \cdot S(i + k, j + l)$$

• The final beat spectrum is derived by summing over one axis
  - The plot shows five beats with a triplet within each beat
• A "beat spectrogram" can also be obtained from successive beat spectra

(Foote, 2001)
[Plot: beat spectrum showing five beats and a triplet within a beat]
Tempogram
• Models the onset function with a sinusoid capturing the predominant local periodicity (PLP)
• Algorithm (Grosche, 2009)
  - Step 1: compute the ODF from the half-wave rectified spectral flux
  - Step 2: find the frequency and phase that maximize the correlation with the ODF, and form a local sinusoidal kernel

$$\kappa(m) = w(m - n)\cos\big(2\pi(\hat{\omega} m - \hat{\varphi})\big)$$

  - Step 3: accumulate the successive local sinusoidal kernels to form a PLP curve
  - Step 4: take the DFT or auto-correlation
Tempogram
• Cyclic tempogram
  - Accumulate the tempogram over integer multiples of a tempo (up to four octaves)
  - Conceptually similar to the chromagram

(Grosche, 2011)
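A minimal sketch of folding a tempogram into a cyclic tempogram; here octave (power-of-two) tempo equivalence is assumed, analogous to pitch classes in a chromagram, and the reference tempo and bin count are illustrative:

```python
import numpy as np

def cyclic_tempogram(tempogram, bpm_axis, ref_bpm=60.0, n_bins=40):
    """Fold a tempogram (n_tempi x n_frames) over tempo octaves."""
    cyclic = np.zeros((n_bins, tempogram.shape[1]))
    # position of each tempo within one octave, in [0, 1)
    octave_pos = np.mod(np.log2(np.asarray(bpm_axis) / ref_bpm), 1.0)
    bin_idx = np.floor(octave_pos * n_bins).astype(int)
    for row, b in enumerate(bin_idx):
        cyclic[b] += tempogram[row]        # accumulate octave equivalents
    return cyclic
```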
Comb-Filter Banks
• Also called resonant filter banks
  - Comb filter equation:

$$y(n) = x(n) + \alpha\, y(n - \tau)$$

• Builds up rhythmic evidence (by anticipation?)

(Klapuri, 2006)
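A minimal sketch of a comb-filter (resonator) bank for tempo estimation; selecting the delay by raw output energy is a simplification, and α is illustrative:

```python
import numpy as np

def comb_filter(x, delay, alpha=0.9):
    """Resonant comb filter: y(n) = x(n) + alpha * y(n - delay)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (alpha * y[n - delay] if n >= delay else 0.0)
    return y

def tempo_from_comb_bank(odf, fps, bpm_range=(60, 240)):
    """Run one comb filter per candidate tempo; pick the most resonant delay."""
    best_bpm, best_energy = bpm_range[0], -np.inf
    for bpm in range(bpm_range[0], bpm_range[1] + 1):
        delay = int(round(fps * 60.0 / bpm))
        energy = np.sum(comb_filter(odf, delay) ** 2)
        if energy > best_energy:
            best_bpm, best_energy = bpm, energy
    return best_bpm
```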
Sub-band Resonant Filter Banks
• Algorithm (Scheirer, 1998)
  - A sub-band filter bank as front-end processing
  - Parallel ODFs for six bands
  - 150 resonators per band, covering all candidate tempo values (60-240 BPM)
  - Pick the delay that produces the highest peak as the tempo
Beat Tracking
• Estimate the positions of beats in music
  - Usually a subset of the detected onsets, selected according to the tempo
Beat Tracking by the Resonator Model
• Once the resonator model chooses the tempo that returns the highest peaks, its output produces a sequence of resonated peaks
  - These peaks correspond to the beats
(Scheirer, 1998)
Beat Tracking by Dynamic Programming
• Find the optimal "hopping" path over the music (Ellis, 2007)

$$C(\{t_i\}) = \sum_{i=1}^{N} O(t_i) + \alpha \sum_{i=2}^{N} F(t_i - t_{i-1}, T)$$

  - $C(\{t_i\})$: cost of the path $\{t_i\}$
  - $O(t_i)$: onset strength function (i.e., the ODF)
  - $F(\Delta t, T)$: tempo ($T$) consistency score, e.g., $F(\Delta t, T) = -\left(\log \frac{\Delta t}{T}\right)^2$
Finding the Minimum-Cost Path
• Naïve approach
  - Find all paths from A to K, calculate the cost of each, and choose the path with the minimum cost
  - As the number of nodes increases, the number of possible paths grows exponentially
[Graph: nodes A-K connected by edges with costs]
Dynamic Programming (DP)
• Observation
  - Say the minimum-cost path passes through a node p
  - What is the minimum-cost path from A to p?
  - It is just a sub-path of the minimum-cost path from A to K
  - Thus, we don't have to compute the cost from scratch; we can reuse the costs computed at the previous nodes
Dynamic Programming (DP)
• The minimum cost is computed by the following recursion (a sketch follows below):

$$C_k(j) = O_k(j) + \min_i \{C_{k-1}(i) + c_{ij}\}$$

  - $C_k(j)$: cost up to node $j$
  - $O_k(j)$: local cost at node $j$
  - $c_{ij}$: transition cost from $i$ to $j$

• The minimum-cost path can be found by tracing back the computation
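A minimal sketch of this DP recursion on a layered graph; the data layout (per-layer local costs `local_cost[k][j]` and transition matrices `edge_cost[k][i][j]`) is an assumption for illustration:

```python
import numpy as np

def min_cost_path(local_cost, edge_cost):
    """DP: C_k(j) = O_k(j) + min_i {C_{k-1}(i) + c_ij}, with traceback."""
    n_layers = len(local_cost)
    C = [np.asarray(local_cost[0], dtype=float)]     # C_0(j) = O_0(j)
    back = []
    for k in range(1, n_layers):
        total = C[k - 1][:, None] + np.asarray(edge_cost[k - 1], dtype=float)
        back.append(np.argmin(total, axis=0))        # best predecessor of j
        C.append(np.asarray(local_cost[k], dtype=float) + total.min(axis=0))
    path = [int(np.argmin(C[-1]))]                   # cheapest final node
    for k in range(n_layers - 2, -1, -1):            # trace back
        path.append(int(back[k][path[-1]]))
    return list(reversed(path)), float(C[-1].min())
```

Each layer's costs are computed only once from the previous layer, which is exactly why the number of operations grows linearly rather than exponentially with the number of nodes.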
Applying DP to Beat Tracking
• To optimize:

$$C(\{t_i\}) = \sum_{i=1}^{N} O(t_i) + \alpha \sum_{i=2}^{N} F(t_i - t_{i-1}, T)$$

  - Define $C^*(t)$ as the best score up to time $t$ and compute it for every $t$
  - Also store the time $P(t)$ that returns the maximum score
  - At the end of the sequence, trace back $P(t)$, which returns the best path $\{t_i\}$ (see the sketch after the plot below)

$$C^*(t) = O(t) + \max_{\tau} \{\alpha F(t - \tau, T) + C^*(\tau)\}$$

$$P(t) = \operatorname*{arg\,max}_{\tau} \{\alpha F(t - \tau, T) + C^*(\tau)\}$$
[Plot: ODF with candidate previous beat times τ contributing to the accumulated score C*(t)]
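A minimal sketch of Ellis-style DP beat tracking over an ODF; the search window of 0.5-2 beat periods, the weight α, and the "only link if it improves the score" rule are illustrative choices:

```python
import numpy as np

def beat_track(odf, fps, tempo_bpm, alpha=100.0):
    """DP: C*(t) = O(t) + max_tau {alpha F(t - tau, T) + C*(tau)}."""
    period = fps * 60.0 / tempo_bpm                  # beat period T in frames
    lo = max(1, int(round(0.5 * period)))            # search window for tau
    hi = int(round(2.0 * period))
    C = odf.astype(float).copy()                     # C*(t), initialized to O(t)
    P = -np.ones(len(odf), dtype=int)                # backpointers P(t)
    for t in range(len(odf)):
        taus = np.arange(max(0, t - hi), t - lo + 1)
        if len(taus) == 0:
            continue
        F = -np.log((t - taus) / period) ** 2        # tempo consistency score
        scores = alpha * F + C[taus]
        best = int(np.argmax(scores))
        if scores[best] > 0:                         # link only if it helps
            C[t] = odf[t] + scores[best]
            P[t] = taus[best]
    beats = [int(np.argmax(C))]                      # trace back from best score
    while P[beats[-1]] >= 0:
        beats.append(int(P[beats[-1]]))
    return np.array(beats[::-1]) / fps               # beat times in seconds
```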
Example of Applying DP to Beat Tracking
References
• E. Scheirer, "Tempo and Beat Analysis of Acoustic Musical Signals", 1998
• J. Foote and S. Uchihashi, "The Beat Spectrum: A New Approach to Rhythm Analysis", 2001
• G. Tzanetakis, "Musical Genre Classification of Audio Signals", 2002
• A. Klapuri, "Analysis of the Meter of Acoustic Musical Signals", 2006
• P. Grosche and M. Muller, "Computing Predominant Local Periodicity Information in Music Recordings", 2009
• P. Grosche and M. Muller, "Cyclic Tempogram – A Mid-Level Tempo Representation for Music Signals", 2010
• D. Ellis, "Beat Tracking by Dynamic Programming", 2007
• S. Bock and G. Widmer, "Maximum Filter Vibrato Suppression for Onset Detection", 2013