Automatic Set List Identification and Song Segmentation for Full-Length Concert Videos

Ju-Chiang Wang¹,², Ming-Chi Yen¹, Yi-Hsuan Yang¹, and Hsin-Min Wang¹
¹Academia Sinica, Taiwan  ²University of California, San Diego, USA
Outline
• Motivation
• Problem Definition & Challenges
• Methodology – segmentation, song identification & alignment techniques
• Evaluation – dataset, performance metrics & results
• Conclusion & Future Work
What is a Set List?
• A list of songs that a band/artist intends to play or has played in a concert
Motivation
• Millions of full-concert (unabridged footage) videos have become available on YouTube

About 37 Million Results Returned!
Motivation
• Natural questions:
  • What is the set list?
  • When does each song begin and end?
• Set list and timecodes provided by uploaders or viewers
Motivation
• Newly uploaded videos and videos of lesser-known artists lack this metadata
• Labeling is labor-intensive
  • Requires experienced annotators
  • At least one annotator must go through the entire video
• AN AUTOMATED SOLUTION IS DESIRABLE!!
No Timecodes! Song 06 is missing!
Problem Definition
• Two sub-tasks: set list identification and song boundary estimation
[Figure: processing pipeline — the audio of a full live concert is processed into segments S1, S2, S3, …; each segment is matched against a studio version database to output its song ID with a start time and an end time]
Challenges
• A live song can be played in many different ways
  • Must handle musically motivated variations (timbre, tempo, key & mode)
• Structural variation
  • Transitions between consecutive songs; repeated oscillation between the sections of different songs
• Events or sections with no reference in the studio version database
  • Intro, outro, solo, banter, big rock ending, broken instruments, malfunctions
• Cover songs from other artists
• Unstable audio quality in user-contributed concert videos
Methodology: Three Components
• Segmentation process – studio version database construction
  • Index the database by extracting a thumbnail of each studio song
• Song identification process – audio fingerprinting (AF) [1] or cover song identification (CSID) [2]
  • Select the top-5 most probable song candidates based on matching scores
• Alignment process – dynamic time warping (DTW), following [3]
  • Search for the boundaries and, at the same time, select the best song based on alignment scores
[1] D. Ellis. Robust Landmark-Based Audio Fingerprinting. 2009.
[2] J. Serrà et al. Cross recurrence quantification for cover song identification. New Journal of Physics, 11(9), 2009.
[3] M. Müller. Information Retrieval for Music and Motion. Springer, 2007.
Studio Version Database Preparation
• For better efficiency and robustness to song-structure variation, we index each studio song by its thumbnail
• Assumption: each live song contains its thumbnail, regardless of any musical variations
• Apply Segmentino [4] and an algorithm similar to [5] to extract the thumbnail (see the sketch after the references below)
  • Ensure a thumbnail is highly repetitive, long enough, and important
[4] M. Mauch et al. Using musical structure to enhance automatic chord transcription. In ISMIR, 2009.
[5] B. Martin et al. Indexing musical pieces using their major repetition. In JCDL, 2011.
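The sketch below illustrates the thumbnail idea only; it is not the Segmentino-based method of [4, 5]. It is a crude, hypothetical extract_thumbnail() helper (assuming librosa is available) that scores fixed-length chroma segments by how strongly they resemble the rest of the track and keeps the most repeated one.

```python
import numpy as np
import librosa

def extract_thumbnail(path, seg_sec=20.0):
    """Return (start, end) in seconds of the most-repeated segment (a crude proxy)."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    hop = 512
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop)
    seg = int(seg_sec * sr / hop)                    # segment length in frames
    starts = list(range(0, chroma.shape[1] - seg, seg // 2))  # half-overlapping candidates
    # flatten and L2-normalize each candidate segment
    mats = np.stack([chroma[:, s:s + seg].ravel() for s in starts])
    mats /= np.linalg.norm(mats, axis=1, keepdims=True) + 1e-9
    sim = mats @ mats.T                              # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)
    best = int(np.argmax(sim.sum(axis=1)))           # most similar to the rest = most repeated
    t0 = starts[best] * hop / sr
    return t0, t0 + seg_sec
```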
The Proposed Greedy Approach
[Figure: flowchart of the greedy approach —
1. Take a probe excerpt X of length L from the full concert Z.
2. Song identification (AF or CSID): retrieve the top-K candidates (K = 5 entire studio tracks, Y1–Y5) from the thumbnail-indexed studio database.
3. Alignment process: find the candidate Yk that best matches X (smallest DTW cost) and estimate its start and end boundaries on X.
4. Remove the identified song from the database; the new start point on Z is the end point of the identified song.
5. Take a new probe excerpt X of length L and repeat.]
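Read as pseudocode, the flowchart boils down to a short greedy loop. The sketch below is illustrative only: identify_top_k() and dtw_align() are hypothetical stand-ins for the AF/CSID matcher and the DTW boundary estimator, the concert is an indexable feature sequence, and the database is a dict mapping song IDs to studio tracks.

```python
def decode_set_list(concert, database, L, k=5):
    """Greedy set-list decoding: identify, align, remove, advance."""
    set_list, pos = [], 0
    while pos + 1 < len(concert) and database:
        probe = concert[pos:pos + L]                     # probe excerpt X
        candidates = identify_top_k(probe, database, k)  # AF or CSID (hypothetical)
        # align each candidate's entire studio track against the probe;
        # dtw_align (hypothetical) returns (avg_cost, start, end) on X
        (cost, start, end), song = min(
            (dtw_align(probe, database[song]), song) for song in candidates)
        set_list.append((song, pos + start, pos + end))
        del database[song]   # each studio song can be identified only once
        pos += end           # next probe starts at the identified end point
    return set_list
```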
DTW Alignment for Boundary Estimation

[Figure: DTW matrix between a probe excerpt X (len = L) and the entire track of a studio candidate song Yk, showing the global optimal warping path (OWP) and the average cost over the search area]

• Goal: search for a local OWP that ends at the frame with the minimum average cost
  • Average cost = accumulated cost / OWP length
• Probe length: L = α × mean studio track length
• Search area: [L/2, L]
• Set the frame index i with the minimum average cost as a boundary
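A minimal sketch of this open-end average-cost search, under the assumption that the studio track Yk (rows) must be consumed entirely while the path may end at any probe frame in the search area [L/2, L]; D and S accumulate path cost and path length so the average cost can be compared across end frames.

```python
import numpy as np

def boundary_by_avg_cost(cost):
    """cost[a, b]: frame distance between studio track Y_k (a) and probe X (b).
    Find the probe frame i in [n/2, n] where a path consuming all of Y_k
    ends with minimum average cost; return (i, that average cost)."""
    m, n = cost.shape
    D = np.full((m + 1, n + 1), np.inf)    # accumulated path cost
    S = np.zeros((m + 1, n + 1))           # path length (for the average)
    D[0, 0] = 0.0
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            steps = [(a - 1, b - 1), (a - 1, b), (a, b - 1)]
            pa, pb = min(steps, key=lambda s: D[s])   # cheapest predecessor
            D[a, b] = cost[a - 1, b - 1] + D[pa, pb]
            S[a, b] = S[pa, pb] + 1
    avg = D[m, 1:] / S[m, 1:]              # average cost per probe end frame
    lo = n // 2                            # search area [L/2, L]
    i = lo + int(np.argmin(avg[lo - 1:])) # 1-based end-frame index
    return i, float(avg[i - 1])
```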
DTW Alignment for Boundary & Song Selection

[Figure: DTW matrix between the truncated probe excerpt (len = L', ending at index i) and the entire track of a studio candidate song, with the global optimal warping path (OWP) of Yk; the search area is [L'/2, L'] and the frame index j has the minimum average cost]

• Search backward from i for the frame that results in the minimum average cost, denoted by δk
• Set that frame index as a boundary j
• Obtain the boundary pair (i, j)
• Select the song with the smallest δk (re-ranking)
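A hedged sketch of the backward search, reusing boundary_by_avg_cost() from the previous sketch: truncating the cost matrix at i and reversing both axes turns the backward search for j into the same open-end search (the index arithmetic mapping back from reversed coordinates is approximate).

```python
def boundaries_and_delta(cost):
    """Boundary pair (j, i) on the probe, plus the re-ranking score delta_k."""
    i, _ = boundary_by_avg_cost(cost)        # end boundary i (previous slide)
    flipped = cost[::-1, :i][:, ::-1]        # truncate at i, reverse both axes
    j_back, delta_k = boundary_by_avg_cost(flipped)
    return i - j_back, i, delta_k            # start j, end i, delta_k

# Re-ranking keeps the top-K candidate whose alignment yields the smallest
# delta_k (frame_dist is a hypothetical frame-wise distance function):
# best = min(top_k, key=lambda Yk: boundaries_and_delta(frame_dist(Yk, X))[2])
```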
Data Collection
• 20 popular full-concert videos with set lists and timecodes
• Music genre: pop/rock
• Manually labeled and refined the start and end boundaries
• 10 artists
• 115.2 studio tracks per artist on average
• 16.2 live songs per concert on average
Artist ID Artist # Concerts # Tracks
01 Coldplay 2 96
02 Maroon 5 2 62
03 Linkin Park 4 88
04 Muse 2 100
05 Green Day 2 184
06 Guns N' Roses 2 75
07 Metallica 1 136
08 Bon Jovi 1 205
09 The Cranberries 2 100
10 Placebo 2 126
Total 20 1152
Pilot Study – Song Identification Efficacy
• Goal: compare the performance of AF and CSID for live song identification
• Queries: live songs manually segmented from the concerts
• Reference database: the complete studio-version tracks
Method MAP Precision@1
AF 0.060 0.048
CSID 0.915 0.904
Random 0.046 0.009
Evaluation
• Input: the audio of a full concert and the set of reference studio tracks
• Output: a sequence of song indices with timecodes
• Evaluation metrics:
  • Edit distance (ED): the dissimilarity between the output song sequence and the ground truth (the smaller the better)
  • Boundary deviation (BD): the start/end time differences, denoted by sBD and eBD (the smaller the better)
  • Frame accuracy (FA): the percentage of accurately labeled frames (200 ms, non-overlapping) (the larger the better)
[Figure: example edit-distance computation between an output song sequence of length 10 and a ground-truth sequence of length 8; here ED = 8]
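ED here appears to be the standard Levenshtein distance applied to song-ID sequences; a minimal sketch:

```python
def edit_distance(output, truth):
    """Levenshtein distance between two song-ID sequences."""
    m, n = len(output), len(truth)
    D = [[a + b if a == 0 or b == 0 else 0 for b in range(n + 1)]
         for a in range(m + 1)]
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            D[a][b] = min(D[a - 1][b] + 1,      # delete a song
                          D[a][b - 1] + 1,      # insert a song
                          D[a - 1][b - 1] + (output[a - 1] != truth[b - 1]))
    return D[m][n]

print(edit_distance([3, 1, 4, 4], [3, 1, 4]))   # -> 1 (one spurious song)
```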
Quantitative Results for Each Concert (α = 1.5)
Concert ID  Artist           # Songs (GT)  # Songs (output)  ED  sBD    eBD    FA
1           Metallica        20            15                17  6.5    89.1   0.317
2           Linkin Park      17            17                4   3.3    12.3   0.786
3           Coldplay         15            15                3   27.2   33.2   0.744
4           Bon Jovi         23            25                14  8.8    66.8   0.441
5           Placebo          19            18                5   11.5   27.8   0.641
6           Guns N' Roses    10            11                1   19.1   22.8   0.875
7           Maroon 5         10            10                6   28.2   39.1   0.428
8           Linkin Park      22            22                9   28.2   39.6   0.610
9           Guns N' Roses    20            21                7   30.7   35.9   0.653
10          The Cranberries  17            15                4   5.3    9.8    0.758
11          The Cranberries  22            21                3   6.0    8.7    0.860
12          Muse             17            19                7   32.0   21.9   0.681
13          Maroon 5         9             12                5   110    155    0.509
14          Coldplay         17            17                2   20.1   18.4   0.777
15          Maroon 5         11            11                7   50.9   72.9   0.393
16          Linkin Park      17            20                9   36.9   24.7   0.544
17          Muse             13            11                4   48.1   94.3   0.626
18          Linkin Park      23            22                10  10.0   34.8   0.636
19          Green Day        7             7                 3   42.4   13.6   0.584
20          Green Day        15            13                9   42.4   36.6   0.465
Qualitative Results for Three Concerts
Guns N' Roses - Live at the Ritz - 1988 - Full concert
Linkin Park - Road To Revolution [Full Concert] HD
Metallica - Rock am Ring 2012 (Full Concert)
Performance Comparison
Method    # Songs (GT)  # Songs (output)  ED   sBD   eBD   FA
Baseline  16.2          19.7              8.9  229   241   0.434
α = 1.2   16.2          18.0              7.3  25.7  57.3  0.582
α = 1.5   16.2          16.1              6.5  23.4  42.9  0.616
α = 1.8   16.2          14.6              8.4  29.3  45.3  0.526
• Baseline method: without DTW alignment, a song is assumed to start at the beginning of the probe excerpt and to end after the length of the identified song Y*; the system then hops 0.1 × the length of Y* to the next probe
• The full system with probe length L = 1.5μ performs best, where μ is the mean length of the studio tracks
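For concreteness, one possible reading of this baseline as code, reusing the hypothetical identify_top_k() and database from the earlier sketches; the exact hop semantics are an assumption.

```python
def baseline_decode(concert, database, L):
    """No-DTW baseline: trust the identifier and the studio durations."""
    out, pos = [], 0
    while pos + 1 < len(concert):
        song = identify_top_k(concert[pos:pos + L], database, k=1)[0]
        dur = len(database[song])   # assume the live song spans the studio length
        out.append((song, pos, pos + dur))
        pos += dur + dur // 10      # hop 0.1 x length of Y* to the next probe
    return out
```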
Conclusion & Future Work
• Propose a novel MIR research problem
  • A new opportunity for MIR researchers to link music/audio technology to real-world applications
• Develop a new dataset and a novel greedy approach
• Future work: expand the dataset and conduct more in-depth signal-level analysis
• Propose this task to MIREX to call for more advanced approaches (given the copyright issues with distributing the studio tracks)
Thank You! Any Questions?