Automatic Set List Identification and Song Segmentation for Full-Length Concert Videos

Ju-Chiang Wang¹,², Ming-Chi Yen¹, Yi-Hsuan Yang¹, and Hsin-Min Wang¹
¹Academia Sinica, Taiwan  ²University of California, San Diego, USA
Outline
• Motivation
• Problem Definition & Challenges
• Methodology – segmentation, song identification & alignment techniques
• Evaluation – dataset, performance metrics & results
• Conclusion & Future Work
What is a Set List?
• A list of songs that a band/artist intends to play or has played in a concert
Motivation
• Millions of full-concert (unabridged footage) videos have become available on YouTube

About 37 Million Results Returned!
Motivation
• Natural questions:
  • What is the set list?
  • When does each song begin and end?
• Set list and timecodes provided by uploaders or viewers
Motivation
• Newly uploaded videos and videos of lesser-known artists lack this metadata
• Labeling is labor-intensive
  • Requires experienced annotators
  • At least one annotator must go through the entire video
• AN AUTOMATED SOLUTION IS DESIRABLE!!
No Timecodes! Song 06 is missing!
Problem Definition
• Two sub-tasks: set list identification and song boundary estimation
[Figure: processing pipeline — the audio of a full live concert is processed into segments S1, S2, S3, …; each segment is matched against a studio version database to output its song ID with a start time and an end time]
Challenges
• A live song can be played in many different ways
  • Must handle musically motivated variations (timbre, tempo, key & mode)
• Structural variation
  • Transitions between consecutive songs; repeated oscillation between the sections of different songs
• Events or sections with no reference in the studio version database
  • Intro, outro, solo, banter, big rock ending, broken instruments, malfunctions
• Cover songs from other artists
• Unstable audio quality in user-contributed concert videos
Methodology: Three Components
• Segmentation process – studio version database construction
  • Index the database by extracting a thumbnail of each studio song
• Song identification process – audio fingerprinting (AF) [1] or cover song identification (CSID) [2]
  • Select the top-5 most probable song candidates based on matching scores
• Alignment process – dynamic time warping (DTW), following [3]
  • Search for the boundaries and, at the same time, select the best song based on alignment scores
[1] D. Ellis. Robust Landmark-Based Audio Fingerprinting. 2009.
[2] J. Serrà et al. Cross recurrence quantification for cover song identification. New Journal of Physics, 11(9), 2009.
[3] M. Müller. Information Retrieval for Music and Motion. Springer, 2007.
Studio Version Database Preparation
• For better efficiency and robustness to song-structure variation, we index each studio song by its thumbnail
• Assumption: each live song contains its thumbnail, regardless of any musical variations
• Apply Segmentino [4] and an algorithm similar to [5] to extract the thumbnail (see the sketch after the references below)
  • Ensure a thumbnail is highly repetitive, long enough, and important
[4] M. Mauch et al. Using musical structure to enhance automatic chord transcription. In ISMIR, 2009.
[5] B. Martin et al. Indexing musical pieces using their major repetition. In JCDL, 2011.
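The sketch below illustrates the thumbnail idea only; it is not the Segmentino-based method of [4, 5]. It is a crude, hypothetical extract_thumbnail() helper (assuming librosa is available) that scores fixed-length chroma segments by how strongly they resemble the rest of the track and keeps the most repeated one.

```python
import numpy as np
import librosa

def extract_thumbnail(path, seg_sec=20.0):
    """Return (start, end) in seconds of the most-repeated segment (a crude proxy)."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    hop = 512
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop)
    seg = int(seg_sec * sr / hop)                    # segment length in frames
    starts = list(range(0, chroma.shape[1] - seg, seg // 2))  # half-overlapping candidates
    # flatten and L2-normalize each candidate segment
    mats = np.stack([chroma[:, s:s + seg].ravel() for s in starts])
    mats /= np.linalg.norm(mats, axis=1, keepdims=True) + 1e-9
    sim = mats @ mats.T                              # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)
    best = int(np.argmax(sim.sum(axis=1)))           # most similar to the rest = most repeated
    t0 = starts[best] * hop / sr
    return t0, t0 + seg_sec
```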
The Proposed Greedy Approach
[Figure: flowchart of the greedy approach —
1. Take a probe excerpt X of length L from the full concert Z.
2. Song identification (AF or CSID): retrieve the top-K candidates (K = 5 entire studio tracks, Y1–Y5) from the thumbnail-indexed studio database.
3. Alignment process: find the candidate Yk that best matches X (smallest DTW cost) and estimate its start and end boundaries on X.
4. Remove the identified song from the database; the new start point on Z is the end point of the identified song.
5. Take a new probe excerpt X of length L and repeat.]
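Read as pseudocode, the flowchart boils down to a short greedy loop. The sketch below is illustrative only: identify_top_k() and dtw_align() are hypothetical stand-ins for the AF/CSID matcher and the DTW boundary estimator, the concert is an indexable feature sequence, and the database is a dict mapping song IDs to studio tracks.

```python
def decode_set_list(concert, database, L, k=5):
    """Greedy set-list decoding: identify, align, remove, advance."""
    set_list, pos = [], 0
    while pos + 1 < len(concert) and database:
        probe = concert[pos:pos + L]                     # probe excerpt X
        candidates = identify_top_k(probe, database, k)  # AF or CSID (hypothetical)
        # align each candidate's entire studio track against the probe;
        # dtw_align (hypothetical) returns (avg_cost, start, end) on X
        (cost, start, end), song = min(
            (dtw_align(probe, database[song]), song) for song in candidates)
        set_list.append((song, pos + start, pos + end))
        del database[song]   # each studio song can be identified only once
        pos += end           # next probe starts at the identified end point
    return set_list
```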
DTW Alignment for Boundary Estimation

[Figure: DTW matrix between a probe excerpt X (len = L) and the entire track of a studio candidate song Yk, showing the global optimal warping path (OWP) and the average cost over the search area]

• Goal: search for a local OWP that ends at the frame with the minimum average cost
  • Average cost = accumulated cost / OWP length
• Probe length: L = α × mean studio track length
• Search area: [L/2, L]
• Set the frame index i with the minimum average cost as a boundary
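A minimal sketch of this open-end average-cost search, under the assumption that the studio track Yk (rows) must be consumed entirely while the path may end at any probe frame in the search area [L/2, L]; D and S accumulate path cost and path length so the average cost can be compared across end frames.

```python
import numpy as np

def boundary_by_avg_cost(cost):
    """cost[a, b]: frame distance between studio track Y_k (a) and probe X (b).
    Find the probe frame i in [n/2, n] where a path consuming all of Y_k
    ends with minimum average cost; return (i, that average cost)."""
    m, n = cost.shape
    D = np.full((m + 1, n + 1), np.inf)    # accumulated path cost
    S = np.zeros((m + 1, n + 1))           # path length (for the average)
    D[0, 0] = 0.0
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            steps = [(a - 1, b - 1), (a - 1, b), (a, b - 1)]
            pa, pb = min(steps, key=lambda s: D[s])   # cheapest predecessor
            D[a, b] = cost[a - 1, b - 1] + D[pa, pb]
            S[a, b] = S[pa, pb] + 1
    avg = D[m, 1:] / S[m, 1:]              # average cost per probe end frame
    lo = n // 2                            # search area [L/2, L]
    i = lo + int(np.argmin(avg[lo - 1:])) # 1-based end-frame index
    return i, float(avg[i - 1])
```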
DTW Alignment for Boundary & Song Selection

[Figure: DTW matrix between the truncated probe excerpt (len = L', ending at index i) and the entire track of a studio candidate song, with the global optimal warping path (OWP) of Yk; the search area is [L'/2, L'] and the frame index j has the minimum average cost]

• Search backward from i for the frame that results in the minimum average cost, denoted by δk
• Set that frame index as a boundary j
• Obtain the boundary pair (i, j)
• Select the song with the smallest δk (re-ranking)
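A hedged sketch of the backward search, reusing boundary_by_avg_cost() from the previous sketch: truncating the cost matrix at i and reversing both axes turns the backward search for j into the same open-end search (the index arithmetic mapping back from reversed coordinates is approximate).

```python
def boundaries_and_delta(cost):
    """Boundary pair (j, i) on the probe, plus the re-ranking score delta_k."""
    i, _ = boundary_by_avg_cost(cost)        # end boundary i (previous slide)
    flipped = cost[::-1, :i][:, ::-1]        # truncate at i, reverse both axes
    j_back, delta_k = boundary_by_avg_cost(flipped)
    return i - j_back, i, delta_k            # start j, end i, delta_k

# Re-ranking keeps the top-K candidate whose alignment yields the smallest
# delta_k (frame_dist is a hypothetical frame-wise distance function):
# best = min(top_k, key=lambda Yk: boundaries_and_delta(frame_dist(Yk, X))[2])
```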
Data Collection
• 20 popular full-concert videos with set lists and timecodes
• Music genre: pop/rock
• Manually labeled and refined the start and end boundaries
• 10 artists
• 115.2 studio tracks per artist on average
• 16.2 live songs per concert on average
Artist ID Artist # Concerts # Tracks
01 Coldplay 2 96
02 Maroon 5 2 62
03 Linkin Park 4 88
04 Muse 2 100
05 Green Day 2 184
06 Guns N' Roses 2 75
07 Metallica 1 136
08 Bon Jovi 1 205
09 The Cranberries 2 100
10 Placebo 2 126
Total 20 1152
Pilot Study – Song Identification Efficacy
• Goal: compare the performance of AF and CSID for live song identification
• Queries: live songs manually segmented from the concerts
• Reference database: the complete studio-version tracks
Method MAP Precision@1
AF 0.060 0.048
CSID 0.915 0.904
Random 0.046 0.009
Evaluation
• Input: the audio of a full concert and the set of reference studio tracks
• Output: a sequence of song indices with timecodes
• Evaluation metrics:
  • Edit distance (ED): the dissimilarity between the output song sequence and the ground truth (the smaller the better)
  • Boundary deviation (BD): the start/end time differences, denoted by sBD and eBD (the smaller the better)
  • Frame accuracy (FA): the percentage of accurately labeled frames (200 ms, non-overlapping) (the larger the better)
[Figure: example edit-distance computation between an output song sequence of length 10 and a ground-truth sequence of length 8; here ED = 8]
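ED here appears to be the standard Levenshtein distance applied to song-ID sequences; a minimal sketch:

```python
def edit_distance(output, truth):
    """Levenshtein distance between two song-ID sequences."""
    m, n = len(output), len(truth)
    D = [[a + b if a == 0 or b == 0 else 0 for b in range(n + 1)]
         for a in range(m + 1)]
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            D[a][b] = min(D[a - 1][b] + 1,      # delete a song
                          D[a][b - 1] + 1,      # insert a song
                          D[a - 1][b - 1] + (output[a - 1] != truth[b - 1]))
    return D[m][n]

print(edit_distance([3, 1, 4, 4], [3, 1, 4]))   # -> 1 (one spurious song)
```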
Quantitative Results for Each Concert (α = 1.5)
Concert ID  Artist           # Songs (GT)  # Songs (output)  ED  sBD    eBD    FA
1           Metallica        20            15                17  6.5    89.1   0.317
2           Linkin Park      17            17                4   3.3    12.3   0.786
3           Coldplay         15            15                3   27.2   33.2   0.744
4           Bon Jovi         23            25                14  8.8    66.8   0.441
5           Placebo          19            18                5   11.5   27.8   0.641
6           Guns N' Roses    10            11                1   19.1   22.8   0.875
7           Maroon 5         10            10                6   28.2   39.1   0.428
8           Linkin Park      22            22                9   28.2   39.6   0.610
9           Guns N' Roses    20            21                7   30.7   35.9   0.653
10          The Cranberries  17            15                4   5.3    9.8    0.758
11          The Cranberries  22            21                3   6.0    8.7    0.860
12          Muse             17            19                7   32.0   21.9   0.681
13          Maroon 5         9             12                5   110    155    0.509
14          Coldplay         17            17                2   20.1   18.4   0.777
15          Maroon 5         11            11                7   50.9   72.9   0.393
16          Linkin Park      17            20                9   36.9   24.7   0.544
17          Muse             13            11                4   48.1   94.3   0.626
18          Linkin Park      23            22                10  10.0   34.8   0.636
19          Green Day        7             7                 3   42.4   13.6   0.584
20          Green Day        15            13                9   42.4   36.6   0.465
Qualitative Results for Three Concerts
Guns N' Roses - Live at the Ritz - 1988 - Full concert
Linkin Park - Road To Revolution [Full Concert] HD
Metallica - Rock am Ring 2012 (Full Concert)
Performance Comparison
Method    # Songs (GT)  # Songs (output)  ED   sBD   eBD   FA
Baseline  16.2          19.7              8.9  229   241   0.434
α = 1.2   16.2          18.0              7.3  25.7  57.3  0.582
α = 1.5   16.2          16.1              6.5  23.4  42.9  0.616
α = 1.8   16.2          14.6              8.4  29.3  45.3  0.526
• Baseline method: without DTW alignment, a song is assumed to start at the beginning of the probe excerpt and to end after the length of the identified song Y*; the system then hops 0.1 × the length of Y* to the next probe
• The full system with probe length L = 1.5μ performs best, where μ is the mean length of the studio tracks
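For concreteness, one possible reading of this baseline as code, reusing the hypothetical identify_top_k() and database from the earlier sketches; the exact hop semantics are an assumption.

```python
def baseline_decode(concert, database, L):
    """No-DTW baseline: trust the identifier and the studio durations."""
    out, pos = [], 0
    while pos + 1 < len(concert):
        song = identify_top_k(concert[pos:pos + L], database, k=1)[0]
        dur = len(database[song])   # assume the live song spans the studio length
        out.append((song, pos, pos + dur))
        pos += dur + dur // 10      # hop 0.1 x length of Y* to the next probe
    return out
```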
Conclusion & Future Work
• Propose a novel MIR research problem
  • A new opportunity for MIR researchers to link music/audio technology to real-world applications
• Develop a new dataset and a novel greedy approach
• Future work: expand the dataset and conduct more in-depth signal-level analysis
• Propose this task to MIREX to call for more advanced approaches (given the copyright issues with distributing the studio tracks)
Thank You! Any Questions?