Audio Thumbnailing of Popular Music Using Chroma-Based Representations Matt Williamson Chris Scharf...

Post on 02-Jan-2016


Audio Thumbnailing of Popular Music Using Chroma-Based Representations

Matt Williamson

Chris Scharf

Implementation based on:
IEEE Transactions on Multimedia, Vol. 7, No. 1, February 2005
Mark A. Bartsch, Member, IEEE, and Gregory H. Wakefield, Member, IEEE

Introduction

• Multimedia content is growing rapidly

• Efficient method of browsing is necessary

• Indexing and retrieval methods are media-dependent

Primary goal

• Minimize audition time for a given type of media

Current methods

• Images
  – Downsampling
    • Produces a smaller version of the image (thumbnail)
    • Reduces cost of delivery and display

Current methods

• Audio: speech
  – Symbolic representation
    • Produces a transcript of the audio

What about music?

• Adapt an existing method:
  – Downsampling (time compression)
    • Results in highly distorted, unintelligible audio

What about music?

• Adapt an existing method (cont’d):
  – Symbolic representation (score transcription)
    • Extremely difficult
    • Results in essentially meaningless information
    • Does not convey other important elements:
      – Vocal style
      – Instruments used
      – Processing effects used

Essential problem:

Adapting existing methods cannot reduce the audition time for music while still effectively conveying the “gist” of the song

Possible Solution:

Audio thumbnailing via chroma-based analysis

Audio thumbnailing

• Produces a short clip of the selection to represent the “gist” of the song

Chroma-based analysis

• Based on the extraction of chroma features from the audio

• Chroma Feature Extraction Algorithm:
  – Frame Segmentation
  – Feature Calculation
  – Correlation Calculation
  – Correlation Filtering
  – Thumbnail Selection

Chroma Feature Extraction

• Extract frequencies from audio file
• Calculate chroma values from frequencies:

$$c = \log_2 f - \lfloor \log_2 f \rfloor$$

• Categorize chroma values into pitch classes
  – 12 pitch classes: A, A#/Bb, B, C, C#/Db, …, G#/Ab
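The chroma calculation and the pitch-class categorization can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code. The names `chroma`, `pitch_class` and `PITCH_CLASSES` are hypothetical, and dividing by a 440 Hz reference (so that chroma 0 lands on A) is an assumption of this sketch that the slide does not state.

```python
import math

# Pitch-class names starting from A, as listed on the slide.
PITCH_CLASSES = ["A", "A#/Bb", "B", "C", "C#/Db", "D",
                 "D#/Eb", "E", "F", "F#/Gb", "G", "G#/Ab"]

def chroma(freq_hz, ref_hz=440.0):
    """Fractional part of log2(f / ref): a value in [0, 1).

    ASSUMPTION: the 440 Hz reference is added here so that A maps to 0.
    """
    x = math.log2(freq_hz / ref_hz)
    return x - math.floor(x)

def pitch_class(freq_hz, ref_hz=440.0):
    """Index 0..11 of the nearest equal-tempered pitch class."""
    return int(round(12 * chroma(freq_hz, ref_hz))) % 12

if __name__ == "__main__":
    for f in (220.0, 261.63, 440.0):
        print(f, PITCH_CLASSES[pitch_class(f)])
```

Frequencies one octave apart share the same chroma value, which is why 220 Hz and 440 Hz fall into the same pitch class.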

Frame Segmentation

• Authors’ Implementation:
  – Determined via beat tracking algorithm
  – Range: 0.25 s to 0.56 s

• Our Implementation:
  – Fixed frame length equal to the average of that range: 0.41 s
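A fixed-length segmentation like the one described for our implementation might look as follows. This is a sketch under stated assumptions: the function name `segment_frames` is hypothetical, frames are taken as non-overlapping, and a trailing partial frame is dropped. The slides do not say how either case was handled.

```python
def segment_frames(samples, sample_rate, frame_seconds=0.41):
    """Split a sequence of samples into consecutive fixed-length frames.

    ASSUMPTIONS: frames do not overlap, and a trailing partial frame
    is discarded.
    """
    frame_len = int(round(frame_seconds * sample_rate))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]

if __name__ == "__main__":
    # Ten seconds of dummy samples at 100 Hz, split into 0.41 s frames.
    frames = segment_frames(list(range(1000)), sample_rate=100)
    print(len(frames), len(frames[0]))
```

A fixed length avoids the need for a beat tracker, at the cost of frames that no longer line up with the beat.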

Feature Calculation

• Calculate 12-element chroma feature vector, $v_t$, for each frame:
  – Apply FFT to each frame:

$$v_{t,k} = \sum_{n \in S_k} \frac{F_t(n)}{N_k}, \quad k \in \{0, 1, \ldots, 11\}$$

  – Constraints:
    • Minimum frequency: 20 Hz
      – Lower limit of human hearing
    • Maximum frequency: 2000 Hz
      – Higher frequencies affect the perception of chroma
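One possible reading of the feature calculation is sketched below. Since the exact procedure is not spelled out here, several points are assumptions of this sketch: the spectrum value for each bin is taken to be the DFT magnitude, each bin between 20 Hz and 2000 Hz is assigned to its nearest pitch class (forming the set S_k), and the sum for class k is divided by the number of bins N_k in that set. The function name `chroma_vector` and the 440 Hz reference are hypothetical. A direct DFT is used only to keep the example dependency-free; a real implementation would use an FFT routine.

```python
import cmath
import math

def chroma_vector(frame, sample_rate, f_min=20.0, f_max=2000.0, ref_hz=440.0):
    """12-element chroma vector for one frame: mean magnitude per pitch class."""
    n = len(frame)
    sums = [0.0] * 12
    counts = [0] * 12
    for b in range(1, n // 2 + 1):
        freq = b * sample_rate / n
        if freq < f_min or freq > f_max:
            continue  # outside the 20 Hz .. 2000 Hz constraint
        # Direct DFT of bin b; stands in for one output bin of an FFT.
        coeff = sum(x * cmath.exp(-2j * math.pi * b * i / n)
                    for i, x in enumerate(frame))
        # ASSUMPTION: a bin belongs to the nearest pitch class (set S_k).
        x = math.log2(freq / ref_hz)
        k = int(round(12 * (x - math.floor(x)))) % 12
        sums[k] += abs(coeff)
        counts[k] += 1
    # Divide by N_k, the number of bins in S_k; empty classes stay at 0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

if __name__ == "__main__":
    rate, n = 8000, 400
    tone = [math.sin(2 * math.pi * 440.0 * i / rate) for i in range(n)]
    v = chroma_vector(tone, rate)
    print(max(range(12), key=lambda k: v[k]))  # index of the strongest class
```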

Correlation Calculation

• Calculate similarity matrix C
  – Each element is equal to the correlation between two feature vectors:

$$C_{i,j} = v_i^{T} v_j$$

  – High correlation along diagonals in the matrix indicates repetitions within the song
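The similarity matrix can be sketched as an inner product between every pair of feature vectors. For that inner product to equal a correlation coefficient, the vectors must first be shifted to zero mean and scaled to unit length. That normalization step is an assumption of this sketch, and the names `normalize` and `similarity_matrix` are hypothetical.

```python
import math

def normalize(v):
    """Shift to zero mean and scale to unit length (constant vectors become zeros)."""
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered] if norm > 0 else centered

def similarity_matrix(features):
    """C[i][j] = inner product of normalized feature vectors i and j."""
    unit = [normalize(v) for v in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

if __name__ == "__main__":
    feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]
    for row in similarity_matrix(feats):
        print([round(x, 3) for x in row])
```

With this normalization, frames 0 and 2 in the example correlate perfectly even though their magnitudes differ, which is the behavior wanted when the same passage repeats at a different loudness.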

Correlation Filtering

• Calculate the filtered time-lag matrix T:
  – Exposes similarity between extended segments that are separated by a constant lag
  – Filtering is performed along the diagonals of C
    • Uses a symmetric rectangular windowing function (a uniform moving average filter)
  – T is then “rotated” so that the diagonals are oriented vertically

$$T_{i,j} = \sum_{k} C_{i+k,\, i+j+k}\, w(k)$$
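The filtering step can be sketched as a moving average along each diagonal of C, stored so that the row is the segment start and the column is the lag. Two simplifications are assumptions of this sketch: the rectangular window runs forward from frame i (k = 0 to window − 1) rather than being centered, and entries whose window would run past the edge of C are left at zero. The name `time_lag_matrix` is hypothetical.

```python
def time_lag_matrix(C, window):
    """T[i][j] = mean over k in [0, window) of C[i + k][i + j + k].

    Row i is the segment start frame, column j is the lag in frames.
    ASSUMPTION: a one-sided rectangular window; entries that would read
    past the edge of C are left at 0.
    """
    n = len(C)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i + j + window > n:
                continue  # window would leave the matrix
            T[i][j] = sum(C[i + k][i + j + k] for k in range(window)) / window
    return T

if __name__ == "__main__":
    # Toy "song": a four-frame pattern played twice.
    labels = [0, 1, 2, 3, 0, 1, 2, 3]
    C = [[1.0 if a == b else 0.0 for b in labels] for a in labels]
    T = time_lag_matrix(C, window=4)
    print(T[0][4], T[0][1])
```

In the toy example, the entry at start 0 and lag 4 is large because the four frames starting at 0 reappear four frames later, while a lag of 1 yields no match.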

Thumbnail Selection

• Select maximum value in T
  – The location of this value indicates:
    • Occurrence of the segment (the y-coordinate)
    • Lag time (the x-coordinate)
  – Constraints:
    • Minimum lag time = 1/10 of song length
    • Maximum start time = 3/4 of song length
      – To reduce susceptibility to “fading repeat”
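The constrained search for the maximum can be sketched as below. The sketch assumes that T is indexed as T[start][lag] in frames and that "song length" is the number of frames, so both constraints become index limits. The function name `select_thumbnail` and its parameter names are hypothetical.

```python
def select_thumbnail(T, min_lag_fraction=0.1, max_start_fraction=0.75):
    """Return (start_frame, lag_frames) of the largest admissible entry of T.

    ASSUMPTION: T[i][j] holds the filtered similarity for start frame i
    and lag j, and the song length is len(T) frames.
    """
    n = len(T)
    min_lag = n * min_lag_fraction
    max_start = n * max_start_fraction
    best, best_pos = None, None
    for i, row in enumerate(T):
        if i > max_start:
            break  # starts in the last quarter are excluded
        for j, value in enumerate(row):
            if j < min_lag:
                continue  # lags shorter than 1/10 of the song are excluded
            if best is None or value > best:
                best, best_pos = value, (i, j)
    return best_pos

if __name__ == "__main__":
    T = [[0.0] * 10 for _ in range(10)]
    T[2][0] = 5.0  # lag too short: ignored
    T[9][3] = 4.0  # start too late: ignored
    T[3][5] = 2.0  # admissible
    print(select_thumbnail(T))
```

The minimum-lag limit keeps the trivial match of a segment with itself (lag 0) from winning, and the start limit skips repeats near the end of the song.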

Results

• Jimmy Buffett – “Math Suks”
  – System: [64, 89]

• Lifehouse – “You and Me”
  – System: [38, 63]

• Gavin DeGraw – “I Don’t Want To Be”
  – System: [95, 120]

• Super Mario Brothers Theme
  – System: [18, 43]

Conclusion

• Successfully extracted time segments which closely match the chorus of the song

• Feature Calculation issue:
  – Authors’ implementation unclear

Possible Uses

• Audio domain:
  – Improved search capability
    • Searching for similar songs
  – Audio fingerprinting

• Other domains:
  – Detection of irregular heartbeats

Suggested Improvements and Alternatives

• Image-based analysis on the waveform

• Tested alternatives
  – MSE on signal frequencies
    • Chroma-based analysis proved more accurate