Post on 02-Jan-2016
Audio Thumbnailing of Popular Music Using Chroma-Based
Representations
Matt Williamson
Chris Scharf
Implementation based on:
IEEE Transactions on Multimedia, Vol. 7, No. 1, February 2005
Mark A. Bartsch, Member, IEEE, and Gregory H. Wakefield, Member, IEEE
Introduction
• Multimedia content is growing rapidly
• Efficient method of browsing is necessary
• Indexing and retrieval methods are media-dependent
Primary goal
• Minimize audition time for a given type of media
Current methods
• Images
  – Downsampling
    • Produces a smaller version of the image (thumbnail)
    • Reduces cost of delivery and display
Current methods
• Audio: speech
  – Symbolic representation
    • Produces a transcript of the audio
What about music?
• Adapt an existing method:
  – Downsampling (time compression)
    • Results in highly distorted, unintelligible audio
What about music?
• Adapt an existing method (cont’d):
  – Symbolic representation (score transcription)
    • Extremely difficult
    • Results in essentially meaningless information
    • Does not convey other important elements:
      – Vocal style
      – Instruments used
      – Processing effects used
Essential problem:
Adapting existing methods cannot reduce the audition time for music while still
effectively conveying the “gist” of the song
Possible Solution:
Audio thumbnailing via chroma-based analysis
Audio thumbnailing
• Produces a short clip of the selection to represent the “gist” of the song
Chroma-based analysis
• Based on the extraction of chroma features from the audio
• Chroma Feature Extraction Algorithm:
  – Frame Segmentation
  – Feature Calculation
  – Correlation Calculation
  – Correlation Filtering
  – Thumbnail Selection
Chroma Feature Extraction
• Extract frequencies from audio file
• Calculate chroma values from frequencies:
  $c = \log_2 f - \lfloor \log_2 f \rfloor$
• Categorize chroma values into pitch classes
  – 12 pitch classes: A, A#/Bb, B, C, C#/Db, …, G#/Ab
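The chroma formula and the 12-way categorization can be sketched in a few lines of Python. This is only an illustration, not the presenters' code: the function names, the choice of A = 440 Hz as the reference for class 0, and the nearest-semitone rounding are all assumptions.

```python
import math

# Pitch-class names; indexing A as class 0 is an assumption of this sketch.
PITCH_CLASSES = ["A", "A#/Bb", "B", "C", "C#/Db", "D",
                 "D#/Eb", "E", "F", "F#/Gb", "G", "G#/Ab"]


def chroma(freq_hz):
    """Chroma c = log2(f) - floor(log2(f)), a value in [0, 1)."""
    log_f = math.log2(freq_hz)
    return log_f - math.floor(log_f)


def pitch_class(freq_hz, ref_hz=440.0):
    """Index (0..11) of the nearest equal-tempered pitch class.

    ref_hz is the assumed tuning reference for class 0 (A).
    """
    semitones = 12.0 * math.log2(freq_hz / ref_hz)
    return int(round(semitones)) % 12
```

Frequencies one octave apart share a chroma value, so `chroma(440.0)` and `chroma(880.0)` agree, and both map to the same pitch class.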
Frame Segmentation
• Authors’ Implementation:
  – Frame length determined via beat tracking algorithm
  – Range: 0.25s to 0.56s
• Our Implementation:
  – Fixed frame length, the average of the range: 0.41s
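A fixed-length segmentation like the one described for "our implementation" might look as follows. The function name, the list-of-samples input, and the decision to drop a trailing partial frame are assumptions made for this sketch.

```python
def segment_frames(samples, sample_rate, frame_seconds=0.41):
    """Split a mono signal into consecutive, non-overlapping frames.

    frame_seconds defaults to the 0.41 s figure quoted on the slide.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(round(frame_seconds * sample_rate))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]
```

For example, 1000 samples at 100 Hz give 41-sample frames, so `segment_frames(list(range(1000)), 100)` yields 24 full frames.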
Feature Calculation
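• Calculate 12-element chroma feature vector, $v_t$, for each frame:
  – Apply FFT to each frame and average the spectrum over the bins of each pitch class:
    $v_{t,k} = \sum_{n \in S_k} \frac{F_t(n)}{N_k}, \quad k \in \{0, \ldots, 11\}$
    where $F_t(n)$ is the FFT spectrum of frame $t$ at bin $n$, $S_k$ is the set of bins in pitch class $k$, and $N_k$ is the number of bins in $S_k$
  – Constraints:
    • Minimum frequency: 20 Hz
      – Lower limit of human hearing
    • Maximum frequency: 2000 Hz
      – Higher frequencies affect the perception of chroma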
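One way to read the feature calculation is v_{t,k} = (1/N_k) Σ_{n∈S_k} F_t(n): average the spectrum of frame t over the bins S_k that fall in pitch class k. The sketch below follows that reading under several assumptions. It uses a naive DFT in place of an FFT to stay dependency-free, takes F_t(n) to be the magnitude, and assigns bins to pitch classes by nearest semitone relative to A = 440 Hz. None of these details are confirmed by the slides.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) stand-in for an FFT)."""
    n = len(frame)
    mags = []
    for b in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * b * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def chroma_vector(frame, sample_rate, f_min=20.0, f_max=2000.0, ref_hz=440.0):
    """12-element chroma vector: mean magnitude of the bins in each class.

    Only bins between f_min and f_max (the slide's 20 Hz and 2000 Hz
    limits) contribute. f_min must be positive so that log2 is defined.
    """
    mags = dft_magnitudes(frame)
    n = len(frame)
    sums = [0.0] * 12
    counts = [0] * 12
    for b, mag in enumerate(mags):
        freq = b * sample_rate / n
        if freq < f_min or freq > f_max:
            continue
        # Assumed bin-to-class rule: nearest semitone, with A as class 0.
        k = int(round(12.0 * math.log2(freq / ref_hz))) % 12
        sums[k] += mag
        counts[k] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

A pure 440 Hz tone should put nearly all of its energy into class 0 (A) under these assumptions.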
Correlation Calculation
• Calculate similarity matrix C
  – Each element is equal to the correlation between two feature vectors:
    $C_{i,j} = v_i^T v_j$
  – High correlation along diagonals in the matrix indicates repetitions within the song
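Taking each element as the inner product C_{i,j} = v_i^T v_j, the similarity matrix can be sketched as below. The `normalize` helper (zero mean, unit length) is an assumption added so that the inner product behaves like a correlation coefficient. The slides do not say whether the vectors were normalized.

```python
import math


def normalize(vec):
    """Zero-mean, unit-length copy of vec (all zeros if vec is constant)."""
    mean = sum(vec) / len(vec)
    centered = [x - mean for x in vec]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0.0:
        return [0.0] * len(vec)
    return [x / norm for x in centered]


def similarity_matrix(features):
    """C[i][j] = inner product of feature vectors i and j."""
    n = len(features)
    return [[sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(n)]
            for i in range(n)]
```

With normalized inputs, every diagonal entry of C is 1 (or 0 for a constant frame), and a repeated passage shows up as a stripe of high values parallel to the main diagonal.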
Correlation Filtering
• Calculate the filtered time-lag matrix T:
  – Exposes similarity between extended segments that are separated by a constant lag
  – Filtering is performed along the diagonals of C
    • Uses a symmetric rectangular windowing function (a uniform moving average filter)
  – T is then “rotated” so that the diagonals are oriented vertically
    $T_{i,j} = \sum_k C_{i+k,\, i+j+k}\, w(k)$
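Reading the filter as T_{i,j} = Σ_k C_{i+k, i+j+k} w(k) with a uniform window w, the time-lag matrix is a moving average along each diagonal of C, indexed by start frame i and lag j. The sketch below assumes the window covers k = 0, …, window − 1 and leaves entries at zero where the window would run off the matrix. Both choices are illustrative assumptions.

```python
def time_lag_matrix(C, window):
    """Filtered time-lag matrix T, where T[i][j] is the mean of
    C[i+k][i+j+k] over k = 0..window-1.

    Row i is the start frame, column j is the lag in frames. Entries
    whose window would leave the matrix stay 0.0.
    """
    n = len(C)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n - window + 1):
        for j in range(n - window + 1 - i):
            total = sum(C[i + k][i + j + k] for k in range(window))
            T[i][j] = total / window
    return T
```

A high value at T[i][j] then means that the `window` frames starting at i closely match the `window` frames starting at i + j.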
Thumbnail Selection
• Select maximum value in T
  – The location of this value indicates:
    • Occurrence of the segment (the y-coordinate)
    • Lag time (the x-coordinate)
  – Constraints:
    • Minimum lag time = 1/10 of song length
    • Maximum start time = 3/4 of song length
      – To reduce susceptibility to “fading repeat”
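The constrained search for the maximum can be sketched as follows. It assumes T is indexed as T[start][lag] in frames, measures song length as the number of frames, and breaks ties in favor of the earliest start. The function and parameter names are hypothetical.

```python
import math


def select_thumbnail(T, min_lag_fraction=0.1, max_start_fraction=0.75):
    """Return (start_frame, lag_frames) of the largest admissible T entry.

    Admissible entries have lag >= 1/10 of the song and start <= 3/4 of
    the song, following the constraints listed on the slide.
    """
    n = len(T)
    min_lag = max(1, math.ceil(min_lag_fraction * n))
    max_start = math.floor(max_start_fraction * n)
    best = None
    for i in range(min(max_start + 1, n)):
        for j in range(min_lag, n):
            if best is None or T[i][j] > best[0]:
                best = (T[i][j], i, j)
    if best is None:
        raise ValueError("no admissible (start, lag) pair")
    return best[1], best[2]
```

Multiplying the returned start frame by the frame length (for example 0.41 s) gives the approximate start time of the thumbnail in seconds.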
Results
• Jimmy Buffett – “Math Suks”
  – System: [64, 89]
• Lifehouse – “You and Me”
  – System: [38, 63]
• Gavin DeGraw – “I Don’t Want To Be”
  – System: [95, 120]
• Super Mario Brothers Theme
  – System: [18, 43]
Conclusion
• Successfully extracted time segments which closely match the chorus of the song
• Feature Calculation issue:
  – The authors’ description of their implementation is unclear
Possible Uses
• Audio domain:
  – Improved search capability
    • Searching for similar songs
  – Audio fingerprinting
• Other domains:
  – Detection of irregular heartbeats
Suggested Improvements and Alternatives
• Image-based analysis on the waveform
• Tested alternatives
  – MSE on signal frequencies
    • Chroma-based analysis proved more accurate