Informatics and Mathematical Modelling / Intelligent Signal Processing

Kaare Brandt Petersen

Machine Learning on Sound... how hard can it be?

Audio Information Seminar
Thursday, June 8, 2006
Kaare Brandt Petersen



Agenda

Motivation

The reason it might be hard:
- From data to information
- Features

The good news:
- Computer power and machine learning
- Examples

Conclusions



Motivation

What can we do with audio information?

News archives: Find the grumpy voice in a TV broadcast from a busy street in the Middle East. Search in news archives.

Music: 6 billion friends. Navigating the world landscape of music.



Data

Sound as perceived by humans (12 Monkeys, movie from 1995; bracketed items are sound events, quoted items are dialogue):

[ Beeps ]
- "There's the television"
[ Music - violins ]
[ Steps ]
- "It's all right there"
- "All right there!"
- "Look. Listen. Kneel. Pray"
- "Commercials!"
[ Male voice - indoor ]

and by computers:

-0.00076293945313, 0.00231933593750, -0.00714111328125, 0.00772094726563, 0.00076293945313,
-0.00772094726563, -0.00900268554688, -0.00527954101563, -0.00076293945313, -0.00231933593750,
-0.00714111328125, 0.00024414062500, 0.01312255859375, 0.00650024414063, -0.01052856445313,
-0.01089477539063, -0.00305175781250, -0.01052856445313, -0.01089477539063, -0.00305175781250



Data

Is the data-to-information translation really necessary?

1) Query by signal processing [ humans learn how computers think ]

2) Query by information [ computers learn how humans think ]

3) Query by example [ various approaches ]

Example queries against an archive: "ZCR < 198" (signal processing), "happy jazz" (information)
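A query by signal processing such as "ZCR < 198" can be sketched as a threshold filter over a feature computed per clip. This is only an illustration under stated assumptions: the unit of the threshold (zero crossings per second), the function names, and the dictionary-based archive are hypothetical and not taken from the slides.

```python
def zero_crossings_per_second(samples, sample_rate):
    """Count sign changes between neighbouring samples, scaled to one second."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings * sample_rate / len(samples)


def query_by_zcr(archive, sample_rate, max_zcr=198.0):
    """Return the names of all clips whose zero crossing rate is below max_zcr.

    `archive` is a hypothetical mapping from clip name to a list of samples.
    """
    return [
        name
        for name, samples in archive.items()
        if zero_crossings_per_second(samples, sample_rate) < max_zcr
    ]
```

To write such a query, the user has to know what a zero crossing rate is and which values are typical, which is the sense in which humans learn how computers think.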



Data

Going from 5 million real numbers to "Opera"

Bridging the gap: From data to information

Constructing sound features the right way

Information

Meaning

Context



Features

Many short-time features

- Zero crossing rate
- Spectral flatness
- Spectral bandwidth
- Spectral centroid
- Spectral rolloff
- Spectral flux
- Energy
- ...

- Mel Frequency Cepstral Coefficients (MFCC) [Foote97, Rabiner93]
- Real Cepstral Coefficients (RCC)
- Linear Prediction Coefficients (LPC)
- Wavelets
- Gamma-tone filterbanks
- Sone / Bark
- Chroma features
- ...

[Figure: 12 Monkeys sound clip. Panels: Waveform, Spec, ZCR, Sp-Flatness, Sp-Bandwidth, Sp-Centroid, MFCC 1, MFCC 2-7, Chroma]
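Three of the short-time features listed above (zero crossing rate, energy, spectral centroid) can be sketched in a few lines. The frame size, the hop size, and all function names are assumptions made for illustration. A naive DFT is used only to keep the sketch self-contained; a real system would use an FFT library.

```python
import cmath
import math


def split_into_frames(signal, size, hop):
    """Cut the signal into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of neighbouring sample pairs that change sign."""
    changes = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return changes / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins 0..N/2 (naive O(N^2) transform)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of the frame, in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    bin_width = sample_rate / len(frame)
    return sum(k * bin_width * m for k, m in enumerate(mags)) / total
```

Each function maps one frame to one number, so a clip becomes a sequence of short feature vectors, one per frame.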



Features

Aggregating short-time features

Audio clip = data cloud

Distribution of values
- Basic statistics [Wold96]
- Histograms and vector quantization [Foote97]
- Gaussian Mixture Models [Auc02]
- K-means clustering [Logan01]
- Anchors by Neural Networks [Beren03]

Temporal modelling
- SVD of e.g. spectrogram [Gu04]
- AR coefficients [Meng05]
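The simplest aggregation in the list, basic statistics, can be sketched as follows: the cloud of per-frame feature vectors is summarised by the mean and variance of each feature dimension. The function name and the choice of population variance are assumptions made for illustration.

```python
def aggregate_mean_var(frame_features):
    """Summarise a list of per-frame feature vectors as [means..., variances...].

    All frames are assumed to have the same number of feature dimensions.
    """
    n = len(frame_features)
    dims = len(frame_features[0])
    means = [sum(f[d] for f in frame_features) / n for d in range(dims)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frame_features) / n
        for d in range(dims)
    ]
    return means + variances
```

This turns a clip of any length into one fixed-length vector, but it discards the order of the frames. That loss is what the temporal modelling approaches try to avoid.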



Features

What we are trying to do: From data to information

Data:
-0.00076293945313, 0.00231933593750, -0.00714111328125, 0.00772094726563,
0.00076293945313, -0.00772094726563, -0.00900268554688, -0.00527954101563,
-0.00076293945313, -0.00231933593750, -0.00714111328125, 0.00024414062500,
0.01312255859375, 0.00650024414063, -0.01052856445313, -0.01089477539063

Low-level features: ZCR, Spectral, MFCC, Chroma, Sone/Bark, RCC, LPC, ...

High-level features: Basic stats, GMM, K-means, Anchors, AR coeff, SVD, HMM, ...

Information: "Rough", "Deep", "Sparky", "Broad", "Melancholic", "Majestic", "Jazz", "Rock", ...



Features

Music similarity example

"Shape of My Heart", Backstreet Boys, 2000

"That's the Way It Is", Celine Dion, 2000

"Cantaloop", Us3, 1993

"The limitations observed in this paper (...) suggests that the usual route to timbre similarity may not be the optimal one" [Auc04]



The bad news

Sound data is far from the information

Not all features are useful

It is not obvious what the information labels should be



The good news

Computer power

Signal processing

- Strong development in signal processing and machine learning in general

- Large amounts of data

- Increased interest in sound and music processing



Example: Genre estimation

Genre estimation by temporal integration

Peter Ahrendt, Anders Meng [Meng05]

Processing: Sound -> MFCC -> AR
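The AR step of this pipeline can be sketched as fitting autoregressive coefficients to the trajectory of one feature (for example one MFCC dimension) over a window of frames. The sketch below is a simplified univariate Yule-Walker estimate solved with the Levinson-Durbin recursion. It is an assumption made for illustration, and the formulation in [Meng05] may differ, for instance by modelling all feature dimensions jointly.

```python
def autocorrelation(x, max_lag):
    """Biased autocovariance estimates r[0..max_lag] of the mean-removed series."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]
    return [
        sum(c[t] * c[t + lag] for t in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]


def ar_coefficients(x, order):
    """Estimate a[0..order-1] such that x[t] ~ sum_k a[k] * x[t - 1 - k].

    Solves the Yule-Walker equations with the Levinson-Durbin recursion.
    """
    r = autocorrelation(x, order)
    a = []
    err = r[0]  # prediction error variance of the current model
    for m in range(1, order + 1):
        if err <= 1e-12:
            a.append(0.0)  # degenerate case: no variance left to explain
            continue
        acc = r[m] - sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        refl = acc / err  # reflection coefficient of stage m
        a = [a[j] - refl * a[m - 2 - j] for j in range(m - 1)] + [refl]
        err *= 1.0 - refl * refl
    return a
```

Applied to each feature trajectory in a window, the resulting coefficients form a fixed-length description of how the short-time features evolve over time, rather than only how they are distributed.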



Example: Genre estimation

Genre estimation by temporal integration + kernel methods

Jeronimo Arenas-Garcia, Tue Lehn-Schiøler, Kaare Brandt Petersen [ArGa06]

Processing: Sound -> MFCC -> AR -> KOPLS

Btw: A data harvesting tool is coming up (ISMIR 2006)



Example: Source separation

Spectrogram modelling with sparse NTF2D

Morten Mørup, Mikkel Schmidt [Mørup06]

W = time-frequency patterns
H = time, amplitude, pitch

[Figure: spectrograms, Time [s] vs. Frequency [Hz]. Panels: Original (mixed); Separated sources (Harp), (Flute)]
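The idea of factorising a non-negative spectrogram into patterns W and activations H can be sketched with plain non-negative matrix factorisation (NMF) using multiplicative updates. This is a much simpler stand-in and not the sparse NTF2D model of [Mørup06]: it has no sparsity term, no tensor structure, and no deconvolution over time or pitch. All names and parameters are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorise non-negative V (freq x time) as W (freq x rank) times H (rank x time)."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + eps for _ in range(cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(rows)]
    return w, h
```

In this simplified picture each column of W is one spectral pattern and the matching row of H says when and how strongly it is active. A separated source is then reconstructed from the components assigned to it.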



Example: CNN

Translating a CNN news broadcast

Kasper Jørgensen, Lasse Mølgaard, Lars Kai Hansen [Jorg06]

Music or speech?
Sound -> MFCC, STE, SpF, ZCR -> mean/var

Speaker change detection
Sound -> MFCC -> VQ

Speech recognition
Sphinx 4 (Carnegie Mellon)



Conclusions

It is hard:
- Sound data is far from the information
- Good features are hard to find

but machine learning is catching up:
- Examples: Genre, Source separation, CNN translation



References

[Wold96] Wold, E.; Blum, T.; Keislar, D. & Wheaton, J., "Content-based Classification, Search, and Retrieval of Audio", IEEE Multimedia, 1996, 3, 27-36

[Foote97] Foote, J., "Content-based retrieval of music and audio", Multimedia Storage and Archiving Systems II, Proc. of SPIE, 1997, 3229, 138-147

[Logan01] Logan and Salomon, "A music similarity function based on signal analysis", ICME 2001

[Beren03] Berenzweig, Ellis and Lawrence, "Anchorspace for classification and similarity measurement of music", ICME 2003

[Rabiner93] Rabiner, L. & Juang, B.H., "Fundamentals of Speech Recognition", Prentice-Hall, 1993

[Gu04] Gu, Lu, Cai and Zhang, "Dominant Feature vector based audio similarity measure", Proceedings of the Pacific Rim Conference on Multimedia, PCM, 2004

[Tza02] Tzanetakis and Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, 2002, 10, 293-302

[Auc02] Aucouturier and Pachet, "Music Similarity Measures: What's the use?", ISMIR 2002

[Meng05] Anders Meng, Peter Ahrendt and Jan Larsen, "Improving Music Genre Classification by Short-Time Feature Integration", ICASSP, 2005

[Auc04] Aucouturier and Pachet, "Improving Timbre Similarity: How high is the sky?", JNRSAS, 2004

[Mørup06] "Sparse Non-negative Tensor Factor Double Deconvolution (SNTF2D) for multi channel time-frequency analysis", submitted to JMLR 2006

[ArGa06] "Reduced Kernel Orthonormal Partial Least Squares", submitted for NIPS 2006

[Jorg06] Kasper Jørgensen, Lasse Mølgaard and Lars Kai Hansen, "Unsupervised speaker change detection for broadcast news segmentation", EUSIPCO 2006
