Automatic speech/music discrimination in audio files

Lars Ericsson

Master’s thesis in Music Acoustics (30 ECTS credits)

at the School of Media Technology, Royal Institute of Technology, 2009

Supervisor at SR was Björn Carlsson
Supervisor at CSC was Anders Friberg

Examiner was Sten Ternström


Automatic speech/music discrimination in audio files

Abstract

This master’s thesis presents an algorithm for discrimination of speech and music in audio files. The algorithm is made for Swedish Radio and is therefore suited to their needs and optimized for their material. A series of tests on acoustic features extracted from Swedish Radio broadcasts was performed to distinguish the best method for the discrimination. These tests were evaluated to find the method most suited for the task. Methods based on the RMS amplitude of the signal were chosen for both classification and segmentation. A feature for the proportion of low energy, which exploits the small pauses that tell speech from music, was used for the classification, and a similarity measure based on the mean and variance of the RMS was used to find the transition points for the segments. This resulted in an algorithm that can discriminate speech from music in Swedish Radio broadcasts with an accuracy of 97.3%.


Automatisk diskriminering av tal och musik i ljudfiler

Sammanfattning

This master’s thesis presents an algorithm for discrimination of speech and music in audio files. The algorithm was made for Swedish Radio and is therefore designed for their needs and optimized for their material. To arrive at the best method for the discrimination, tests were performed on acoustic features extracted from material in SR’s digital archive. These tests were then evaluated to find the method best suited for the purpose. For both classification and segmentation, methods based on the RMS amplitude of the signal were chosen. A low-energy-proportion feature that exploits the short pauses that distinguish speech from music was chosen to classify the segments, and a similarity measure based on the mean and variance of the RMS was chosen to find the segment boundaries. This resulted in an algorithm that can discriminate speech from music in SR’s programming with 97.3% accuracy.


Contents

1 Introduction
  1.1 Speech/Music discrimination
  1.2 Swedish Radio
  1.3 Goal
  1.4 Method
  1.5 Limitations
  1.6 Overview of the paper

2 Background
  2.1 System structure
    2.1.1 Online systems
    2.1.2 Offline systems
  2.2 Features and feature extraction
    2.2.1 Speech and music
    2.2.2 Standard Low-Level (SLL) features
      2.2.2.1 RMS
      2.2.2.2 Zero Crossing Rate
      2.2.2.3 Spectral Centroid
      2.2.2.4 Spectral Rolloff
      2.2.2.5 Flux (Delta Spectrum Magnitude)
    2.2.3 Frequency Cepstrum Coefficients
    2.2.4 Psychoacoustic features
    2.2.5 Special features
    2.2.6 Psychoacoustic pitch scales
    2.2.7 Extracting features
  2.3 Segmentation
  2.4 Classification methods
    2.4.1 Hidden Markov Models
    2.4.2 System learning
    2.4.3 Refined classification
  2.5 Evaluation methods
  2.6 Earlier results

3 Evaluation and tests
  3.1 Test database
  3.2 Tools
  3.3 Feature tests
    3.3.1 Feature test results
    3.3.2 Low-level features
    3.3.3 Mel Frequency Cepstrum Coefficients
    3.3.4 Modified Low Energy Ratio
  3.4 Segmentation tests
  3.5 Test evaluations

4 Algorithm
  4.1 Signal preprocessing
  4.2 Feature extraction
  4.3 Segmentation
  4.4 Classification
  4.5 Refinement
  4.6 Output
  4.7 Results

5 Conclusions

6 Future work

7 Acknowledgements

8 References


List of Abbreviations

DCT   Discrete Cosine Transform
ERB   Equivalent Rectangular Bandwidth
FFT   Fast Fourier Transform
LFCC  Linear Frequency scaled Cepstrum Coefficients
LPC   Linear Predictive Coding
MFCC  Mel Frequency Cepstrum Coefficients
MLER  Modified Low Energy Ratio
RMS   Root Mean Square
SC    Spectral Centroid
SLL   Standard Low Level (features)
SMD   Speech/Music Discrimination
SR    Swedish Radio
ZCR   Zero Crossing Rate



1 Introduction

This chapter includes an overview of the task, the purpose, method and limitations of the work. It also gives a brief view of the rest of the report.

1.1 Speech/Music discrimination

The purpose of speech/music discrimination (SMD) systems is to divide audio into music and speech segments and classify them, whether the discrimination is done in real time or on recorded audio files.

The speech/music discrimination task is an important part of Automatic Speech Recognition (ASR) systems, where it is used to disable the ASR when music or other classes of audio are present in automatic transcription of speech.

SMD systems are also useful for bit-rate coders. Speech coders achieve better results for speech than music coders do, and vice versa. It is therefore important to discriminate between the two audio classes in order to select the right type of bit-rate coding.

An SMD system can also be of great importance when indexing sound and even video. Apart from detecting speech and music in audio files, the same algorithm can be applied to the audio of TV shows or movies. The output can then be used to index the video material so that a viewer can jump straight to a desired part in, e.g., on-demand video solutions for the web.

1.2 Swedish Radio

SR is a non-commercial, public service radio broadcaster with over 40 radio channels, including four national FM channels (P1, P2, P3 and P4) and 28 local channels. P4 is the biggest radio channel in Sweden. SR also offers more than 10 channels online on their website, together with an archive of all broadcast programs available on demand. Broadcasts are also available via shortwave, medium wave and satellite. [12]

1.3 Goal

A system that effectively discriminates between speech and music would be useful for Swedish Radio in many of the above-mentioned applications. The goal is therefore to create an algorithm that can do the task accurately for the material produced by Swedish Radio. To meet the requirements of a cost-efficient, accurate algorithm, an offline system structure was chosen.

1.4 Method

The project started with a review of the necessary literature and articles on previous work. This made it possible to enter the test phase with knowledge of the area. In this phase many features and discrimination methods were tested and evaluated in order to find the method best suited for this specific task. The chosen method was then coded in a way that permitted integration into existing systems at Swedish Radio.


1.5 Limitations

This work focused on the discrimination between speech and music, and did not attempt a finer discrimination within the classes. Thus, the algorithm will not discriminate between different kinds of speakers, such as women, men or children, and will not tell different voices apart. Music constitutes a single class and is not further divided into classes for different genres.

1.6 Overview of the paper

The paper starts with a background chapter including an introduction to the area of Music Information Retrieval and an overview of earlier work in the speech/music discrimination area. The chapter also presents features that are commonly used for these kinds of systems. Chapter 3 contains all the tests done within this work; it also presents the test results and ends with an evaluation of the results. Chapter 4 gives a detailed description of the finished algorithm, and the paper then ends with a chapter of final conclusions drawn from the project.


2 Background

This chapter gives an introduction to the part of the Music Information Retrieval research area that has been used during this project. It also gives a brief look at earlier work.

2.1 System structure

Earlier systems developed for SMD tasks have different structures. They can be divided into two main groups: online systems, where the discrimination is made in real time, and offline systems, where the discrimination is made on audio files. Both groups have their advantages; online systems are better suited for live purposes, while offline systems can be made more accurate and faster. Different tasks require different systems, suited for specific purposes.

2.1.1 Online systems

In online systems, both the segmentation and the classification tasks need to be done at the same time. They can even be regarded as one task, where the output of the classifier is used for the segmentation. This is all done in real time. “Real-time” means that the system outputs results continuously as the input audio stream comes in, but with a delay of at least one analysis frame.

Online systems often concentrate on finding large changes in the audio in order to find the borders of speech or music segments. Saunders does this in [9] by dividing the audio into non-overlapping 16 ms frames (256 samples at 16 kHz) from which simple features are extracted, and by using a longer 2.4 s analysis frame (150 × 16 ms frames) for the statistical features used in the classification (see the next chapter for more on features). A multivariate Gaussian classifier then performs the classification.

Another real-time system is described in [7]. A typical frame size in earlier online systems is around 20 ms, while the analysis frame varies from half a second up to three seconds. The use of statistical features demands longer analysis frames to get good results. The short frame is used to extract features, and the longer analysis frame is used to extract statistics of these features, such as variance and mean values.

2.1.2 Offline systems

The structure of offline systems varies considerably. The most common system has three steps plus pre- and post-processing, as seen in Figure 1. First the audio is divided into frames, and from each frame a set of features is extracted and stored in a feature vector. In the second step the system segments the sound, and in the third step the segments are classified using some kind of classification method that decides whether a segment consists of speech or music. The SMD task is often done in more steps in an offline system than in an online system. The segmentation can be refined, as in [8], by using the fact that neighboring segments have a high probability of containing the same class.


Figure 1. Block diagram of an offline system structure.

In [8] some classification is done during the segmentation, and the more difficult segments are left for another, more complex classification method. This saves computation time, since a simpler classifier makes the first classification. Other systems classify large segments and then divide the segments where a border is found into smaller segments, which is also a way to save computation time.

Offline systems often use a frame length of around 20 ms and an analysis window length of around 1 second. Shorter frames make the features more sensitive to noise, while longer frames might include too many phonemes.

2.2 Features and feature extraction

Sound has many features, some that our ears can pick up and some that we cannot even hear. Speech has been closely studied and is relatively well defined, whilst music is a much wider class of sound. The French composer Edgar Varèse defined it as “music is organized sound”. This might seem abstract, but it can be used even in these kinds of technical solutions: it is a common approach to look for repetition in sound in order to classify it as music.

2.2.1 Speech and music

To distinguish between speech and music, features that differ between the two classes need to be used. A simple look at the waveforms of 1-minute excerpts of speech, pop music, classical music and opera (all examples taken from Swedish Radio broadcasts) already indicates large differences between the classes. The speech waveform in Figure 2 shows rapid changes in energy and amplitude that none of the music waveforms do. The heavily compressed waveform of The Killers’ song in Figure 3 seems to lack dynamics completely, while the classical piece in Figure 4 and the opera piece in Figure 5 have some short amplitude peaks and show large dynamic variations.


Figure 2. 1 minute of speech.

Figure 3. 1-minute excerpt of The Killers – All These Things.

Figure 4. 1 minute of classical music.

Figure 5. 1 minute of opera.

Even if the classes are easy to identify in the waveform, the exact position of the transitions can be difficult to detect. Figure 6 shows an excerpt from P3 Pop where a pop song abruptly stops and the host of the show starts speaking. The transition is marked with a vertical line.

Figure 6. Waveform representation of an excerpt of P3 Pop, with the transition from music to speech marked.

The changes in energy are even clearer when the RMS of the sound wave is plotted, as in Figure 7. The scales are not the same in the four examples, but the most interesting thing is the changes. In all the examples containing music, the RMS never goes down to zero and does not diverge greatly from the mean RMS. In the speech example, however, there are relatively many frames with zero or close-to-zero RMS, and the changes are rapid. What looks like large variations in the RMS values of the pop music in Figure 7 can be explained by the scale on the Y-axis, ranging from 0 to 0.1, while speech ranges from 0 to 0.25.


Figure 7. RMS graphs. The Y-axis shows the RMS value calculated from 20 ms frames and the X-axis shows the temporal location in s. Top left: speech. Top right: pop music. Bottom left: classical piece. Bottom right: opera piece.

Looking at the spectra of the four examples in Figure 8, it is clear that all the music examples have a higher peak in the low frequencies, although the peak occurs at different frequencies. This peak most likely corresponds to the fundamental frequency of the vocal components. The classical piece, which lacks vocals, does not have the same sharp peak as the other three examples. The speech example has more energy in the frequency range around 10 000 Hz than the music examples.


Figure 8. Spectrum plots, analysed using a Hanning window with a window size of 1024 samples. The Y-axis shows sound level (dB) and the X-axis shows frequency (Hz). From the top: speech, pop music, classical piece, opera piece.


The features described below are divided into three groups according to [5]: the simpler Standard Low-Level (SLL) features, the Frequency Cepstrum Coefficients, and the more advanced psychoacoustic features.

Online systems often use simpler features to avoid having to compute the FFT, which is a relatively costly computation compared to the SLL features. The most commonly used features in these systems are the Zero Crossing Rate and RMS.

2.2.2 Standard Low-Level (SLL) features

These include RMS, Zero Crossing Rate, Spectral Centroid, Spectral Rolloff, Band Energy Ratio, Flux (also called Delta Spectrum Magnitude), bandwidth, pitch and pitch strength.

Some features, like bandwidth and pitch, are self-explanatory. The other tested features are explained in the following sections.

2.2.2.1 RMS

RMS, or Root Mean Square, is a measure of the amplitude of a sound wave in one analysis window. It is defined as

    RMS = \sqrt{\frac{x_1^2 + x_2^2 + \dots + x_n^2}{n}}    (1)

where n is the number of samples within an analysis window and x_i is the value of sample i.
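As an illustration, Equation (1) can be computed with a few lines of Python (a sketch for clarity; the thesis implementations were written in MATLAB and C):

```python
import math

def rms(frame):
    """Root Mean Square amplitude of one analysis window (Eq. 1)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))
```

For example, `rms([0.0, 0.5, -0.5, 0.0])` gives about 0.354.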

2.2.2.2 Zero Crossing Rate

This measure counts the number of times the amplitude of the signal changes sign, i.e. crosses the x-axis, within one analysis window. The feature is defined as

    ZCR = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathrm{func}\{ s_t s_{t-1} < 0 \}    (2)

where s is the sound signal of length T measured in time, and func{A} equals 1 if A is true and 0 otherwise.

The Zero Crossing Rate feature is sometimes used as a primitive pitch detector for mono signals. It is also a rough estimate of the spectral content.
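A corresponding sketch of Equation (2), again illustrative rather than the thesis code:

```python
def zcr(s):
    """Zero Crossing Rate (Eq. 2): the fraction of consecutive sample
    pairs in the window whose product is negative, i.e. a sign change."""
    T = len(s)
    return sum(1 for t in range(1, T) if s[t] * s[t - 1] < 0) / (T - 1)
```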

2.2.2.3 Spectral Centroid

This feature is effective in describing the spectral shape of the audio, and is correlated with the psychoacoustic features sharpness and brightness. There are several definitions of the Spectral Centroid feature in previous work. In this study it is calculated as a weighted mean of the frequencies in the FFT of the signal:

    SC = \frac{\sum_{n=0}^{N-1} f(n)\, x(n)}{\sum_{n=0}^{N-1} x(n)}    (3)

where x(n) represents the magnitude of bin number n, and f(n) represents the center frequency of that bin.
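Equation (3) applied to a precomputed magnitude spectrum might look as follows (illustrative Python; the guard for an all-zero frame is an added assumption):

```python
def spectral_centroid(mags, freqs):
    """Weighted mean of bin center frequencies (Eq. 3), where mags[n] is
    the magnitude x(n) of bin n and freqs[n] its center frequency f(n)."""
    total = sum(mags)
    return sum(f * m for f, m in zip(freqs, mags)) / total if total else 0.0
```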

2.2.2.4 Spectral Rolloff

Like the Spectral Centroid, the Spectral Rolloff is a representation of the spectral shape of a sound, and the two are strongly correlated. It is defined as the frequency below which 85% of the energy in the spectrum lies. If K is the bin that fulfils

    \sum_{n=0}^{K} x(n) = 0.85 \sum_{n=0}^{N-1} x(n)    (4)

then the Spectral Rolloff frequency is f(K), where x(n) represents the magnitude of bin number n, and f(n) represents the center frequency of that bin.
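A discrete reading of Equation (4) can be sketched as a cumulative sum (using `>=` for the first bin reaching the target is an assumption, since exact equality rarely holds for discrete bins):

```python
def spectral_rolloff(mags, freqs, fraction=0.85):
    """Smallest bin frequency f(K) such that the cumulative magnitude up
    to bin K reaches `fraction` of the total (discrete version of Eq. 4)."""
    target = fraction * sum(mags)
    running = 0.0
    for m, f in zip(mags, freqs):
        running += m
        if running >= target:
            return f
    return freqs[-1]
```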

2.2.2.5 Flux (Delta Spectrum Magnitude)

The Flux, or Delta Spectrum Magnitude, feature measures the rate at which the spectral shape changes, or fluctuates. It is calculated by summing the squared differences of the magnitude spectra of two neighboring frames. This feature has shown good results in the SMD task in [14].

    F = \sum_{k=1}^{N/2} \left( |X_r[k]| - |X_{r-1}[k]| \right)^2    (5)

where N is the number of FFT points and X_r[k] is the STFT of frame r at bin k.
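Equation (5) reduces to a sum of bin-wise squared differences; an illustrative sketch:

```python
def flux(mag_prev, mag_cur):
    """Delta Spectrum Magnitude (Eq. 5): summed squared differences of the
    magnitude spectra of two neighboring frames, given per-bin magnitudes."""
    return sum((a - b) ** 2 for a, b in zip(mag_cur, mag_prev))
```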

2.2.3 Frequency Cepstrum Coefficients

The second group is the Frequency Cepstrum Coefficients (FCC), which include the Mel Frequency Cepstrum Coefficients (MFCC) and the Linear Frequency scaled Cepstrum Coefficients (LFCC). These are all power spectrum representation features calculated with different frequency scales.

The most frequently used coefficients for these systems are the MFCC. These are computed by taking the FFT of every analysis window, mapping the spectrum to the Mel scale, taking the base-10 logarithms of the powers, and then applying a Discrete Cosine Transform (DCT) to decorrelate the coefficients. [14]

The overall performance of the MFCC features was shown in [5] to be slightly better than that of the SLL features. This relates to the fact that MFCC performs better on pop and rock music, but somewhat worse on classical music, which contains very little vocal information.


2.2.4 Psychoacoustic features

These features are more closely based on our perception of sound, and are therefore called psychoacoustic.

Loudness is the sensation of signal strength, and is primarily a subjective measure for us to rank sounds from weak to strong. Loudness can be calculated (Calculated Loudness) and is then measured in sone. One sone is defined as the loudness of a pure 1000 Hz tone at 40 dB re 20 µPa [21].

Roughness is described in [5] as “the perception of temporal envelope modulations in the range of about 20–150 Hz, maximal at 70 Hz”, and is also said to be a primary component of musical dissonance.

Sharpness is a measure of the high-frequency energy in relation to the low-frequency energy. Sounds with a lot of energy in the higher frequencies and low energy levels in the lower frequencies are considered sharp.

2.2.5 Special features

In [1] the authors use a feature called Chromatic Entropy, which is a version of Spectral Entropy. The spectrum is first mapped to the Mel scale and then divided into twelve sub-bands with center frequencies that coincide with the frequencies of the chromatic scale. The energy in each sub-band is then normalized by the total energy of all the sub-bands. Lastly, the entropy of the normalized spectral energy is calculated as

    E = -\sum_{i=0}^{L-1} n_i \log_2(n_i)    (6)

where n_i is the normalized energy of sub-band i and L is the number of sub-bands.
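The normalization and entropy steps of Equation (6) can be sketched as follows (illustrative Python, not the authors' code):

```python
import math

def spectral_entropy(band_energies):
    """Entropy of the normalized sub-band energies (Eq. 6).
    Zero-energy bands contribute nothing, by the convention 0*log2(0) = 0."""
    total = sum(band_energies)
    probs = [e / total for e in band_energies]
    return -sum(p * math.log2(p) for p in probs if p > 0)
```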

The feature Modified Low Energy Ratio (MLER) is introduced in [8]. The feature exploits the fact that music shows little variation in the energy contour of the waveform, whilst speech shows large variations between voicing and frication. MLER is defined as the proportion of frames with RMS power less than a variable threshold within one second. The authors suggest that the threshold should lie in the interval [0.05%, 0.12%] for best performance.
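A sketch of the MLER idea follows. Note the assumption: how the threshold interval [0.05%, 0.12%] from [8] is anchored is not restated here, so this sketch takes it relative to the mean frame power of the one-second window, which is an illustrative choice rather than the paper's exact definition:

```python
def mler(frame_powers, ratio=0.001):
    """Proportion of frames in a one-second window whose RMS power lies
    below a small threshold (MLER sketch). The threshold is assumed to be
    `ratio` times the mean frame power of the window."""
    mean_power = sum(frame_powers) / len(frame_powers)
    threshold = ratio * mean_power
    return sum(1 for p in frame_powers if p < threshold) / len(frame_powers)
```

With 20 ms frames, one second corresponds to 50 frame powers per call; speech windows, with their near-silent pauses, yield higher values than music windows.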

In [4] a feature called Warped LPC-based Spectral Centroid (WLPC-SC) is introduced. The frequency analysis is mapped to the Bark scale and the centroid frequency is then computed by a one-pole LPC filter. This feature exploits the fact that speech has a low centroid frequency that varies with voiced and unvoiced speech, whilst music shows a different behavior.

2.2.6 Psychoacoustic pitch scales

Psychoacoustic scales are commonly used in Music Information Retrieval (MIR) systems. Speech and music are often well adjusted to our ears and therefore carry most of their information in the frequencies where our ears have the best resolution. The most used scale is Mel, but Bark and Equivalent Rectangular Bandwidth (ERB) are sometimes used as well. The one used in this paper, Mel frequency, is defined as

    f = 1127.01048 \times \log\left( \frac{f_1}{700} + 1 \right)    (7)

where f_1 is the original frequency [1].
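Equation (7) in code form (an illustrative sketch; the constant 1127.01048 implies a natural logarithm):

```python
import math

def hz_to_mel(f_hz):
    """Mel frequency of Eq. (7): 1127.01048 * ln(f/700 + 1)."""
    return 1127.01048 * math.log(f_hz / 700.0 + 1.0)
```

For instance, 0 Hz maps to 0 mel, and 700 Hz maps to roughly 781 mel.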

2.2.7 Extracting features

Usually a chosen set of features is extracted from each frame of the audio. The features are then often normalized by the mean value and standard deviation computed over a larger time unit, and then stored in a feature vector.

Features are used in two ways: either the extracted value itself is used, or its changes over time. When using the changes over time it is possible to calculate statistical features like variance and standard deviation. In [5] it is shown that using changes over time is more accurate than only using the absolute values of the features.

Only one feature is used in [1], [4] and [7], although they all use the advanced special features described earlier. Others have chosen to use a set of standard features. In [3] five different features are used: energy, ZCR, Spectral Entropy and the first two MFCCs. RMS and ZCR are used in [2].

2.3 Segmentation

In [1] and [4] a region-growing technique is used for the segmentation step. This technique is widely used in image segmentation, but can also be used for audio. A number of frames are selected as seeds. The feature vector of each seed is then compared to those of the neighboring frames. The segment grows with the neighboring frames as long as the difference in the features does not exceed a predefined threshold.

Other systems, like the one in [2], look for big changes between two neighboring 1-second frames. The feature vectors of the two neighboring frames are compared, and if they are sufficiently different, a segment border is detected. When a border is detected in a frame, the transition is marked within the frame with an accuracy of 20 ms.
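The border test of [2] might be sketched as below; the Euclidean distance measure and the threshold value are illustrative assumptions, since the text does not specify the metric:

```python
import math

def border_detected(vec_a, vec_b, threshold):
    """Compare the feature vectors of two neighboring 1-second frames and
    flag a segment border when they differ sufficiently. The Euclidean
    distance is an assumed choice of dissimilarity measure."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return dist > threshold
```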

2.4 Classification methods

When the segmentation process is done, each segment should be classified as either speech or music. In a more complex system, as in [3], more classes can be defined, such as silence or speech over music. The latter is often classed as speech in systems with only the two basic classes.

The extracted feature vector is used to classify each segment. A mean vector is calculated for the whole segment and is then compared either to results from training data or to predefined thresholds.


A method where the classification is based on the output of many frames together is proposed in [10]. Each second consists of 50 frames, and each frame is assigned a class by a quadratic Gaussian classifier. A global decision is then made based on the most frequently appearing class within that second.

2.4.1 Hidden Markov Models

HMMs are commonly used for classification. In [3] they are used together with a Bayesian Network Classifier. The feature vector sequence is used as input to the model, which has one state for each class. Only two classes are used in [3], one for speech and one for music. The transition probabilities between states are computed from training data and stored in the model.

In [6] an HMM with 24 states is used for speech, and another model is used for music. The use of three or four states could correspond to some phoneme classes, instead of having only one class for speech.

2.4.2 System learning

Different methods are used to train classification models. The Viterbi algorithm is used for HMM training in [11], and the Baum-Welch algorithm is another alternative for the system’s learning process.

2.4.3 Refined classification

A method to refine the results of the classification is described in [8]. Four states are used: one for speech, one for music, one for transitions to music and one for transitions to speech. If the classifier outputs a music segment whilst the system is in the speech state, the segment is stored on a stack. If the classifier keeps outputting music segments for a set time, the state changes to music and all the segments on the stack are classed as music. But if the classifier outputs a speech segment within that time, the system goes back to the speech state and all the segments on the stack are classified as speech. The accuracy of the classification is reported to increase by 6.5% when this method is used.

Refinement can be done in both online and offline systems, but can be made more efficient in the latter, where there are no demands on real-time output. The refinement technique described above needs a certain number of segments to perform well; with fewer segments, performance decreases drastically.
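The stack-based refinement can be sketched as a small state machine. The hold count and the policy of relabeling pending segments are illustrative assumptions, not the exact parameters of [8]:

```python
def refine(labels, hold=3):
    """State-machine refinement in the spirit of [8]: stay in the current
    class until `hold` consecutive segments of the other class appear;
    segments held on the stack are relabeled to whichever state wins."""
    out, state, stack = [], labels[0], []
    for lab in labels:
        if lab == state:
            out.extend([state] * (len(stack) + 1))  # stack reverts to state
            stack = []
        else:
            stack.append(lab)
            if len(stack) >= hold:                  # other class wins
                state = lab
                out.extend([state] * len(stack))
                stack = []
    out.extend([state] * len(stack))                # flush pending segments
    return out
```

A lone music segment inside a run of speech is thus relabeled as speech, while a sustained run of music flips the state.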

2.5 Evaluation methods

The results are often evaluated, as in [3], with the measures recall, precision and overall accuracy. In [3] recall is defined as the proportion of the frames of a specific class that were correctly classified, and precision is defined as the proportion of the frames classified as a specific class that actually belonged to that class. A total accuracy is then calculated as the total percentage of correctly classified data.
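These three measures follow directly from the definitions above (an illustrative sketch over per-frame label sequences):

```python
def recall_precision(truth, pred, cls):
    """Recall and precision for one class, as defined in [3]."""
    tp = sum(1 for t, p in zip(truth, pred) if t == cls and p == cls)
    recall = tp / sum(1 for t in truth if t == cls)
    precision = tp / sum(1 for p in pred if p == cls)
    return recall, precision

def accuracy(truth, pred):
    """Overall proportion of correctly classified frames."""
    return sum(1 for t, p in zip(truth, pred) if t == p) / len(truth)
```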


Systems can be optimized for either music or speech to raise the precision of that specific class, although this often decreases the accuracy.

2.6 Earlier results

Comparing earlier systems for SMD tasks is not easy. There is no standard database for evaluating them, as there is for speaker and speech recognition systems. This makes it hard to actually rank the existing systems. Most articles report systems with accuracies above 90%, and in [9] an accuracy as high as 98% is reported.


3 Evaluation and tests

This chapter presents the tests done within this work. The results of the tests are evaluated in order to find the best method for the specific task. The test phase was done in three steps: feature and classification tests, segmentation tests, and tests of the complete algorithm where both classification and segmentation were evaluated. The algorithm tests are presented at the end of chapter 4.

3.1 Test database

A test database was needed to run all the tests. Unfortunately, there is no standardized test database for SMD tasks, which makes the test results harder to compare. However, in this specific case the test database was constructed from material from Swedish Radio broadcasts to match the actual material that will be used as input when the algorithm is implemented in a production environment.

Material was selected from Swedish Radio's digital archive Digas to cover all kinds of genres. The material is made up of whole programs that were aired on radio and was selected in order to get a wide spread of included material.

Three sets of test audio were selected and extracted from the material. The first set consisted of 30-second audio files containing only speech or only music. The speech examples included female and male voices, interviews, phone interviews, sport commentary and more. The music examples ranged over many genres from different programs. These were used for the feature tests, to value each feature and calculate the correlation between the features. The second set consisted of 1-minute audio files containing both speech and music, with at least one transition between classes. The files were selected so that the transition could be anywhere within the file. These were used for the segmentation tests. The third and last set consisted of whole programs containing both music and speech segments. The length of these files varied from a couple of minutes up to 90 minutes. These were used to test the complete algorithm.

3.2 Tools

Audacity [15] was used to edit and convert audio files to construct the test database.

Sonic Visualiser [16], together with various Vamp plugins [17], has been used for early analysis of the test database. Sonic Visualiser is developed by the Centre for Digital Music, Queen Mary University of London [18], and is easy to use to get a quick look at how the values of features change in audio.

MATLAB has been used for testing and evaluating during the whole process. Many of the features have been extracted using the MIRtoolbox [19], developed by the University of Jyväskylä [20] in Finland, and the rest of the features were extracted using custom-written code.

The final algorithm is written in C because of its effectiveness and speed, using the libsndfile library [13] for reading wave files.

3.3 Feature tests

The first test was to check whether the selected features have any significance for the classification task. The tested features were chosen from earlier algorithms that performed the SMD task with good results, and a few were added based on personal hypotheses.

Some initial tests were also done with other features, such as flux, other rhythm features and different uses of MFCC. However, since these features did not show any potential for the SMD task, they were not included in these tests.

In these tests the features were tested for the classification purpose, and the correlation between features was measured. The chosen features were extracted from the test material in the first set, containing only one class, to get a distribution of values for each class. Most features were tested in five different ways: the extracted value, the variance, the standard deviation, the derivative and the standard deviation of the derivative. Both the standard deviation and the variance derive from the same data, since the variance is the square of the standard deviation. Because of this, only the standard deviation results are presented. Histograms of the results were then generated to visualize the distributions. Features that showed interesting results are further discussed later in this chapter.

The tested features were RMS amplitude, Zero-Crossing Rate (ZCR), Mel Frequency Cepstrum Coefficients (MFCC), Spectral Centroid (SC), Pulse Clarity (PC) and Modified Low Energy Ratio (MLER). These abbreviations will be used during the rest of this chapter, together with an abbreviation for the way the feature is used, following these naming conventions:

SD   Standard Deviation
D    Derivative
SDD  Standard Deviation of the Derivative

This means that the standard deviation of the Zero-Crossing Rate will be abbreviated ZCR|SD. A total of 29 different feature variations were tested in these tests.


3.3.1 Feature test results

Feature     Frame length   Accuracy
RMS         20 ms          0.639
RMS|SD      1 s            0.829
RMS|D       1 s            0.548
RMS|SDD     1 s            0.764
ZCR         20 ms          0.588
ZCR|SD      1 s            0.837
ZCR|D       1 s            0.550
ZCR|SDD     1 s            0.835
SC          1 s            0.581
SC|SD       30 s           0.792
SC|D        30 s           0.589
SC|SDD      30 s           0.958
MFCC1       20 ms          0.517
MFCC1|SD    1 s            0.548
MFCC1|D     1 s            0.525
MFCC1|SDD   1 s            0.553
MFCC2       20 ms          0.521
MFCC2|SD    1 s            0.548
MFCC2|D     1 s            0.525
MFCC2|SDD   1 s            0.553
MFCC3       20 ms          0.521
MFCC3|SD    1 s            0.548
MFCC3|D     1 s            0.525
MFCC3|SDD   1 s            0.553
MFCC4       20 ms          0.519
MFCC4|SD    1 s            0.548
MFCC4|D     1 s            0.525
MFCC4|SDD   1 s            0.553
MLER        1 s            0.969
PC          5 s            0.807

Table 1. Feature test results. Frame length is a time measure, and accuracy is shown as a proportion (percentage/100).

Results of the feature tests are shown in Table 1 above. The accuracy of each feature is a measure of how far the speech and the music distributions are from each other. This was calculated by finding the threshold value with the fewest misclassifications, by testing all threshold values in an interval specified for each feature. The number of misclassified frames for the best threshold was then counted and divided by the total number of frames, and the result was subtracted from 1:

A = 1 - \frac{\min(\mathrm{misclass})}{n_{\mathrm{Frames}}}    (8)

Features with accuracy results close to 50% can be considered random, containing no useful information for the classification. The threshold optimization described above is the reason that all features performed over 50%.

The five features with the highest accuracies were the Modified Low Energy Ratio (97%), the standard deviation of the derivative of the Spectral Centroid (96%), the standard deviation of the Zero-Crossing Rate (84%), the standard deviation of the Root Mean Square (83%) and the Pulse Clarity (81%). Surprisingly, none of the MFCC features showed any useful results in these tests; they got the worst results of all features. The extracted values for each MFCC parameter gave accuracies just above 50% with the optimal threshold. Since small variations can depend on the rather small test database, the MFCC features themselves can be regarded as useless for these kinds of classification tasks. The reason that they got over 50% at all is that the best threshold value is sought, which inflates the accuracy. However, MFCC features can still be used in other ways, discussed later in this report.

The correlation between the top features cannot be calculated directly, since they use different lengths of analysis frames and therefore generate different numbers of data points. The MLER feature uses the pauses in speech to discriminate between the classes, and so does the ZCR, so they will show high correlation. Unfortunately, all of these features have problems with the same kind of audio, namely speech with background noise of some kind, which is often classified as music. The correlation between the top features has therefore been calculated by comparing the mean values of each feature for every file in the first set of the test database.

         MLER   SC|SDD   ZCR|SD   RMS|SD   PC
MLER     -
SC|SDD   0.93   -
ZCR|SD   0.97   0.97     -
RMS|SD   0.97   0.93     0.97     -
PC       0.87   0.77     0.77     0.83     -

Table 2. Cross-correlation of the top features from the accuracy tests.

As seen in Table 2, all the top features are strongly correlated. The Pulse Clarity feature shows the least correlation with the others, while the standard deviation of the Zero-Crossing Rate showed correlations as high as 97% with both the Modified Low Energy Ratio and the standard deviation of the derivative of the Spectral Centroid.

3.3.2 Analysis of Low-level features

The low-level features are interesting because they incur relatively low computation costs. However, both RMS and ZCR showed weak results in the feature tests. RMS performed somewhat better, which can be seen in the histograms below, where big differences between the red and the blue distributions give good discrimination possibilities. Both RMS and ZCR were extracted for every frame. Speech typically contains many frames with RMS values close to zero, while music has almost no zero-energy frames and a peak at about 0.4. The close-to-zero energy frames in speech represent the pauses between syllables that always exist in speech recordings without background sound. The ZCR values are centered round the same mean value, but music has a higher peak, i.e. a lower standard deviation. This is also what gives the good results for ZCR|SD and ZCR|V.

Figure 9. Histograms of feature values. Left: RMS. Right: ZCR. Blue shows music and red shows speech. The Y-axis is the number of occurrences of the bins centered round the X-values.

3.3.3 Analysis of MFCC

The MFCC features have been frequently used in MIR algorithms. However, as seen in the figure below, all the MFCC values are centered round the same values, with only some differences in the standard deviations. As seen in the test results, these variations of the features also scored higher than the extracted values, but still with very low top scores.


Figure 10. Histograms of feature values. Top left: MFCC1. Top right: MFCC2. Bottom left: MFCC3. Bottom right: MFCC4. Blue shows music and red shows speech. The Y-axis is the number of occurrences of the bins centered round the X-values.

3.3.4 Analysis of Modified Low Energy Ratio

As seen in Table 1, the absolute values of the features are often not a good way to discriminate between the two classes. A better way might be to use the feature changes over time, as done by the MLER feature.

Figure 11. MLER. Left: threshold optimization; the Y-axis shows the correctly classified frame percentage and the X-axis shows the threshold. Right: with threshold 0.26. Blue shows music and red shows speech. The Y-axis is the number of occurrences of the bins centered round the X-values.


Figure 11 shows the Modified Low Energy Ratio as described earlier in the report. When using a threshold for low energy at 0.26 of the mean RMS value, the classification achieved 97% accuracy. If the MLER equals zero, the frame is classified as music; otherwise it is classified as speech. Almost all the speech frames with zero MLER come from sport commentary with an audience in the background. The left graph shows the optimization of the total error rate. In some applications it might be better to optimize for either speech or music.

3.4 Segmentation tests

The segmentation tests were done on the audio files containing at least one transition between classes. The exact positions of the transitions within the files were noted as ground truth for the tests. These positions are not absolute, and a small test showed that different people would place them differently. Sometimes a transition can be up to a half-second-long segment of silence, and it would be acceptable for the segmentation to place the transition anywhere in this silence, since silence was not defined as either speech or music. For these two reasons, a deviation of up to 100 ms was allowed and counted as a clean hit. A deviation of 0.1-1 second counted as a hit, and the distance from the border of the clean-hit segment was calculated. If the segmentation marked a transition more than 1 second away from the ground truth, this counted as a miss.

Three measures were then used to evaluate the segmentation techniques:

Hit efficiency: a measure of how many hits were found. The misses are subtracted from the hits, and the difference is divided by the total number of transitions.

\mathrm{Hit}_{\mathrm{eff}} = \frac{\mathrm{Hits} - \mathrm{Misses}}{\mathrm{Transitions}}    (9)

Hit accuracy: an accuracy measure where only the hits are considered. All the distances are added, and a mean value is calculated by dividing by the total number of hits. A clean hit counts as zero distance.

\mathrm{Hit}_{\mathrm{acc}} = \frac{1}{N} \sum_{n=1}^{N} \left| \mathrm{Hitpos}_n - \mathrm{Transpos}_n \right|    (10)

where N is the number of hits.

Hit rate: a simple measure where, once again, only the hits are considered. The total number of hits is divided by the total number of transitions in the test files.

\mathrm{Hit}_{\mathrm{rate}} = \frac{\mathrm{Hits}}{\mathrm{Transitions}}    (11)


None of these measures considers the computation times; these are discussed in the test evaluations below.

In the first part of the test, the region growing technique was tested against the neighboring difference technique, both described in section 2.3. Both techniques were tested with the same features.

Technique                Hit efficiency   Hit accuracy   Hit rate
Region growing           81%              79 ms          100%
Neighboring difference   94%              44 ms          99%

Table 3. Segmentation technique test results.

As seen in Table 3, the neighboring difference technique achieved better results for both the hit efficiency and the hit accuracy. The hit rate measure is harder to evaluate, but results as close to 100% as possible are good. The neighboring difference technique did not detect 1% of the transitions but, as can be seen in the hit efficiency measure, it did not produce as many misses either. A choice was made to pursue only the neighboring difference technique because of its efficiency and accuracy.

Feature   Hit efficiency   Hit accuracy   Hit rate
MLER      95%              19 ms          99%
SC|SDD    93%              70 ms          99%
ZCR|SD    88%              24 ms          99%
RMS|SD    87%              27 ms          99%
PC        88%              72 ms          99%

Table 4. Segmentation feature test results, showing the top 5 features.

In the second test, all features from the feature tests were tested with the neighboring difference technique. The same kinds of features performed well in these tests too, although features with short analysis windows and short frames performed better and showed better hit accuracy results. Table 4 shows that the top features all detected 99% of the transitions, but the MLER feature achieved the best results for both the hit efficiency and the hit accuracy measures. Further tests were then done with the MLER feature; both the efficiency and the accuracy improved when using shorter frame lengths together with a combination of the mean RMS and the variance of the RMS. A hit efficiency of 96% and a hit accuracy of 17 ms were then achieved.

3.5 Design criteria

There are many aspects to consider when choosing features to use in the algorithm. The test results need to be thoroughly analyzed regarding these aspects:

• Accuracy of the feature, i.e. its ability to discriminate between the two classes, needed for both classification and segmentation. Ideal features have similar values within one class and a clear difference from other classes.


• Computational costs. Costly operations should be avoided to make the discrimination task efficient and cheap. Even offline systems need to be fast to improve cost efficiency.

• Correlation between chosen features. The use of two different features needs to be motivated by improved results. If two features are highly correlated, the improvements in accuracy will be relatively costly.

• Insensitivity to noise in the input signal.

The MLER feature was chosen for its excellent accuracy and because of how well it combines with the neighboring difference segmentation technique. Both the segmentation and the classification are based on one single low-level feature, the RMS amplitude. This makes the computational costs very low. Since the high-performing features were all highly correlated, adding another feature was not motivated: the improvements in accuracy would be costly, and another classification method would be needed. When the RMS amplitude is normalized over the file maximum it is also insensitive to sound levels, but background noise will still be a problem in the classification.

The three calculated measures from the segmentation tests were considered when choosing the segmentation method. The hit rate was considered the most important, since misses (transition detections where no transition is within one second) can be discarded in an algorithm using segmentation refinement methods. The hit efficiency was only considered when two tests showed the same hit rate. The computation times, too, were considered when choosing the final method for the algorithm.

The neighboring difference technique showed better results for all features and was therefore the strongest candidate. It is also a good match when the MLER is used for the classification task, since it achieved good hit accuracy results when using RMS. Computation times are largely reduced, since both the classification and the segmentation use the same extracted features.


4 Algorithm

This chapter contains a detailed description of the final algorithm that was coded and delivered to Swedish Radio. The algorithm is written in C and uses the libsndfile library [13] to read and write wave files. The discrimination is done on audio files, and hence this is an offline procedure.

4.1 Signal preprocessing

Before anything else is done with the audio, a few preprocessing steps are performed on the audio signal.

There is no added information in the difference between two channels that can be used for the classification or the segmentation. It is therefore desirable to have a mono signal, to simplify later processing. The algorithm checks the number of channels of the audio; if the signal has more than one channel, it is mixed down to mono.

The amplitude of the signal is then normalized to the maximum amplitude of the whole file, to remove any effect the overall amplitude level might have on the feature extraction.

4.2 Feature extraction

After the audio signal has gone through the preprocessing, it is split into 21 ms non-overlapping frames. The RMS amplitude is then calculated for each frame using equation 1.

Once the RMS amplitude has been calculated, the frames are grouped together to form 1-second (48 short frames) analysis frames. These are also non-overlapping. Four features based on the RMS amplitude values are then extracted from each 1-second frame: the mean RMS, the variance of the RMS, a locally normalized variance of the RMS and the Modified Low Energy Ratio (MLER). The normalized variance is the variance of the RMS divided by the mean RMS.

The Modified Low Energy Ratio is the proportion of low-energy short frames within the 1-second frame. The threshold for low energy depends on the mean RMS amplitude: the mean RMS amplitude in the analysis frame is multiplied by a predefined value determined by the test results. Speech contains many small pauses between syllables and words and therefore has a higher MLER than music. This is used later to classify each segment.

4.3 Segmentation

The task of the segmentation part of the algorithm is to find the exact positions of transitions between the two classes. The segmentation is based, just as the classification, on the features extracted earlier. Every 1-second frame is examined to look for candidate transition frames, and then an exact position of the transition is found.

This is a modified version of the segmentation done in [2]. The advantage of this method is that it uses the same base feature, the RMS amplitude, as the classification. This means that no further feature extraction is needed, which saves computation time and minimizes the reading of the audio files. The RMS amplitude is used in another way, though, since the MLER feature used for classification requires longer analysis frames, while the mean and variance of the RMS change as quickly as the audio classes do.

The first step is done by looking at the frame before and the frame after the examined frame. If the two neighboring frames are different, the examined frame is likely to contain a transition. The transition can be anywhere within that second, and the exact position is determined in the next step. A problem will occur if the class changes twice within the examined frame: then the two neighboring frames will not differ enough for the frame to be chosen as a transition candidate. However, these kinds of errors are not corrected in the segmentation, since segments shorter than 2 seconds will be removed later.

Figure 12. Waveform with a transition from speech to music. Seconds are marked with vertical lines.

The comparison between neighboring frames is based on the mean and the variance of the RMS values. Since the distribution of the amplitude of the audio signals is a Laplacian distribution, as shown in [2], a probability density function of the χ² type is used, defined as

p(x) = \frac{x^a e^{-x/b}}{b^{a+1}\,\Gamma(a+1)}    (12)

where x ≥ 0, Γ is the gamma function and the two parameters, a and b, are defined as

a = \frac{\mu^2}{\sigma^2} - 1 \quad \text{and} \quad b = \frac{\sigma^2}{\mu}    (13)

where μ is the mean RMS and σ² is the RMS variance. The similarity measure is based on the probability density functions:

p(p_1, p_2) = \int \sqrt{p_1(x)\,p_2(x)}\,dx    (14)

where p_1 and p_2 refer to the probability density functions of the two compared frames. When the χ² distribution in equation (12) is inserted in equation (14), this gives a similarity measure calculated as

p(p_1, p_2) = \frac{\Gamma\left(\frac{a_1+a_2}{2}+1\right)\, 2^{\frac{a_1+a_2}{2}+1}\, b_1^{\frac{a_2+1}{2}}\, b_2^{\frac{a_1+1}{2}}}{\sqrt{\Gamma(a_1+1)\,\Gamma(a_2+1)}\,\left(b_1+b_2\right)^{\frac{a_1+a_2}{2}+1}}    (15)

Since the measure is calculated from two frames with one frame between them, the examined frame i, the dissimilarity measure is defined as

\mathrm{Dissim}(i) = 1 - p(p_{i-1}, p_{i+1})    (16)

This will give high probabilities of change even for the surrounding frames. As seen in Figure 12, out of the five seconds marked by the vertical lines, seconds 2 and 4 will differ most in mean and variance of RMS. However, seconds 3 and 5 will also give high values in the dissimilarity measure. To dampen the effect of this error, a filter needs to be applied. The dissimilarity value is therefore locally normalized over 5 seconds with the examined frame in the centre. The normalization is calculated by

\mathrm{Dissim}_{\mathrm{norm}}(i) = \mathrm{Dissim}(i) \cdot \frac{\max\left(0,\; D(i) - \frac{D(i-2) + \dots + D(i+2)}{5}\right)}{\max\left(D(i-2), \dots, D(i+2)\right)}    (17)

where D denotes the dissimilarity measure.

The dissimilarity measure is multiplied by the positive difference between the examined frame's value and the mean of the neighborhood; if the difference is negative, it is set to zero. This is then divided by the maximum value of the neighborhood.

A threshold for the normalized dissimilarity value is then set according to the results on the test material, to determine which frames are selected as transition candidates. The threshold is variable and depends on the variance of the RMS in the neighboring frames.

When the candidate transition frames have been chosen, an exact position for the transition needs to be found. This is done in a way similar to the last step: for every 20 ms frame, the previous and the next one-second windows are compared. A value is given for every 20 ms frame for the probability of change, and the frame with the highest probability is marked as the exact position of the transition.

4.4 Classification

Only the MLER feature is used for the classification. A predefined threshold is set; all segments with a higher average MLER than the threshold are classified as speech, and everything below the threshold is classified as music. The use of only one feature reduces the computation time. The algorithm needs no training material to work, since the threshold is set according to the results on the test material.


4.5 Refinement

Simple refinements of the segments are done after the classification. If two consecutive segments are given the same class, they are merged and the transition between them is erased.

4.6 Output

When all the processes (segmentation, classification and refinement) are done, the results are ready to be output. The algorithm creates a simple text file with one line for each segment. Each line contains the exact position of the start of the segment and a binary number showing the class of the segment. The lines could look like this:

1 – 0.000000
0 – 9.770833
1 – 27.098166

where the 0 stands for speech (1 for music) and 9.770833 is the exact position of the transition measured in seconds. The position of the transition is given in seconds for easier handling in other applications: if the position were marked with an exact frame, the third-party application would also have to know the sample frequency of the audio. The transition frame is easily calculated by

F = t \cdot sr    (18)

where t is the position of the transition measured in seconds and sr is the sample rate of the audio.

A Flash player can then read the text file and mark the segments in its navigation bar, as seen in Figure 13. This could be used as a guide when editing the audio, or simply to look for errors in the output.

Figure 13. Flash player with marked segments in the navigation bar. Green shows speech segments and white shows music segments.

The Broadcast Wave Format contains a marker chunk, which could be used to mark the transition points of the segments. This is not yet integrated in the application, but could be done for specific implementations.


4.7 Results

The algorithm tests were performed on the finished algorithm. Audio files containing full-length programs from Swedish Radio were used as input, and both the segmentation and the classification were evaluated at the same time. The resulting accuracy is the percentage where the right class is found at the right time: the length of the correctly classified audio is divided by the total length of the program.

          Speech   Music
Speech    95.4%    4.6%
Music     1.9%     98.1%

Table 5. Algorithm results. The left column shows the input and the top row shows the output of the algorithm.

The left column of the table shows the class of the input audio, and the top row shows the class output by the algorithm. Speech reaches a lower accuracy because of the sport commentary segments, which are often classified as music, while music is correctly classified as often as 98.1% of the time.

        Right    Wrong
Total   97.3%    2.7%

Table 6. Summarized results for all inputs to the algorithm.

This gives a total accuracy of 97.3%, since the test material contained more music than speech.

The computation time of the algorithm varies depending on the format of the input and the number of input channels. The computation time did not exceed 1% of the length of the audio for any of the test files.


5 Conclusions

The final algorithm gives an accuracy of over 97% in the tests performed with material from Swedish Radio. This matches the results reported in earlier work, yet without any advanced features that require long computation times. 97% is also enough for most applications. In some applications the misclassified audio still needs to be considered; since we know from the tests what kind of audio gives low accuracies, this can be handled by manually discriminating those files.

A graphical interface where the segments are marked could be of good use for manual work with the audio. The algorithm does the discrimination job, but the results might still need refinement. This is true for an application for automatic editing of pod material, where the results of the discrimination can be used as a guide, but some editing, like cross-fades, is still needed to get a good-sounding result.

Offline systems benefit most from faster computation times. Real-time, online systems still have to use the same length for the analysis window to gather the needed statistics, and can benefit only by being able to use more and more advanced features. Offline systems, on the other hand, can both use more advanced features to achieve higher accuracy and at the same time compute faster.


6 Future work

There are still endless features and feature combinations to be tested for this task. The features tested during this work are still quite simple; even the modified special features, like the one used in the algorithm, are based on only one low-level feature. Further tests could also be done with MFCC features, even though they showed poor results in the present tests. Combinations of different MFCC features have been tested in earlier work, and further exploring relations between different MFCCs, such as distances between the second and third coefficients, could give good results.

More complex features, like pitch curve extraction, have been discussed and tentatively tested, but there is still much to explore in this area. Pitch features should be a good complement to the MLER feature, since they work on different aspects of the sound. Features based on rhythm could also be a good complement to the MLER feature.

Adding subclasses to both the music and the speech classes would be useful for many applications. Swedish Radio has already requested the possibility to discriminate between male and female speakers. Music could be split into genre sub-classes for further discrimination. There is already some work done on genre classification, but more research is needed to create a useful application.

Constructing a low-cost real-time discriminator that could be inserted in car radio receivers could be another kind of project. Listening in cars often involves noisy environments, where speech needs to be amplified more than music to increase audibility, and different listening environments demand different amplification of speech. In such an application, the processing cannot be done in the transmitter and needs to be done in real time.


7 Acknowledgements

This work has been done with much help from both Swedish Radio and the Royal Institute of Technology. Special thanks to:

My supervisor at Swedish Radio, Björn Carlsson, for all the help with technical questions about radio and also for all the encouragement during the work.

My supervisor at the Royal Institute of Technology, Anders Friberg, for all the feedback and ideas, and for all the specialist knowledge in feature extraction and classification.

Hasse Wessman and Lars Jonsson at Swedish Radio, Technical Development, for giving me the opportunity to do this project and an inspiring working environment.

The rest of the staff at Swedish Radio, Technical Development, for feedback and motivation.

My friend John Häggkvist, who has helped me out during the coding of the algorithm.


8 References

[1] Pikrakis, A., Giannakopoulos, T. & Theodoridis, S. "A computationally efficient speech/music discriminator for radio recordings", University of Victoria, ISBN 1-55058-349-2, pp. 107-110, 2006.

[2] Panagiotakis, C. & Tziritas, G. "A Speech/Music Discriminator Based on RMS and Zero-Crossings", IEEE Transactions on Multimedia, vol. 7(1), pp. 155-166, Feb. 2005.

[3] Pikrakis, A., Giannakopoulos, T. & Theodoridis, S. "Speech/Music Discrimination for radio broadcasts using a hybrid HMM-Bayesian Network architecture", in Proc. of the 14th European Signal Processing Conference (EUSIPCO-06), September 4-8, 2006, Florence, Italy.

[4] Muñoz-Expósito, J. E., Garcia-Galán, S., Ruiz-Reyes, N., Vera-Candeas, P. & Rivas-Peña, F. "Speech/Music discrimination using a single warped LPC-based feature", in Proceedings of the International Symposium on Music Information Retrieval, 2005.

[5] McKinney, M. F. & Breebaart, J. "Features for Audio and Music Classification", in Proceedings of the International Symposium on Music Information Retrieval, 2003.

[6] Karnebäck, S. "Speech/Music Discrimination Using Discrete Hidden Markov Models", TMH-QPSR, KTH, vol. 46, pp. 41-59, 2004.

[7] Alnabadi, M. S. "Real Time Speech Music Discrimination Using A Single Feature", Durham University School of Engineering, 2007.

[8] Wang, W. Q., Gao, W. & Ying, D. W. "A Fast and Robust Speech/Music Discrimination Approach", in Proceedings of the International Conference on Information, Communications and Signal Processing, 2003.

[9] Saunders, J. "Real-time discrimination of broadcast speech/music", in Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, 1996.

[10] El-Maleh, K., Klein, M., Petrucci, G. & Kabal, P. "Speech/Music discrimination for multimedia applications", ICASSP 2000, 2000.

[11] Ajmera, J., McCowan, I. & Bourlard, H. "Speech/Music segmentation using entropy and dynamism features in a HMM classification framework", Speech Communication 40, pp. 351-363, 2003.

[12] Swedish Radio web page, http://www.sr.se/sida/default.aspx?ProgramId=2438. Retrieved 6/1 2010.

[13] libsndfile web page, http://www.mega-nerd.com/libsndfile/. Retrieved 6/1 2010.

[14] Burred, J. J. "An Objective Approach to Content-Based Audio Signal Classification", Technische Universität Berlin, 2003.

[15] Audacity web page, http://audacity.sourceforge.net/. Retrieved 24/2 2010.

[16] Sonic Visualiser web page, http://www.sonicvisualiser.org/. Retrieved 24/2 2010.

[17] Vamp plugins web page, http://vamp-plugins.org/. Retrieved 24/2 2010.


[18] Centre for Digital Music web page, http://www.elec.qmul.ac.uk/digitalmusic/. Retrieved 24/2 2010.

[19] MIRtoolbox web page, https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox. Retrieved 24/2 2010.

[20] University of Jyväskylä web page, https://www.jyu.fi/en/. Retrieved 24/2 2010.

[21] Leijon, A. "Sound Perception: Introduction and Exercise Problems", Royal Institute of Technology, Stockholm, 2007.

