Automatic speech/music
discrimination in audio files
Lars Ericsson
Master’s thesis in Music Acoustics (30 ECTS credits)
at the School of Media Technology, Royal Institute of Technology, 2009
Supervisor at SR was Björn Carlsson. Supervisor at CSC was Anders Friberg.
Examiner was Sten Ternström
Automatic speech/music discrimination in audio files
Abstract

This master's thesis presents an algorithm for discrimination of speech and music in audio files. The algorithm is made for Swedish Radio and is therefore suited to their needs and optimized for their material. A series of tests on acoustic features extracted from Swedish Radio broadcasts was performed to determine the best method for the discrimination, and these tests were evaluated to find the method most suited to the task. Methods based on the RMS amplitude of the signal were chosen for both classification and segmentation. A feature for the proportion of low energy, which exploits the small pauses that tell speech from music, was used for the classification, and a similarity measure based on the mean and variance of the RMS was used to find the transition points for the segments. This resulted in an algorithm that can discriminate speech from music in Swedish Radio broadcasts with an accuracy of 97.3%.
Automatisk diskriminering av tal och musik i ljudfiler
Sammanfattning

This thesis presents an algorithm for discrimination of speech and music in audio files. The algorithm is made for Sveriges Radio and is therefore designed for their needs and optimized for their material. To arrive at the best method for the discrimination, tests were performed on acoustic features extracted from material in SR's digital archive. These tests were then evaluated to find the method best suited for the purpose. For both classification and segmentation, methods based on the RMS amplitude of the signal were chosen. A feature for the proportion of low energy, which exploits the short pauses that distinguish speech from music, was chosen for classifying the segments, and a similarity measure based on the mean and variance of the RMS was chosen for finding the segment boundaries. This resulted in an algorithm that can discriminate speech from music in SR's programming with an accuracy of 97.3%.
Contents
1 Introduction
  1.1 Speech/Music discrimination
  1.2 Swedish Radio
  1.3 Goal
  1.4 Method
  1.5 Limitations
  1.6 Overview of the paper
2 Background
  2.1 System structure
    2.1.1 Online systems
    2.1.2 Offline systems
  2.2 Features and feature extraction
    2.2.1 Speech and music
    2.2.2 Standard Low-Level (SLL) features
      2.2.2.1 RMS
      2.2.2.2 Zero Crossing Rate
      2.2.2.3 Spectral Centroid
      2.2.2.4 Spectral Rolloff
      2.2.2.5 Flux (Delta Spectrum Magnitude)
    2.2.3 Frequency Cepstrum Coefficients
    2.2.4 Psychoacoustic features
    2.2.5 Special features
    2.2.6 Psychoacoustic pitch scales
    2.2.7 Extracting features
  2.3 Segmentation
  2.4 Classification methods
    2.4.1 Hidden Markov Models
    2.4.2 System learning
    2.4.3 Refined classification
  2.5 Evaluation methods
  2.6 Earlier results
3 Evaluation and tests
  3.1 Test database
  3.2 Tools
  3.3 Feature tests
    3.3.1 Feature test results
    3.3.2 Analysis of Low-level features
    3.3.3 Analysis of MFCC
    3.3.4 Analysis of Modified Low Energy Ratio
  3.4 Segmentation tests
  3.5 Design criteria
4 Algorithm
  4.1 Signal preprocessing
  4.2 Feature extraction
  4.3 Segmentation
  4.4 Classification
  4.5 Refinement
  4.6 Output
  4.7 Results
5 Conclusions
6 Future work
7 Acknowledgements
8 References
List of Abbreviations
DCT   Discrete Cosine Transform
ERB   Equivalent Rectangular Bandwidth
FFT   Fast Fourier Transform
LFCC  Linear Frequency scaled Cepstrum Coefficients
LPC   Linear Predictive Coding
MFCC  Mel Frequency Cepstrum Coefficients
MLER  Modified Low Energy Ratio
RMS   Root Mean Square
SC    Spectral Centroid
SLL   Standard Low Level (features)
SMD   Speech/Music Discrimination
SR    Swedish Radio
ZCR   Zero Crossing Rate
1 Introduction

This chapter includes an overview of the task, as well as the purpose, method and limitations of the work. It also gives a brief view of the rest of the report.

1.1 Speech/Music discrimination

The purpose of speech/music discrimination (SMD) systems is to divide audio into music and speech segments and to classify them, whether the discrimination is done in real time or on recorded audio files.

The speech/music discrimination task is an important part of Automatic Speech Recognition (ASR) systems, where it is used to disable the ASR when music or other classes of audio are present during automatic transcription of speech.

SMD systems are also useful for bit-rate coders. Speech coders achieve better results for speech coding than music coders do, and vice versa; it is therefore important to discriminate between the two audio classes in order to select the right type of bit-rate coding.

An SMD system can also be of great importance when indexing sound and even video. Apart from detecting speech and music in audio files, the same algorithm can be applied to the audio track of TV shows or movies. This can later be used for indexing the video material, making it possible to jump straight to a desired part in, for example, on-demand video solutions for the web.

1.2 Swedish Radio

SR is a non-commercial, public service radio broadcaster with over 40 radio channels, including four national FM channels (P1, P2, P3 and P4) and 28 local channels. P4 is the biggest radio channel in Sweden. SR also offers more than 10 channels online on their website, together with an archive of all broadcast programs available on demand. Broadcasts are also available via short wave, medium wave and satellite. [12]

1.3 Goal

A system that effectively discriminates between speech and music will be useful for Swedish Radio in many of the above-mentioned applications. The goal is therefore to create an algorithm that can do this task accurately for the material produced by Swedish Radio. To meet the requirements of a cost-efficient, accurate algorithm, an offline system structure was chosen.

1.4 Method

The project started by reviewing relevant literature and articles on previous work, which made it possible to enter the test phase with knowledge of the area. In the test phase, many features and discrimination methods were tested and evaluated in order to find the method best suited for this specific task. The chosen method was then coded in a way that permitted integration into existing systems at Swedish Radio.
1.5 Limitations

This work focused on the discrimination between speech and music, and did not attempt any finer discrimination within the classes. Thus, the algorithm will not discriminate between different kinds of speakers, such as women, men or children, and will not be able to tell different voices apart. Music is treated as a single class and is not further divided into sub-classes for different genres.

1.6 Overview of the paper

The paper starts with a background chapter that introduces the area of Music Information Retrieval and gives an overview of earlier work in the speech/music discrimination area. The chapter also presents features that are commonly used for these kinds of systems. Chapter 3 contains all the tests done within this work; it also presents the test results and ends with an evaluation of them. Chapter 4 gives a detailed description of the finished algorithm, and the paper ends with a chapter containing the final conclusions drawn from the project.
2 Background

This chapter gives an introduction to the part of the Music Information Retrieval research area that has been used during this project. It also gives a brief look at earlier work.

2.1 System structure

Earlier systems developed for SMD tasks have different structures. They can be divided into two main groups: online systems, where the discrimination is made in real time, and offline systems, where the discrimination is made on audio files. Both groups have their advantages; online systems are better suited for live purposes, while offline systems can be made more accurate and faster. Different tasks require different systems, suited for specific purposes.

2.1.1 Online systems

In online systems, both the segmentation and the classification need to be done at the same time. They can even be regarded as one task, where the output of the classifier is used for the segmentation. This is all done in real time. "Real time" means that the system outputs results continuously as the input audio stream comes in, but with a delay of at least one analysis frame.
Online systems often concentrate on finding large changes in the audio in order to find the borders of speech or music segments. Saunders [9] does this by dividing the audio into non-overlapping 16 ms frames (256 samples at 16 kHz), from which simple features are extracted, and by using a longer 2.4 second analysis frame (150 x 16 ms frames) for the statistical features used in the classification (see the next section for more on features). A multivariate Gaussian classifier then does the classification.
Another real-time system is described in [7]. A typical frame size in earlier online systems is around 20 ms, and the analysis frame varies from half a second up to three seconds. The use of statistical features demands longer analysis frames to get good results. The short frame is used to extract features, and the longer analysis frame is used to extract statistics of these features, such as variance and mean values.

2.1.2 Offline systems

The structure of offline systems varies considerably. The most common system has three steps plus pre- and post-processing, as seen in Figure 1. First, the audio is divided into frames; from each frame a set of features is extracted and stored in a feature vector. In the second step the system segments the sound, and in the third step the segments are classified using some kind of classification method to decide whether a segment consists of speech or music. The SMD task is often done in more steps in an offline system than in an online system. The segmentation can be refined, as in [8], by using the fact that neighboring segments have a high probability of containing the same class.
Figure 1. Block diagram of an offline system structure.

In [8] some classification is done during the segmentation, and the more difficult segments are left for another, more complex classification method. This saves computation time, since a simpler classifier makes the first classification. Other systems classify large segments and then divide the segments in which a border is found into smaller segments, which is also a way to save computation time.

Offline systems often use a frame length of around 20 ms and an analysis window length of around 1 second. Shorter frames make the features more sensitive to noise, while longer frames might include too many phonemes.
2.2 Features and feature extraction

Sound has many features, some that our ears can pick up and some that we cannot even hear. Speech has been closely studied and is relatively well defined, whilst music is a much wider class of sound. The French composer Edgar Varèse defined music as "organized sound". This might seem abstract, but it can be used even in these kinds of technical solutions: it is a common approach to look for repetition in sound in order to classify it as music.

2.2.1 Speech and music

To be able to distinguish between speech and music, features that differ between the two classes need to be used. A simple look at the waveforms of 1 minute excerpts of speech, pop music, classical music and opera (all examples taken from Swedish Radio broadcasts) already indicates large differences between the classes. The speech waveform in Figure 2 shows rapid changes in energy and amplitude that none of the music waveforms does. The heavily compressed waveform of The Killers' song in Figure 3 seems to lack dynamics altogether, while the classical piece in Figure 4 and the opera piece in Figure 5 have some short amplitude peaks and show large dynamic variations.
Figure 2. 1 minute of speech.

Figure 3. 1 minute excerpt of The Killers – All These Things.

Figure 4. 1 minute of classical music.

Figure 5. 1 minute of opera.

Even if the classes are easy to identify in the waveform, the exact position of the transitions can be difficult to detect. Figure 6 shows an excerpt from P3 Pop where a pop song abruptly stops and the host of the show starts speaking. The transition is marked with a vertical line.

Figure 6. Waveform representation of an excerpt of P3 Pop with the transition from music to speech marked.
The changes in energy are even clearer when the RMS of the sound wave is plotted, as in Figure 7. The scales are not the same in the four examples, but the most interesting thing is the changes. In all the examples containing music, the RMS never goes down to zero and does not diverge largely from the mean RMS. In the speech example, however, there are relatively many frames with an RMS at or close to zero, and the changes are rapid. What looks like large variations in the RMS values of the pop music in Figure 7 can be explained by the scale on the Y-axis, which ranges from 0 to 0.1, while the speech example ranges from 0 to 0.25.
Figure 7. RMS graphs. The Y-axis shows the RMS value calculated from 20 ms frames and the X-axis shows the temporal location in seconds. Top left: speech. Top right: pop music. Bottom left: classical piece. Bottom right: opera piece.

Looking at the spectra of the four examples in Figure 8, it is clear that all the music examples have a higher peak in the low frequencies, although the peak occurs at different frequencies. This peak most likely corresponds to the fundamental frequency of the vocal components. The classical piece, which lacks vocals, does not have the same sharp peak as the other three examples. The speech example has more energy in the frequency range around 10000 Hz than the music examples.
Figure 8. Spectrum plots analysed using a Hanning window with a window size of 1024 samples. The Y-axis shows sound level (dB) and the X-axis shows frequency (Hz). From the top: speech, pop music, classical piece, opera piece.
The features described below are divided into three groups according to [5]: the simpler Standard Low-Level (SLL) features, the Frequency Cepstrum Coefficients and the more advanced psychoacoustic features.

Online systems often use simpler features to avoid having to compute the FFT, which is a relatively costly computation compared to the SLL features. The most commonly used features in these systems are the zero crossing rate and RMS.

2.2.2 Standard Low-Level (SLL) features

These include RMS, Zero Crossing Rate, Spectral Centroid, Spectral Rolloff, Band Energy Ratio, Flux (also called Delta Spectrum Magnitude), bandwidth, pitch and pitch strength.

Some features, like bandwidth and pitch, are self-explanatory. The other tested features are explained in the following sections.
2.2.2.1 RMS

RMS, or Root Mean Square, is a measure of the amplitude of a sound wave within one analysis window. It is defined as

$$RMS = \sqrt{\frac{x_1^2 + x_2^2 + \dots + x_n^2}{n}} \qquad (1)$$

where $n$ is the number of samples within the analysis window and $x_i$ is the value of sample $i$.
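As a minimal illustration (my own sketch, not the code delivered to Swedish Radio), the per-frame RMS of equation (1) could be computed as follows; the function name and signature are assumptions.

```c
#include <math.h>
#include <stddef.h>

/* RMS of one analysis frame, equation (1).
   x: the frame samples, n: the number of samples in the frame. */
double frame_rms(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sqrt(sum / (double)n);
}
```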
2.2.2.2 Zero Crossing Rate

This is a measure that counts the number of times the amplitude of the signal changes sign, i.e. crosses the x-axis, within one analysis window. The feature is defined as

$$ZCR = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathrm{func}\{ s_t s_{t-1} < 0 \} \qquad (2)$$

where $s$ is the sound signal of length $T$ (in samples) and $\mathrm{func}\{A\}$ equals 1 if $A$ is true and 0 otherwise.
The Zero Crossing Rate is sometimes used as a primitive pitch detector for mono signals. It also gives a rough estimate of the spectral content.
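A corresponding sketch of equation (2), again with an assumed signature:

```c
#include <stddef.h>

/* Zero crossing rate of one frame, equation (2).
   s: the frame samples, T: the number of samples (T >= 2 assumed). */
double frame_zcr(const double *s, size_t T)
{
    size_t crossings = 0;
    for (size_t t = 1; t < T; t++)
        if (s[t] * s[t - 1] < 0.0)      /* sign change between consecutive samples */
            crossings++;
    return (double)crossings / (double)(T - 1);
}
```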
2.2.2.3 Spectral Centroid

This feature is effective in describing the spectral shape of the audio, and it is correlated with the psychoacoustic features sharpness and brightness. There are several definitions of the Spectral Centroid in previous work. In this study it is calculated as a weighted mean of the frequencies in the FFT of the signal:

$$SC = \frac{\sum_{n=0}^{N-1} f(n)\,x(n)}{\sum_{n=0}^{N-1} x(n)} \qquad (3)$$

where $x(n)$ represents the magnitude of bin number $n$ and $f(n)$ represents the center frequency of that bin.
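Assuming the FFT magnitudes and bin centre frequencies are already available, equation (3) could be evaluated as in this sketch (names are my own):

```c
#include <stddef.h>

/* Spectral centroid, equation (3).
   mag[n]: magnitude of FFT bin n, freq[n]: centre frequency of bin n,
   N: number of bins. The FFT itself is assumed to be computed elsewhere. */
double spectral_centroid(const double *mag, const double *freq, size_t N)
{
    double num = 0.0, den = 0.0;
    for (size_t n = 0; n < N; n++) {
        num += freq[n] * mag[n];
        den += mag[n];
    }
    return den > 0.0 ? num / den : 0.0;   /* guard against an all-zero spectrum */
}
```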
2.2.2.4 Spectral Rolloff

Like the Spectral Centroid, the Spectral Rolloff is a representation of the spectral shape of a sound, and the two are strongly correlated. It is defined as the frequency below which 85% of the energy in the spectrum lies. If $K$ is the bin that fulfils

$$\sum_{n=0}^{K} x(n) = 0.85 \sum_{n=0}^{N-1} x(n) \qquad (4)$$

then the Spectral Rolloff frequency is $f(K)$, where $x(n)$ represents the magnitude of bin number $n$ and $f(n)$ represents the center frequency of that bin.
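A matching sketch for equation (4), reusing the same assumed magnitude and frequency arrays:

```c
#include <stddef.h>

/* Spectral rolloff, equation (4): the frequency below which 85% of the
   spectral magnitude lies. mag, freq and N as for the centroid above. */
double spectral_rolloff(const double *mag, const double *freq, size_t N)
{
    double total = 0.0, cumulative = 0.0;
    for (size_t n = 0; n < N; n++)
        total += mag[n];
    for (size_t n = 0; n < N; n++) {
        cumulative += mag[n];
        if (cumulative >= 0.85 * total)
            return freq[n];               /* f(K) for the first bin K fulfilling (4) */
    }
    return freq[N - 1];
}
```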
2.2.2.5 Flux (Delta Spectrum Magnitude)

The Flux, or Delta Spectrum Magnitude, is a measure of the rate at which the spectral shape changes, or fluctuates. It is calculated by summing the squared differences between the magnitude spectra of two neighboring frames. This feature has shown good results in the SMD task in [14].

$$F = \sum_{k=1}^{N/2} \left( \left| X_r[k] \right| - \left| X_{r-1}[k] \right| \right)^2 \qquad (5)$$

where $N$ is the number of FFT points and $X_r[k]$ is the STFT of frame $r$ at bin $k$.
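Equation (5) can be sketched as follows, given the magnitude spectra of two consecutive frames (the function name is an assumption):

```c
#include <stddef.h>

/* Spectral flux, equation (5): squared differences between the magnitude
   spectra of the current frame (cur) and the previous frame (prev),
   summed over the half_N = N/2 bins below the Nyquist frequency. */
double spectral_flux(const double *cur, const double *prev, size_t half_N)
{
    double flux = 0.0;
    for (size_t k = 0; k < half_N; k++) {
        double d = cur[k] - prev[k];
        flux += d * d;
    }
    return flux;
}
```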
2.2.3 Frequency Cepstrum Coefficients

The second group is the Frequency Cepstrum Coefficients (FCC), which includes the Mel Frequency Cepstrum Coefficients (MFCC) and the Linear Frequency scaled Cepstrum Coefficients (LFCC). These are all power spectrum representation features calculated with different frequency scales.

The most frequently used coefficients for these systems are the MFCC. They are computed by taking the FFT of every analysis window, mapping the spectrum to the Mel scale, taking the base-10 logarithms of the powers and then applying a Discrete Cosine Transform (DCT) to decorrelate the coefficients. [14]

The overall performance of MFCC features was shown in [5] to be slightly better than that of the SLL features. This relates to the fact that MFCC perform better on pop and rock music, but somewhat worse on classical music, which contains very little vocal information.
2.2.4 Psychoacoustic features

These features are more closely based on our perception of sound, and are therefore called psychoacoustic.

Loudness is the sensation of signal strength, and is primarily a subjective measure by which we rank sounds from weak to strong. Loudness can be calculated (Calculated Loudness) and is then measured in sone. One sone is defined as the loudness of a pure 1000 Hz tone at 40 dB re 20 µPa [21].

Roughness is described in [5] as "the perception of temporal envelope modulations in the range of about 20-150 Hz, maximal at 70 Hz" and is also said to be a primary component of musical dissonance.

Sharpness is a measure of the high-frequency energy in relation to the low-frequency energy. Sounds with a lot of energy in the higher frequencies and low energy levels in the lower frequencies are considered sharp.
2.2.5 Special features

In [1] the authors use a feature called Chromatic Entropy, which is a version of Spectral Entropy. The spectrum is first mapped to the Mel scale and then divided into twelve sub-bands with center frequencies that coincide with the frequencies of the chromatic scale. The energy in each sub-band is then normalized by the total energy of all the sub-bands. Lastly, the entropy of the normalized spectral energy is calculated as

$$E = -\sum_{i=0}^{L-1} n_i \log_2(n_i) \qquad (6)$$

where $n_i$ is the normalized energy of sub-band $i$ and $L$ is the number of sub-bands.
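For illustration, equation (6) could be computed from the normalized sub-band energies as in the following sketch (function name assumed):

```c
#include <math.h>
#include <stddef.h>

/* Entropy of the normalized sub-band energies, equation (6).
   n_energy[i]: energy of sub-band i divided by the total energy,
   L: number of sub-bands (twelve for the chromatic version). */
double band_entropy(const double *n_energy, size_t L)
{
    double e = 0.0;
    for (size_t i = 0; i < L; i++)
        if (n_energy[i] > 0.0)            /* 0 * log2(0) is taken as 0 */
            e -= n_energy[i] * log2(n_energy[i]);
    return e;
}
```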
The feature Modified Low Energy Ratio (MLER) is introduced in [8]. The feature exploits the fact that music shows little variation in the energy contour of the waveform, whilst speech shows large variations between voicing and frication. MLER is defined as the proportion of frames with RMS power less than a variable threshold within one second. The authors suggest that the threshold should be in the interval [0.05%, 0.12%] for best performance.
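A minimal sketch of how MLER might be computed from per-frame RMS values; the threshold factor is left as a parameter here and is not the value used in [8] or in the final algorithm.

```c
#include <stddef.h>

/* Modified Low Energy Ratio over one analysis second.
   rms:      RMS values of the short frames within that second
   n_frames: number of short frames
   factor:   low-energy threshold expressed as a fraction of the mean RMS
   Returns the proportion of frames whose RMS falls below the threshold. */
double mler(const double *rms, size_t n_frames, double factor)
{
    double mean = 0.0;
    for (size_t i = 0; i < n_frames; i++)
        mean += rms[i];
    mean /= (double)n_frames;

    double threshold = factor * mean;
    size_t low = 0;
    for (size_t i = 0; i < n_frames; i++)
        if (rms[i] < threshold)
            low++;
    return (double)low / (double)n_frames;
}
```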
In [4] a feature called Warped LPC-based Spectral Centroid (WLPC-SC) is introduced. The frequency analysis is mapped to the Bark scale, and the centroid frequency is then computed with a one-pole LPC filter. This feature exploits the fact that speech has a low centroid frequency that varies with voiced and unvoiced speech, whilst music shows a different behavior.
2.2.6 Psychoacoustic pitch scales

Psychoacoustic scales are commonly used in Music Information Retrieval (MIR) systems. Speech and music are often well adjusted to our ears and therefore have most of their information in the frequencies where our ears have the best resolution. The most used scale is Mel, but Bark and the Equivalent Rectangular Bandwidth (ERB) are also sometimes used. The one used in this paper, Mel frequency, is defined as
$$f_{mel} = 1127.01048 \times \log\left(\frac{f_1}{700} + 1\right) \qquad (7)$$

where $f_1$ is the original frequency in Hz [1].
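As a one-line illustration of equation (7), using the natural logarithm:

```c
#include <math.h>

/* Mel scale conversion, equation (7): maps a frequency in Hz to Mel. */
double hz_to_mel(double f_hz)
{
    return 1127.01048 * log(f_hz / 700.0 + 1.0);
}
```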
2.2.7 Extracting features

Usually a chosen set of features is extracted from each frame of the audio. The features are then often normalized by the computed mean value and standard deviation over a larger time unit, and then stored in a feature vector.

Features are used in two ways: either the extracted value is used directly, or the changes over time are used. When using the changes over time it is possible to calculate statistical features like variance and standard deviation. In [5] it is shown that using changes over time is more accurate than only using the absolute values of the features.

Only one feature is used in [1], [4] and [7], although they all use the advanced special features described earlier. Others have chosen to use a set of standard features. In [3] five different features are used: energy, ZCR, Spectral Entropy and the first two MFCCs. RMS and ZCR are used in [2].
2.3 Segmentation

In [1] and [4] a region growing technique is used for the segmentation step. This technique is widely used in image segmentation, but can also be used for audio. A number of frames are selected as seeds. The feature vector of each seed is then compared to those of the neighboring frames, and the segment grows with the neighboring frames as long as the difference in the features does not exceed a predefined threshold.

Other systems, like the one in [2], look for big changes between two neighboring 1 second frames. The feature vectors of the two neighboring frames are compared, and if they are sufficiently different, a segment border is detected. When a border is detected in a frame, the transition is marked within the frame with an accuracy of 20 ms.

2.4 Classification methods

When the segmentation process is done, each segment should be classified as either speech or music. In a more complex system, as in [3], more classes can be defined, such as silence or speech over music. The latter is often classed as speech in systems with only the two basic classes.

The extracted feature vector is used to classify each segment. A mean vector is calculated for the whole segment and is then compared either to results from training data or to predefined thresholds.
A method where the classification is based on the output of many frames together is proposed in [10]. Each second consists of 50 frames, and each frame is assigned a class by a quadratic Gaussian classifier. A global decision is then made based on the most frequently appearing class within that second.

2.4.1 Hidden Markov Models

HMMs are commonly used for classification. In [3] they are used together with a Bayesian Network Classifier. The feature vector sequence is used as input to the model, and the model has one state for each class. Only two classes are used in [3], one for speech and one for music. The probabilities of transitions between states are computed on learning data and stored in the model.

In [6] an HMM with 24 states is used for speech, and another model is used for music. Using three or four states could correspond to some phoneme classes, instead of having only one class for speech.
2.4.2 System learning

Different methods are used to train classification models. The Viterbi algorithm is used for HMM training in [11], and the Baum-Welch algorithm is another alternative for the system's learning process.
2.4.3 Refined classification

A method to refine the results of the classification is described in [8]. Four states are used: one for speech, one for music, one for transitions to music and one for transitions to speech. If the classifier outputs a music segment while the system is in the speech state, the segment is stored on a stack. If the classifier keeps outputting music segments for a set time, the state changes to music and all the segments on the stack are classed as music. But if the classifier outputs a speech segment within that time, the system goes back to the speech state and all the segments on the stack are classified as speech. The accuracy of the classification is reported to increase by 6.5% when this method is used.

Refinement can be done in both online and offline systems, but can be made more efficient in the latter, where there are no demands on real-time output. The refinement technique described above needs a certain number of segments to perform well; when there are fewer segments, the performance decreases drastically.

2.5 Evaluation methods

The results are often evaluated, as in [3], with the measures recall, precision and overall accuracy. In [3], recall is defined as the proportion of the frames of a specific class that were correctly classified, and precision is defined as the proportion of the frames classified as a specific class that actually belonged to that class. A total accuracy is then calculated as the total percentage of correctly classified data.
Systems can be optimized for either music or speech to raise the precision for that specific class, although this often decreases the overall accuracy.

2.6 Earlier results

Comparing earlier systems for SMD tasks is not easy. There is no standard database for evaluating them, like there is for speaker and speech recognition systems, which makes it hard to actually rank the existing systems. Most articles report systems with accuracies above 90%, and in [9] an accuracy as high as 98% is reported.
3 Evaluation and tests

This chapter presents the tests done within this work. The results of the tests are evaluated in order to find the best method for the specific task. The test phase was done in three steps: feature and classification tests, segmentation tests, and tests of the complete algorithm, where both classification and segmentation were evaluated. The algorithm tests are presented at the end of chapter 4.

3.1 Test database

A test database was needed to run all the tests. Unfortunately, there is no standardized test database for SMD tasks, which makes the test results harder to compare. However, in this specific case the test database was constructed from material from Swedish Radio broadcasts, to match the actual material that will be used as input when the algorithm is implemented in a production environment.

Material was selected from Swedish Radio's digital archive Digas to cover all kinds of genres. The material is made up of whole programs that were aired on radio and was selected in order to get a wide spread of included material.

Three sets of test audio were selected and extracted from the material. The first set consisted of 30 second long audio files containing only speech or only music. The speech examples included female and male voices, interviews, phone interviews, sports commentary and more; the music examples ranged over many genres from different programs. These were used for the feature tests, to evaluate each feature and calculate the correlation between the features. The second set consisted of 1 minute long audio files containing both speech and music, with at least one transition between classes; the files were selected so that the transition could be anywhere within the file. These were used for the segmentation tests. The third and last set consisted of whole programs containing both music and speech segments, with lengths varying from a couple of minutes up to 90 minutes. These were used to test the complete algorithm.

3.2 Tools

Audacity [15] was used to edit and convert audio files when constructing the test database.

Sonic Visualiser [16], together with various Vamp plugins [17], was used for early analysis of the test database. Sonic Visualiser is developed by the Centre for Digital Music, Queen Mary University of London [18] and is easy to use to get a quick look at how feature values change in audio.
MATLAB was used for testing and evaluation during the whole process. Many of the features were extracted using the MIR Toolbox [19], developed at the University of Jyväskylä [20] in Finland, and the rest of the features were extracted using custom written code.

The final algorithm is written in C because of its efficiency and speed, using the libsndfile library [13] for reading wave files.
3.3 Feature tests

The first test was to check whether the selected features have any significance for the classification task. The tested features were chosen from earlier algorithms that performed the SMD task with good results, and a few were added based on personal hypotheses.

Some initial tests were also done with other features, like flux, other rhythm features and different uses of MFCC; however, since these features did not show any potential for the SMD task, they were not included in these tests.

In these tests the features were evaluated for the classification purpose, and the correlation between features was measured. The chosen features were extracted from the test material in the first set, containing only one class per file, to get a distribution of values for each class. Most features were tested in five different ways: the extracted value, the variance, the standard deviation, the derivative and the standard deviation of the derivative. Both the standard deviation and the variance derive from the same data, since the variance is the square of the standard deviation; because of this, only the standard deviation results are presented. Histograms of the results were then generated to visualize the distributions. Features that showed interesting results are discussed further later in this chapter.

The tested features were RMS amplitude, Zero Crossing Rate (ZCR), Mel Frequency Cepstrum Coefficients (MFCC), Spectral Centroid (SC), Pulse Clarity (PC) and Modified Low Energy Ratio (MLER). Their abbreviations will be used during the rest of this chapter, together with an abbreviation for the way the feature is used, following these naming conventions:

SD   Standard Deviation
D    Derivative
SDD  Standard Deviation of the Derivative

This means that the standard deviation of the Zero Crossing Rate is abbreviated ZCR|SD. A total of 29 different feature variations were tested.
3.3.1 Feature test results

Feature     Frame length   Accuracy
RMS         20 ms          0.639
RMS|SD      1 s            0.829
RMS|D       1 s            0.548
RMS|SDD     1 s            0.764
ZCR         20 ms          0.588
ZCR|SD      1 s            0.837
ZCR|D       1 s            0.550
ZCR|SDD     1 s            0.835
SC          1 s            0.581
SC|SD       30 s           0.792
SC|D        30 s           0.589
SC|SDD      30 s           0.958
MFCC1       20 ms          0.517
MFCC1|SD    1 s            0.548
MFCC1|D     1 s            0.525
MFCC1|SDD   1 s            0.553
MFCC2       20 ms          0.521
MFCC2|SD    1 s            0.548
MFCC2|D     1 s            0.525
MFCC2|SDD   1 s            0.553
MFCC3       20 ms          0.521
MFCC3|SD    1 s            0.548
MFCC3|D     1 s            0.525
MFCC3|SDD   1 s            0.553
MFCC4       20 ms          0.519
MFCC4|SD    1 s            0.548
MFCC4|D     1 s            0.525
MFCC4|SDD   1 s            0.553
MLER        1 s            0.969
PC          5 s            0.807

Table 1. Feature test results. Frame length is the analysis time and accuracy is given as a fraction (percentage/100).
Results of the feature tests are shown in Table 1 above. The accuracy of each feature is a measure of how far apart the speech and the music distributions are. It was calculated by finding the threshold value with the fewest misclassifications, testing all threshold values in an interval specified for each feature. The number of misclassified frames at the best threshold was then counted, divided by the total number of frames and subtracted from 1:

$$A = 1 - \frac{\min(misclass)}{nFrames} \qquad (8)$$
Features with accuracy results close to 50% can be considered random, containing no useful information for the classification. The threshold optimization described above is the reason that all features performed above 50%.
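The threshold search behind the accuracy measure in equation (8) could be sketched as follows; the interval bounds, the number of steps and the handling of classification polarity are my own assumptions, not the exact procedure used in the tests.

```c
#include <stddef.h>

/* Exhaustive threshold search used for the feature accuracy measure,
   equation (8). values[i] is the feature value of frame i and
   is_speech[i] its ground-truth label (1 for speech, 0 for music).
   Thresholds between lo and hi are tried in n_steps steps and the best
   accuracy found is returned. */
double feature_accuracy(const double *values, const int *is_speech,
                        size_t n_frames, double lo, double hi, int n_steps)
{
    size_t best_errors = n_frames;
    for (int s = 0; s <= n_steps; s++) {
        double thr = lo + (hi - lo) * s / n_steps;
        size_t errors = 0;
        for (size_t i = 0; i < n_frames; i++) {
            int classified_speech = values[i] > thr;
            if (classified_speech != is_speech[i])
                errors++;
        }
        /* a feature may separate the classes in either direction,
           so the better of the two polarities is kept */
        size_t err = errors < n_frames - errors ? errors : n_frames - errors;
        if (err < best_errors)
            best_errors = err;
    }
    return 1.0 - (double)best_errors / (double)n_frames;   /* equation (8) */
}
```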
The five features with the highest accuracies were the Modified Low Energy Ratio (97%), the standard deviation of the derivative of the Spectral Centroid (96%), the standard deviation of the Zero Crossing Rate (84%), the standard deviation of the Root Mean Square (83%) and the Pulse Clarity (81%). Surprisingly, none of the MFCC features showed any useful results in these tests; they got the worst results of all features. The extracted values for the MFCC parameters gave accuracies just above 50% with the optimal threshold. Since small variations can depend on the rather small test database, the MFCC features themselves can be regarded as useless for this kind of classification task. The reason that they scored above 50% at all is that the best threshold value is sought, which inflates the accuracy. However, MFCC features can still be used in other ways, as discussed later in this paper.

The correlation between the top features cannot be calculated directly on the frame level, since they use different analysis frame lengths and therefore generate different numbers of data points. The MLER feature uses the pauses in speech to discriminate between the classes, and so does the ZCR, so a high correlation between them is expected. Unfortunately, all of the top features have problems with the same kind of audio, namely speech with background noise of some kind, which is often classified as music. The correlation between the top features has therefore been calculated by comparing the mean values of each feature for every file in the first set of the test database.
          MLER   SC|SDD   ZCR|SD   RMS|SD   PC
MLER      -
SC|SDD    0.93   -
ZCR|SD    0.97   0.97     -
RMS|SD    0.97   0.93     0.97     -
PC        0.87   0.77     0.77     0.83     -

Table 2. Cross-correlation of the top five features from the accuracy tests.

As seen in Table 2, all of the top five features are strongly correlated. The Pulse Clarity feature shows the least correlation with the others, while the standard deviation of the Zero Crossing Rate shows correlations as high as 97% with both the Modified Low Energy Ratio and the standard deviation of the derivative of the Spectral Centroid.
3.3.2 Analysis of Low-level features

The low-level features are interesting because they incur relatively low computation costs. However, both RMS and ZCR showed weak results in the feature tests. RMS performed somewhat better, which can be seen in the histograms below, where large differences between the red and blue distributions indicate good discrimination possibilities. Both RMS and ZCR were extracted for every frame. Speech typically contains many frames with RMS values close to zero, while music has almost no zero-energy frames and a peak at about 0.4. The close-to-zero energy frames in speech represent the pauses between syllables that always exist in speech recordings without background sound. The ZCR values are centered round the same mean value for both classes, but music has a higher peak, i.e. a lower standard deviation. This is also what gives the good results for ZCR|SD and ZCR|V.

Figure 9. Histograms of feature values. Left: RMS. Right: ZCR. Blue shows music and red shows speech. The Y-axis is the number of occurrences in the bins centered around the X-values.

3.3.3 Analysis of MFCC

The MFCC features have been used frequently in MIR algorithms. However, as seen in the figure below, all the MFCC values are centered round the same values for both classes, with only some differences in the standard deviations. As seen in the test results, these variations of the features also scored higher than the extracted values, but still with very low top scores.
Figure 10. Histograms of feature values. Top left: MFCC1. Top right: MFCC2. Bottom left: MFCC3. Bottom right: MFCC4. Blue shows music and red shows speech. The Y-axis is the number of occurrences in the bins centered round the X-values.

3.3.4 Analysis of Modified Low Energy Ratio

As seen in Table 1, the absolute values of the features are often not a good way to discriminate between the two classes. A better way can be to use the feature changes over time, as is done by the MLER feature.

Figure 11. MLER. Left: threshold optimization; the Y-axis shows the percentage of correctly classified frames and the X-axis shows the threshold. Right: histogram with threshold 0.26. Blue shows music and red shows speech. The Y-axis is the number of occurrences in the bins centered round the X-values.
Figure 11 shows the Modified Low Energy Ratio as described earlier in the report. When using a low-energy threshold of 0.26 of the mean RMS value, the classification achieved 97% accuracy. If the MLER equals zero, the frame is classified as music; otherwise it is classified as speech. Almost all the speech frames with zero MLER come from sports commentary with an audience in the background. The left graph shows the optimization of the total error rate; in some applications it might be better to optimize for either speech or music.

3.4 Segmentation tests

The segmentation tests were done on the audio files containing at least one transition between classes. The exact positions of the transitions within the files were noted as ground truth for the tests. These positions are not absolute, and a small test showed that different people would place them differently. Sometimes a transition can be a segment of silence up to half a second long, and it would be acceptable for the segmentation to place the transition anywhere in this silence, since silence was not defined as either speech or music. For these two reasons, a deviation of up to 100 ms was allowed and counted as a clean hit. A deviation of 0.1-1 second counted as a hit, and the distance from the border of the clean-hit segment was calculated. If the segmentation marked a transition more than 1 second away from the ground truth, it counted as a miss.

Three measures were then used to evaluate the segmentation techniques:
Hit efficiency: a measure of how many hits were found. The misses are subtracted from the hits, and the difference is divided by the total number of transitions.

$$Hit_{eff} = \frac{Hits - Misses}{Transitions} \qquad (9)$$
Hit accuracy: an accuracy measure where only the hits are considered. All the distances are added, and a mean value is calculated by dividing by the total number of hits. A clean hit counts as zero distance.

$$Hit_{acc} = \frac{\sum_{n=1}^{N} \left| Hitpos_n - Transpos_n \right|}{N} \qquad (10)$$

where $N$ is the number of hits.
Hit rate: a simple measure where, once again, only the hits are considered. The total number of hits is divided by the total number of transitions in the test files.

$$Hit_{rate} = \frac{Hits}{Transitions} \qquad (11)$$
None of these measures considers the computation times; these are discussed in the test evaluations below.

In the first part of the test, the region growing technique was tested against the neighboring difference technique, both described in section 2.3. Both techniques were tested with the same features.

Technique                Hit efficiency   Hit accuracy   Hit rate
Region growing           81%              79 ms          100%
Neighboring difference   94%              44 ms          99%

Table 3. Segmentation technique test results.
As seen in Table 3, the neighboring difference technique achieved better results in both hit efficiency and hit accuracy. The hit rate measure is harder to evaluate, but results as close to 100% as possible are good. The neighboring difference technique failed to detect 1% of the transitions but, as can be seen in the hit efficiency measure, it also produced fewer misses. A choice was made to pursue only the neighboring difference technique because of its efficiency and accuracy.

Feature   Hit efficiency   Hit accuracy   Hit rate
MLER      95%              19 ms          99%
SC|SDD    93%              70 ms          99%
ZCR|SD    88%              24 ms          99%
RMS|SD    87%              27 ms          99%
PC        88%              72 ms          99%

Table 4. Segmentation feature test results, showing the top five features.

In the second test, all the features from the feature tests were tested with the neighboring difference technique. The same kinds of features performed well in these tests too, although features with short analysis windows and short frames performed better and showed better hit accuracy results. Table 4 shows that the top features all detected 99% of the transitions, but the MLER feature achieved the best results in both the hit efficiency and the hit accuracy measures. Further tests were then done with the MLER feature; both the efficiency and the accuracy improved when shorter frame lengths were used together with a combination of the mean RMS and the variance of the RMS. A hit efficiency of 96% and a hit accuracy of 17 ms were then achieved.
3.5 Design criteria

There are many aspects to consider when choosing the features to use in the algorithm. The test results need to be thoroughly analyzed with regard to these aspects:

• Accuracy of the feature, i.e. its ability to discriminate between the two classes, needed for both classification and segmentation. Ideal features have similar values within one class and a clear difference from other classes.
• Computational costs. Costly operations should be avoided to make the discrimination task efficient and cheap. Even offline systems need to be fast to improve cost efficiency.

• Correlation between chosen features. The use of two different features needs to be motivated by improved results. If two features are highly correlated, the improvement in accuracy will be relatively costly.

• Insensitivity to noise in the input signal.

The MLER feature was chosen for its excellent accuracy and because it combines well with the neighboring difference segmentation technique. Both the segmentation and the classification are based on one single low-level feature, the RMS amplitude, which makes the computational costs very low. Since the high-performing features were all highly correlated, adding another feature was not motivated: the improvement in accuracy would be costly, and another classification method would be needed. When the RMS amplitude is normalized to the file maximum it is also insensitive to sound levels, but background noise will still be a problem in the classification.

The three measures calculated in the segmentation tests were considered when choosing the segmentation method. The hit rate was considered the most important, since misses (transition detections where no transition is within one second) can be discarded in an algorithm that uses segmentation refinement methods. The hit efficiency was only considered when two tests showed the same hit rate. The computation times, too, were considered when choosing the final method for the algorithm.

The neighboring difference technique showed better results for all features and was therefore the strongest candidate. It is also a good match when the MLER is used for the classification task, since it achieved good hit accuracy results when using RMS. Computation times are greatly reduced, since both the classification and the segmentation use the same extracted features.
4 Algorithm

This chapter contains a detailed description of the final algorithm that was coded and delivered to Swedish Radio. The algorithm is written in C and uses the libsndfile library [13] to read and write wave files. The discrimination is done on audio files, and hence this is an offline procedure.

4.1 Signal preprocessing

Before anything else is done with the audio, a few preprocessing steps are performed on the audio signal.

There is no added information in the difference between two channels that can be used for the classification or the segmentation. It is therefore desirable to have a mono signal, which simplifies later processing. The algorithm checks the number of channels of the audio; if the signal has more than one channel, it is mixed down to mono.

The amplitude of the signal is then normalized to the maximum amplitude of the whole file, to remove any effect the overall amplitude level might have on the feature extraction.
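A sketch of the two preprocessing steps described above (channel mixdown and peak normalization); the buffer layout and function name are assumptions, and the delivered algorithm reads its input through libsndfile.

```c
#include <math.h>
#include <stddef.h>

/* Mix an interleaved multi-channel buffer down to mono and normalise the
   result to its peak amplitude.
   in:   interleaved input samples (n_frames * channels values)
   mono: output buffer with room for n_frames samples */
void preprocess(const float *in, size_t n_frames, int channels, float *mono)
{
    float peak = 0.0f;
    for (size_t i = 0; i < n_frames; i++) {
        float sum = 0.0f;
        for (int c = 0; c < channels; c++)
            sum += in[i * channels + c];
        mono[i] = sum / (float)channels;        /* average the channels */
        if (fabsf(mono[i]) > peak)
            peak = fabsf(mono[i]);
    }
    if (peak > 0.0f)
        for (size_t i = 0; i < n_frames; i++)
            mono[i] /= peak;                    /* normalise to the file maximum */
}
```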
4.2 Feature extraction

After the audio signal has gone through the preprocessing part, it is split into 21 ms non-overlapping frames. The RMS amplitude is then calculated for each frame using equation (1).

Once the RMS amplitude has been calculated, the frames are grouped together to form 1 second (48 short frames) analysis frames, which are also non-overlapping. Four features based on the RMS amplitude values are then extracted from each 1 second frame: the mean RMS, the variance of the RMS, a locally normalized variance of the RMS and the Modified Low Energy Ratio (MLER). The normalized variance is the variance of the RMS divided by the mean RMS.

The Modified Low Energy Ratio is the proportion of low-energy short frames within the 1 second frame. The threshold for low energy depends on the mean RMS amplitude: the mean RMS amplitude in the analysis frame is multiplied by a predefined value determined from the test results. Speech contains many small pauses between syllables and words and therefore has a higher MLER than music; this is used later to classify each segment.
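The four per-second features could be computed from the short-frame RMS values roughly as follows; the struct, the function name and the free MLER threshold factor are my own, and this is only a sketch of the description above.

```c
#include <stddef.h>

typedef struct {
    double mean_rms;      /* mean RMS over the 1 s analysis frame        */
    double var_rms;       /* variance of the RMS                         */
    double norm_var_rms;  /* variance divided by the mean (local norm.)  */
    double mler;          /* proportion of low-energy short frames       */
} SecondFeatures;

/* Compute the four RMS-based features for one 1-second analysis frame.
   rms: RMS values of the short (~21 ms) frames within this second,
   n:   number of short frames (48 in the thesis),
   mler_factor: low-energy threshold as a fraction of the mean RMS. */
SecondFeatures extract_second_features(const double *rms, size_t n,
                                       double mler_factor)
{
    SecondFeatures f = {0.0, 0.0, 0.0, 0.0};
    for (size_t i = 0; i < n; i++)
        f.mean_rms += rms[i];
    f.mean_rms /= (double)n;

    double threshold = mler_factor * f.mean_rms;
    size_t low = 0;
    for (size_t i = 0; i < n; i++) {
        double d = rms[i] - f.mean_rms;
        f.var_rms += d * d;
        if (rms[i] < threshold)
            low++;
    }
    f.var_rms /= (double)n;
    f.norm_var_rms = f.mean_rms > 0.0 ? f.var_rms / f.mean_rms : 0.0;
    f.mler = (double)low / (double)n;
    return f;
}
```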
4.3 Segmentation

The task of the segmentation part of the algorithm is to find the exact position of transitions between the two classes. The segmentation, just like the classification, is based on the features extracted earlier. Every 1 second frame is examined for candidate transition frames, and an exact position of the transition is then found.

This is a modified version of the segmentation done in [2]. The advantage of this method is that it uses the same base feature, the RMS amplitude, as the classification. This means that no further feature extraction is needed, which saves computation time and minimizes the reading of the audio files. The RMS amplitude is used in a different way, since the MLER feature used for classification requires longer analysis frames, while the mean and variance of the RMS change as quickly as the audio classes do.

The first step is done by looking at the frame before and the frame after the examined frame. If the two neighboring frames are sufficiently different, the examined frame is likely to contain a transition. The transition can be anywhere within that second, and the exact position is determined in the next step. A problem occurs if the class changes twice within the examined frame; the two neighboring frames will then not differ enough for the frame to be chosen as a transition candidate. However, these kinds of errors are not corrected in the segmentation, since segments shorter than 2 seconds will be removed later.

Figure 12. Waveform with a transition from speech to music. Seconds are marked with vertical lines.
The comparison between neighboring frames is based on the mean and the variance of the RMS values. Since the distribution of the amplitude of the audio signal is Laplacian, as shown in [2], a probability density function of the $\chi^2$ type is used, defined as

$$p(x) = \frac{x^a e^{-x/b}}{b^{a+1}\,\Gamma(a+1)} \qquad (12)$$
where $x \geq 0$, $\Gamma$ is the gamma function and the two parameters $a$ and $b$ are defined as

$$a = \frac{\mu^2}{\sigma^2} - 1 \quad \textrm{and} \quad b = \frac{\sigma^2}{\mu} \qquad (13)$$
where $\mu$ is the mean RMS and $\sigma^2$ is the RMS variance. The similarity measure is based on the probability density functions of the two frames:

$$p(p_1, p_2) = \int \sqrt{p_1(x)\,p_2(x)}\,dx \qquad (14)$$
where $p_1$ and $p_2$ refer to the probability density functions of each frame. When the $\chi^2$ distribution in equation (12) is inserted into equation (14), this gives a similarity measure calculated as
$$p(p_1, p_2) = \frac{\Gamma\!\left(\frac{a_1+a_2}{2}+1\right)}{\sqrt{\Gamma(a_1+1)\,\Gamma(a_2+1)}} \cdot \frac{2^{\frac{a_1+a_2}{2}+1}\; b_1^{\frac{a_2+1}{2}}\; b_2^{\frac{a_1+1}{2}}}{\left(b_1+b_2\right)^{\frac{a_1+a_2}{2}+1}} \qquad (15)$$
Since the measure is calculated from the two frames surrounding the examined frame $i$, the dissimilarity measure is defined as

$$Dissim(i) = 1 - p(p_{i-1}, p_{i+1}) \qquad (16)$$

This will give high probabilities of change even for the surrounding frames. As seen in Figure 12, out of the five seconds marked by the vertical lines, seconds 2 and 4 will differ most in the mean and variance of the RMS. However, seconds 3 and 5 will also give high values in the dissimilarity measure. To dampen the effect of this error, a filter needs to be applied: the dissimilarity value is locally normalized over 5 seconds, with the examined frame in the centre. The normalization is calculated as
$$Dissim_{norm}(i) = \frac{Dissim(i) \times \max\!\left(0,\; D(i) - \frac{D(i-2)+\dots+D(i+2)}{5}\right)}{\max\left(D(i-2),\dots,D(i+2)\right)} \qquad (17)$$

where $D(j)$ is shorthand for $Dissim(j)$.
The dissimilarity measure is thus multiplied by the positive difference between the examined frame and the mean of the neighborhood (if the difference is negative, it is set to zero), and the result is divided by the maximum value in the neighborhood.

A threshold for the normalized dissimilarity value is then set according to the results on the test material, to determine which frames are selected as transition candidates. The threshold is variable and depends on the variance of the RMS in the neighboring frames.
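For illustration, equations (12), (13) and (15) can be evaluated in the log domain to avoid overflow in the gamma functions; this is a sketch with an assumed signature, not the delivered code.

```c
#include <math.h>

/* Similarity between two neighbouring 1 s frames, equations (12)-(15),
   computed with lgamma() for numerical stability.
   mu1, var1 and mu2, var2 are the mean and variance of the RMS in the
   frame before and the frame after the examined frame. */
double frame_similarity(double mu1, double var1, double mu2, double var2)
{
    double a1 = mu1 * mu1 / var1 - 1.0;   /* equation (13) */
    double b1 = var1 / mu1;
    double a2 = mu2 * mu2 / var2 - 1.0;
    double b2 = var2 / mu2;

    double c = 0.5 * (a1 + a2) + 1.0;     /* recurring exponent (a1+a2)/2 + 1 */
    double log_p = lgamma(c)
                 - 0.5 * (lgamma(a1 + 1.0) + lgamma(a2 + 1.0))
                 + c * log(2.0)
                 + 0.5 * (a2 + 1.0) * log(b1)
                 + 0.5 * (a1 + 1.0) * log(b2)
                 - c * log(b1 + b2);
    return exp(log_p);                    /* equation (15); Dissim = 1 - p, eq. (16) */
}
```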
When the candidate transition frames have been chosen, an exact position for the transition needs to be found. This is done in a way similar to the previous step: for every 20 ms frame, the preceding and the following second of audio are compared. A value is given to every 20 ms frame for the probability of change, and the frame with the highest probability is marked as the exact position of the transition.

4.4 Classification

Only the MLER feature is used for the classification. A predefined threshold is set; all segments with an average MLER above the threshold are classed as speech, and everything below the threshold is classed as music. The use of only one feature reduces the computation time, and the algorithm needs no training material to work, since the threshold is set according to the results on the test material.
4.5 Refinement

A simple refinement of the segments is done after the classification: if two consecutive segments are given the same class, they are merged and the transition between them is erased.
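A sketch of the merge step, assuming the segments are held in parallel arrays of class labels and start times (the representation is my own):

```c
#include <stddef.h>

/* Merge consecutive segments with the same class: a segment's start time
   is kept only when its class differs from that of the previous segment.
   seg_class and seg_start have n entries; the new segment count is returned. */
size_t merge_segments(int *seg_class, double *seg_start, size_t n)
{
    if (n == 0)
        return 0;
    size_t kept = 1;
    for (size_t i = 1; i < n; i++) {
        if (seg_class[i] != seg_class[kept - 1]) {
            seg_class[kept] = seg_class[i];
            seg_start[kept] = seg_start[i];
            kept++;
        }
    }
    return kept;
}
```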
4.6 Output

When all the processes (segmentation, classification and refinement) are done, the results are ready to be output. The algorithm creates a simple text file with one line for each segment. Each line contains the exact position of the start of the segment and a binary number showing the class of the segment. The lines can look like this:

1 – 0.000000
0 – 9.770833
1 – 27.098166

where the 0 stands for speech (1 for music) and 9.770833 is the exact position of the transition measured in seconds. The position is given in seconds for easier handling by other applications; if it were marked as an exact frame index, a third-party application would also have to know the sample frequency of the audio. The transition frame is easily calculated by
$$F = t \times sr \qquad (18)$$

where $t$ is the position of the transition measured in seconds and $sr$ is the sample rate of the audio.
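The output line format and the frame conversion of equation (18) might look like this in code; the exact field separator and number formatting are assumptions.

```c
#include <stdio.h>

/* Write one output line per segment: the class (0 for speech, 1 for music)
   followed by the start position of the segment in seconds. */
void write_segment(FILE *out, int is_music, double start_seconds)
{
    fprintf(out, "%d - %f\n", is_music, start_seconds);
}

/* Recover the transition frame index from the time stamp, equation (18). */
long transition_frame(double t_seconds, long sample_rate)
{
    return (long)(t_seconds * sample_rate);
}
```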
A Flash player that uses the output data can then read the text file and mark the segments in the navigation bar, as seen in Figure 13. This could be used as a guide when editing the audio, or simply to look for errors in the output.

Figure 13. Flash player with marked segments in the navigation bar. Green shows speech segments and white shows music segments.

The Broadcast Wave Format contains a marker chunk, which could be used to mark the transition points of the segments. This is not yet integrated in the application, but could be done for specific implementations.
4.7 Results

The algorithm tests were performed on the finished algorithm. Audio files containing full-length programs from Swedish Radio were used as input, and the segmentation and the classification were evaluated at the same time. The resulting accuracy is the percentage of time where the right class is found at the right time: the length of the correctly classified audio is divided by the total length of the program.

         Speech   Music
Speech   95.4%    4.6%
Music    1.9%     98.1%

Table 5. Algorithm results. The left column shows the input class and the top row shows the output of the algorithm.
The left column of the table shows the class of the input audio, and the top row shows the class output by the algorithm. Speech reaches a lower accuracy because of the sports commentary segments, which are often classed as music, while music is correctly classed as music 98.1% of the time.

        Right    Wrong
Total   97.3%    2.7%

Table 6. Summarized results for all inputs to the algorithm.

This gives a total accuracy of 97.3%, since the test material contained more music than speech.

The computation time of the algorithm varies depending on the format of the input and the number of input channels. The computation time did not exceed 1% of the length of the audio for any of the test files.
5 Conclusions

The final algorithm gives an accuracy of over 97% in the tests performed with material from Swedish Radio. This matches the results reported in earlier work, yet without any advanced features that require long computation times. 97% is also enough for most applications. In some applications the misclassified audio still needs to be considered; since the tests show what kind of audio gives low accuracies, such files can be discriminated manually.

A graphical interface where the segments are marked could be of good use for manual work with the audio. The algorithm does the discrimination job, but the results might still need refinement. This is true for an application for automatic editing of podcast material, where the results of the discrimination can be used as a guide, but some editing, like cross-fades, is still needed to get a good-sounding result.

Offline systems benefit most from faster computation times. Real-time, online systems still have to use the same analysis window length to gather the needed statistics, and can benefit only by being able to use more advanced features. Offline systems, on the other hand, can both use more advanced features to achieve higher accuracy and at the same time compute faster.
6 Future work

There are still endless features and feature combinations to be tested for this task. The features tested during this work are still quite simple; even the modified special features, like the one used in the algorithm, are based on only one low-level feature. Further tests could also be done with MFCC features, even though they showed poor results in the present tests. Combinations of different MFCC features have been tested in earlier work, and further exploring relations between different MFCCs, such as distances between the second and third coefficient, could give good results.

More complex features, like pitch curve extraction, have been discussed and tentatively tested, but there is still much to explore in this area. Pitch features should be a good complement to the MLER feature, since they work on different aspects of the sound. Features based on rhythm could also be a good complement to the MLER feature.

Adding subclasses to both the music and the speech class would be useful for many applications. Swedish Radio has already requested the possibility to discriminate between male and female speakers. Music could be split into genre sub-classes for further discrimination. There is already some work done on genre classification, but more research is needed to create a useful application.

Constructing a low-cost real-time discriminator that could be built into car radio receivers could be another kind of project. Listening in cars often involves noisy environments where speech needs to be amplified more than music to increase audibility, and different listening environments demand different amplification of speech. In such an application, the processing cannot be done in the transmitter and needs to be done in real time.
7 Acknowledgements

This work has been done with much help from both Swedish Radio and the Royal Institute of Technology. Special thanks to:

My supervisor at Swedish Radio, Björn Carlsson, for all the help with technical questions about radio and for all the encouragement during the work.

My supervisor at the Royal Institute of Technology, Anders Friberg, for all the feedback and ideas, and for sharing his expertise in feature extraction and classification.

Hasse Wessman and Lars Jonsson at Swedish Radio, Technical Development, for giving me the opportunity to do this project and for an inspiring working environment.

The rest of the staff at Swedish Radio, Technical Development, for feedback and motivation.

My friend John Häggkvist, who helped me out during the coding of the algorithm.
8 References

[1] Pikrakis, A., Giannakopoulos, T. & Theodoridis, S. "A computationally efficient speech/music discriminator for radio recordings", University of Victoria, ISBN 1-55058-349-2, pp. 107-110, 2006.

[2] Panagiotakis, C. & Tziritas, G. "A Speech/Music Discriminator Based on RMS and Zero-Crossings", IEEE Transactions on Multimedia, vol. 7(1), pp. 155-166, Feb. 2005.

[3] Pikrakis, A., Giannakopoulos, T. & Theodoridis, S. "Speech/Music Discrimination for radio broadcasts using a hybrid HMM-Bayesian Network architecture", in Proc. of the 14th European Signal Processing Conference (EUSIPCO-06), September 4-8, 2006, Florence, Italy.

[4] Muñoz-Expósito, J.E., Garcia-Galán, S., Ruiz-Reyes, N., Vera-Candeas, P. & Rivas-Peña, F. "Speech/Music discrimination using a single warped LPC-based feature", in Proceedings of the International Symposium on Music Information Retrieval, 2005.

[5] McKinney, M.F. & Breebart, J. "Features for Audio and Music Classification", in Proceedings of the International Symposium on Music Information Retrieval, 2003.

[6] Karnebäck, S. "Speech/Music Discrimination Using Discrete Hidden Markov Models", TMH-QPSR, KTH, vol. 46, pp. 41-59, 2004.

[7] Alnabadi, M.S. "Real Time Speech Music Discrimination Using A Single Feature", Durham University School of Engineering, 2007.

[8] Wang, W.Q., Gao, W. & Ying, D.W. "A Fast and Robust Speech/Music Discrimination Approach", in Proceedings of the International Conference on Information, Communications and Signal Processing, 2003.

[9] Saunders, J. "Real-time discrimination of broadcast speech/music", in Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, 1996.

[10] El-Maleh, K., Klein, M., Petrucci, G. & Kabal, P. "Speech/Music discrimination for multimedia applications", ICASSP 2000, 2000.

[11] Ajmera, J., McCowan, I. & Bourlard, H. "Speech/Music segmentation using entropy and dynamism features in a HMM classification framework", Speech Communication 40, pp. 351-363, 2003.

[12] Swedish Radio web page, http://www.sr.se/sida/default.aspx?ProgramId=2438. Retrieved 6/1 2010.

[13] libsndfile web page, http://www.mega-nerd.com/libsndfile/. Retrieved 6/1 2010.

[14] Burred, J.J. "An Objective Approach to Content-Based Audio Signal Classification", Technische Universität Berlin, 2003.

[15] Audacity web page, http://audacity.sourceforge.net/. Retrieved 24/2 2010.

[16] Sonic Visualiser web page, http://www.sonicvisualiser.org/. Retrieved 24/2 2010.

[17] Vamp plugins web page, http://vamp-plugins.org/. Retrieved 24/2 2010.
[18] Centre for Digital Music web page, http://www.elec.qmul.ac.uk/digitalmusic/. Retrieved 24/2 2010.

[19] MIR Toolbox web page, https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox. Retrieved 24/2 2010.

[20] University of Jyväskylä web page, https://www.jyu.fi/en/. Retrieved 24/2 2010.

[21] Leijon, A. "Sound Perception: Introduction and Exercise Problems". Royal Institute of Technology, Stockholm, 2007.