
Learning-Based Indoor Localization for Industrial Applications

Draft

Code repository at: github.com/HendrikLaux/sound-localization


Hendrik Laux
ICE, RWTH Aachen University
Aachen, Germany
[email protected]

Andreas Bytyn
ICE, RWTH Aachen University
Aachen, Germany
[email protected]

Gerd Ascheid
ICE, RWTH Aachen University
Aachen, Germany
[email protected]

Anke Schmeink
ISEK, RWTH Aachen University
Aachen, Germany
[email protected]

Gunes Karabulut Kurt
Istanbul Technical University
Istanbul, Turkey
[email protected]

Guido Dartmann
Trier University of Applied Sciences
Trier, Germany
[email protected]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CF '18, May 8-10, 2018, Ischia, Italy
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-5761-6/18/05...$15.00
https://doi.org/10.1145/3203217.3203227

ABSTRACT

Modern process automation and the industrial evolution heading towards Industry 4.0 require a huge variety of information to be fused in a Cyber-Physical System. Important for many applications is the spatial position of an arbitrary object, given directly or indirectly in terms of data that has to be processed to obtain position information. The starting point for the technical reflection-based sound localization system presented in this paper is the biological role model of humans, who are able to learn how to localize sound sources. Compared to other forms of sound localization, this nature-inspired method needs neither high spatial and temporal accuracy nor big microphone arrays. Possible applications for this system are indoor robot localization or object tracking.

KEYWORDS

Machine Learning for IoT, Sound Localization, Support Vector Machines, Room Acoustics

ACM Reference Format:
Hendrik Laux, Andreas Bytyn, Gerd Ascheid, Anke Schmeink, Gunes Karabulut Kurt, and Guido Dartmann. 2018. Learning-Based Indoor Localization for Industrial Applications. In CF '18: Computing Frontiers Conference, May 8-10, 2018, Ischia, Italy. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3203217.3203227

1 INTRODUCTION

Simple and robust localization methods are a key feature of Industry 4.0 applications. Typical applications are the localization of goods in a warehouse and the localization of smart transport robots and drones within a building. Multiple approaches exist to localize machines and goods inside buildings, e.g., localization with radio waves [1] or acoustic localization with microphone arrays [2]. This paper presents a novel, yet simple nature-inspired approach for the localization of objects within closed rooms.


In an experiment performed by Paul M. Hofman, Jos G. A. Van Riswick and A. John Van Opstal [3], a subject group was asked to localize sound sources in the dark, which, as anticipated, caused little trouble. After the subjects were equipped with a small plastic strip in their outer ears, the result was not nearly as accurate as before. The modification of the sound path from source to inner ear changed the received input in a way the test persons were not used to. After a few weeks of daily routine with the plastic strip, the subject group performed the same experiment again. Although the result was still not as good as without the modification, the accuracy increased notably compared to the first try. It follows that the localization of sound sources can be learned.

The ability to localize sound sources by ear is one of the most important tasks of our sense of hearing. While the sense of sight struggles in dark environments and identifying objects by touch is limited to a very short range, the localization of sound sources was an important ability for early mankind to survive in a hostile environment.

1.1 Contribution

This paper introduces the biological process of sound localization and transfers its individual steps into a technical model to obtain a new kind of localization system that is able to track an object with one or two microphones. This is achieved at the expense of additional computational complexity due to the necessity of learning the environment's spatial acoustics by means of support vector machines (SVMs) and a principal component analysis (PCA). Our work is inspired by the experiment of Paul M. Hofman, Jos G. A. Van Riswick and A. John Van Opstal [3] and the learning effect of the human brain in case of a modified outer ear channel.

A big advantage of localization utilizing sound waves lies in the speed of sound being about six orders of magnitude lower than the speed of light, which has to be dealt with in a radio-wave localization scenario. Compared to other known forms of sound localization like triangulation, our reflection-based method comes with lower hardware requirements, both quantitatively and qualitatively, as well as with a trade-off between training effort and the resulting localization accuracy. The developed system concept is evaluated by acoustic channel simulations and a real-world field test.

1.2 Related Work

Few researchers have used machine learning for sound processing or, more specifically, for localizing sound sources with a single microphone.


Ashutosh Saxena and Andrew Y. Ng [4] describe the approach of using an artificial pinna (outer ear) to distort sound in a direction-dependent way, as its human role model does. Pinnas with a broadly varying impulse response for different directions of incoming sound are considered the most suitable ones. They further use a hidden Markov model (HMM), a form of dynamic Bayesian network, to predict probabilities for certain directions. What differs from our approach is the system output. While [4] aims at providing the direction of the incoming sound without information about the distance or the exact position, the method described in this work actually localizes the sound source by determining its absolute position in the room.

Separating voice from background music is another suitable task for machine learning in terms of processing audio signals. Reference [5] uses a convolutional deep neural network that is able to learn what 'vocal' sounds are, compared to instrumental background noise.

Most approaches to sound localization still rely on the use of a microphone array, as found, for example, in [6], [7] or [8].

2 FUNDAMENTALS

2.1 Sound Propagation

Sound waves are usually longitudinal density and pressure fluctuations in a gaseous, liquid or solid medium. Their behavior is described by the acoustic wave equation [9]:

\frac{\partial^2 p(x,t)}{\partial x^2} - \frac{1}{c_s^2} \frac{\partial^2 p(x,t)}{\partial t^2} = 0, \quad (1)

where p(x,t) describes the sound pressure level for a certain time and place. The solution of the partial differential equation (1) consists of two components f and g for the sound pressure of a plane wave propagating in the +x or -x direction [9]:

p(x,t) = f(x - c_s t) + g(x + c_s t), \quad (2)

where c_s is the speed of sound.

2.2 Room Acoustics

Important for the theoretical considerations in this work is the reflection of sound waves. Sound propagating in the direction of an obstacle is either reflected or absorbed, where absorption means the conversion of mechanical oscillation energy into heat. Assuming a plane sound wave propagating perpendicular to a wall at x = 0, reflection is described by the complex reflection factor

r = |r| \cdot e^{i\gamma} = \frac{p_R(x,t)}{p_P(x,t)}, \quad (3)

given by the ratio between the propagating p_P(x,t) and reflected p_R(x,t) sound pressure, with magnitude |r| and phase \gamma, assuming steady-state conditions.

Furthermore, the absorption rate is defined as the ratio between the incoming sound intensity and the intensity that is not coming back:

\alpha = \frac{\text{non-returning intensity}}{\text{incoming intensity}} = 1 - |r|^2. \quad (4)

In addition, sound waves propagating through the environment are influenced by diffraction, the phenomenon of new elementary waves arising when sound reaches an obstacle or a slit, according to

the principle of Huygens-Fresnel, and by shadowing, the absence of sound waves behind a big obstacle.

Figure 1: Typical RIR (small room, high reflection factor). [Plot of amplitude over time (0 to 0.2 s), showing the direct sound, the early reflections and the fading reverberation.]

Assuming all influences on the propagation of sound waves to be linear and time-invariant, they are completely described by a room impulse response (RIR), which depends on the position of both sound transmitter and receiver as well as on the geometrical and acoustical properties of the room. Convolving a sound with the RIR for a certain source and recording position artificially provides the sound that a listener would hear at the receiving position, coming from an ideal sound source at the sender's position.

The frequency-domain representation of the RIR is called the room transfer function (RTF). Transfer function and impulse response can easily be transformed into each other by the (inverse) Fourier transform (in the case of a linear, time-invariant system) and therefore contain the same amount of information. A schematic RIR is shown in Fig. 1. The direct, unreflected sound is the first to reach the listener, followed by the early, distinguishable reflections, which turn into a stochastic, fading reverberation.

The RIR can be modeled as a series of K scaled impulses, typically with decreasing amplitude over time:

g(t) = \sum_{k=1}^{K} a_k \, \delta(t - \tau_k), \quad (5)

where a_k is the magnitude of the impulse shifted by \tau_k. The RIR is influenced by the room properties (size and the walls' reflection factor) as well as by the positions of both sound emitter and receiver. It can be obtained either by room acoustic measurements or by simulation. The simulations in this work use a special toolbox [11] for the numerical computing environment MATLAB to calculate the discrete RIR g(k).
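To make the impulse-train model (5) concrete, the following minimal Python/NumPy sketch (our own illustration; the delays, magnitudes and sample rate are assumed toy values, not toolbox output) builds such an RIR, convolves a stimulus with it as described above, and obtains the RTF via the Fourier transform:

```python
import numpy as np

fs = 44100                                   # assumed sample rate (Hz)
rng = np.random.default_rng(0)

# Eq. (5): the RIR as K scaled, delayed impulses with decaying amplitude.
K = 12
tau = np.sort(rng.uniform(0.002, 0.08, K))   # reflection arrival times (s)
a = 0.8 ** np.arange(1, K + 1)               # geometrically decaying magnitudes

g = np.zeros(int(0.1 * fs))                  # discrete RIR g(k)
g[0] = 1.0                                   # direct sound
g[(tau * fs).astype(int)] = a                # early reflections

# The sound a listener would record at the receiver position: the emitted
# signal convolved with the RIR of this source/receiver configuration.
s = np.zeros(2048)
s[:64] = 1.0                                 # simple rectangular pulse
h = np.convolve(s, g)

# The room transfer function (RTF) is the Fourier transform of the RIR.
G = np.fft.rfft(g)
```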

2.3 Hearing

Incoming sound waves from different locations are modified in the time and frequency domain by certain body parts like the shoulders and head, but also (especially regarding higher frequencies) by the characteristic form of the outer ear (pinna) and the ear canal (monaural cues) [12]. All these influences can be modeled as a direction-selective filter expressed by a family of head-related transfer functions (HRTFs) that vary for every single position of sound transmitter and receiver. Additional features exist if the HRTFs for both ears are given (binaural transfer function). The interaural time difference (ITD) and the interaural level difference (ILD) provide information about sound delays and intensity differences between both ears. As described by the duplex theory [13], the ILD and ITD are the most important features for the human ability to localize sound sources on the horizontal plane.

2.4 Learning

The underlying principle of machine learning is the same as for the learning process of every sophisticated form of life on earth: decision making based on prior experience. A child touching a hot stove and experiencing the consequences of its behavior will most likely not touch hot surfaces again. A dog experiencing certain occurrences during lunchtime will associate these with the presence of food after a period of learning. This effect is known as 'classical conditioning', as studied by Ivan Pavlov in his famous experiment [14]. Learning, seen from the technical point of view, is a form of pattern recognition in which the input data is mapped to one of two or more categories.

2.4.1 Notation. Our data set consists of N pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) of the p-dimensional input vector x = (x_1, x_2, ..., x_p) \in X and the associated target output (label) y \in Y. Assuming a pattern to exist, there is an unknown target function f: X \to Y perfectly mapping the input to the output space. To approximate this unknown function, different kinds of hypotheses g_i: X \to Y are available. The process of learning is to find the unknown parameters of such a hypothesis that are assumed to approximate the target function best according to a certain error criterion.

2.4.2 Forms of Machine Learning. Depending on what kind of data is available, different forms of machine learning are distinguished [16]. If the labels associated with the data are unknown, methods of unsupervised learning have to be applied.

In this work, the position associated with every data sample is available in the training phase. Thus, supervised learning is performed, which requires the label of every data point to be known in the learning process. Fig. 2 schematically shows the hyperplane of a linear supervised learning model separating two classes of labeled data samples in two-dimensional space. Once the model is trained and its hyperplane is known, every new data sample can directly be allocated to one of the labeled classes. Typical tasks for supervised learning are handwritten digit recognition or coin recognition, where a lot of training data is available.

2.4.3 Generalization. Given an arbitrarily complex model, every pattern in a given data set can theoretically be learned. This carries the danger of overfitting: applying a much more complex model than necessary for the given data. In this context, the terms in-sample error and out-of-sample error [16] are helpful to understand the problem. While in most cases the in-sample error, the percentage of misclassified data samples in the available training set, can be decreased to zero by choosing a model of sufficient complexity, the out-of-sample error, which is the error rate on unseen data, can rise with higher model complexity. The ability to cope with unseen (or out-of-sample) data is known as generalization [16]. In general, the model's complexity should be as high as necessary, but as low as possible, even if the in-sample error then does not reach zero in the training phase, to ensure good generalization. Our approach uses simple, low-complexity linear models for the classification part, based on support vector machines and their associated learning algorithms.

Figure 2: Different classifiers. [(a) an arbitrary classifier; (b) a large-margin classifier.]

2.4.4 Support Vector Machines. Support vector machines (SVMs) are broadly applied models for supervised learning. Compared to other forms of supervised learning, they provide a so-called large-margin classification (see Fig. 2(b)) combined with a simple learning model, both of which play an important role in our localization system. With the distance of the nearest data points to the hyperplane being maximized in the classification process, the percentage of misclassified samples not contained in the training set (the out-of-sample error) is minimized, providing a high degree of generalization [16]. The essential steps to derive and understand the support vector machine are demonstrated below; for a more detailed derivation see [15] or [16].

We assume a linear model of the form

f(x_n) = w^T x_n + b, \quad (6)

with the labels y_n associated with the data samples x_n, while

y_n (w^T x_n + b) = 1 \quad (7)

holds for the data point closest to the separating hyperplane. Maximizing the margin, i.e., the distance of the nearest points to the separating hyperplane, results in the convex optimization problem (see [18]) given by:

\max_{w,b} \; c = \frac{1}{\lVert w \rVert} \quad \text{subject to} \quad y_n (w^T x_n + b) \geq 1 \;\; \forall\, n = 1, 2, \ldots, N. \quad (8)

The distance to the nearest data points is maximized under the condition that all training points are classified correctly. Solving the Lagrange primal problem (see [18]) gives:

w = \sum_{n=1}^{N} \lambda_n y_n x_n. \quad (9)

Solving the associated Lagrange dual problem provides the dual variables \lambda_1, \lambda_2, ..., \lambda_N. Most of them turn out to be zero. The Karush-Kuhn-Tucker condition of complementary slackness [18] demands that

\lambda_n \left( y_n (w^T x_n + b) - 1 \right) = 0 \quad (10)

holds for every n = 1, 2, ..., N. Since (7) only holds for the set of points nearest to the hyperplane, all \lambda_n associated with points x_n outside the margin have to be zero. Looking back at Equation (9) with this knowledge, it becomes clear that data points outside the margin do not affect the weight vector at all, unlike those touching the margin with \lambda_n \neq 0. They 'support' the separating hyperplane and are thus called support vectors. Given the trained model, nothing more than the support vectors has to be stored to perform a classification afterwards, as they completely describe the separating hyperplane.

If the p-dimensional data set can be separated by a (p-1)-dimensional hyperplane without producing any error, the data is called linearly separable [15]. The above derivation corresponds to a hard-margin linear SVM, which requires the data set to be perfectly linearly separable. The ability to classify non-linearly separable data is based on kernel SVMs, which are not treated further here.

In some cases, the data set is linearly separable in its general structure, with rare exceptions. In this case, a simple linear separating hyperplane is not necessarily a bad choice, as a few misclassified points can be accepted if this results in a lower-order model and a higher degree of generalization. The plain SVM cannot cope with such data sets, as the constraints cannot be fulfilled. Nevertheless, extensions of the SVM exist in which an error measure for the misclassified points is included in the optimization problem by introducing a slack variable [15]. These are called soft-margin SVMs.
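As an illustration (our own sketch, not code from the paper; the data is synthetic), the following Python snippet trains a soft-margin linear SVM with scikit-learn and reads off the support vectors that fully describe the hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic 2-D classes, linearly separable up to rare outliers.
X = np.vstack([rng.normal(-2.0, 1.2, (100, 2)), rng.normal(2.0, 1.2, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

# Soft-margin linear SVM: C penalizes margin violations (the slack terms);
# a smaller C tolerates a few misclassified points for a simpler model.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors are needed to reconstruct w and b.
print("support vectors:", len(clf.support_vectors_), "of", len(X))
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```

The printed support-vector count can be checked against the rule of thumb (11) below.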

To achieve good generalization, the following rule of thumb applies for support vector machines, as the number of support vectors is a measure of the model complexity [16]:

N_{SV} \leq \frac{N}{10}, \quad (11)

where N_{SV} is the number of resulting support vectors and N is the total number of training data points.

2.4.5 Dimensionality Reduction. Dimensionality reduction describes a variety of methods to reduce the dimensionality of a given data set while retaining most of the data's information content. This provides several advantages when applying machine learning methods to the data. The computational effort required to learn a certain pattern grows quickly with the dimensionality of the data, a problem known as the curse of dimensionality [15]. In addition, a lower number of dimensions in the training set improves generalization, as described by the Vapnik-Chervonenkis theory [17].

A popular method for reducing input dimensions is the principal component analysis (PCA). The PCA itself does not necessarily reduce the number of dimensions, but maps the c-dimensional input space of N observations to a new space of min(c, N) linearly uncorrelated variables, the so-called principal components of the principal subspace [15]. Hereby, the first principal component (PC) represents the direction of the largest possible variance in the data. All following PCs are orthogonal to each other, with their associated variance descending. Actual dimensionality reduction takes place when the smallest principal components are neglected in the training set, as they only represent a small amount of the information in the original set. The PCA is based on the linear transformation (12) of the input data H = (h_1, h_2, ..., h_N) \in R^{N \times c}, where each h_n = (h_n(1), h_n(2), ..., h_n(c)) denotes one c-dimensional observation, and X = (x_1, x_2, ..., x_N) \in R^{N \times \min(c,N)} denotes the output:

X = \Gamma^T \cdot H. \quad (12)

Hereby, the orthogonal transformation matrix \Gamma consists of the eigenvectors of the covariance matrix \Sigma_H of H, sorted by their associated eigenvalues in descending order. In our work, a PCA is applied to reduce the dimensionality of the training data for the SVM.
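A minimal NumPy sketch of this construction (our own illustration; note that with observations stored as rows, as in H above, the projection is written (H - mu) Gamma rather than Gamma^T H):

```python
import numpy as np

def pca_fit(H):
    """Gamma = eigenvectors of the covariance matrix of H, sorted by
    descending eigenvalue, as described for Eq. (12)."""
    mu = H.mean(axis=0)
    cov = np.cov(H - mu, rowvar=False)        # c x c covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort descending
    return mu, eigvecs[:, order], eigvals[order]

def pca_reduce(H, mu, Gamma, p):
    """Project observations onto the first p principal components."""
    return (H - mu) @ Gamma[:, :p]

# Toy data: N = 200 observations of dimension c = 50.
rng = np.random.default_rng(0)
H = rng.normal(size=(200, 50))
mu, Gamma, lam = pca_fit(H)
X = pca_reduce(H, mu, Gamma, p=10)            # reduced to 10 dimensions
```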

3 CONCEPT AND SYSTEM ARCHITECTURE

3.1 From Biology to Engineering: The Sound Path Block Model

Regarding the analogies between human hearing and the technical sound path, it is helpful to create a model in which the corresponding parts and their connections are clearly visible. Since both paths receive the exact same input (sound waves in their physical form) and produce nearly the same output, as the brain's and the machine learning system's output both represent the location of the sound source, these two components mark the beginning and the end of the block model in Fig. 3.

Starting at the left end of the block diagram, the physical sound is represented as white noise in the frequency domain, shown in the scope above the signal path. This choice was made not because white noise is the desired signal to localize in a real application, but to show the distortion introduced by the following two elements of the diagram. Assuming that this sound distortion causes trouble in the localization process is misleading; in fact, it is the determining factor for the ability to allocate different sound sources to their correct positions.

The following signal conversion is rather uninteresting for the consideration of the biological process as well as for the technical realization of localization, but it still has to be part of the sound path, as it represents the important transition from mechanical oscillations into the electrical signals to work with later. Note that the scopes behind the conversion part no longer show the frequency domain of the converted signal, but labeled points representing sound samples in a simplified two-dimensional feature space.

The last part is a decision-making process that can be modeled by several forms of machine learning, with the result of separated sound signals in a high-dimensional space. The last scope shows the process of learning, which does not in itself result in any information gain, as the locations of the sound samples used for learning must already be known. Determining the location subsequently requires classifying sound samples of unknown location into one of two classes by applying the final, trained system.

3.2 The Biological Model

Sound entering human ears is distorted by the HRTF of its receiver, depending on the direction it comes from. The modified sound waves propagate to the inner ear, where the spiral-shaped cochlea converts the mechanical oscillations of sound into electrical impulses using thousands of sensory cells. This data stream is then evaluated by our brain with the background knowledge of years of experience in hearing, resulting in an astonishing sound localization ability. To understand the biological process of localizing sound sources and to use this knowledge for a technical application, the task of transferring the process into a technical model arises.

Despite the fact that human localization hearing is mainly based on binaural cues like the ILD and ITD, the technical model is related to hearing with a single ear (monaural). Trying to obtain binaural cues technically leads to demands on temporal and spatial accuracy, neither of which is assumed to be available in this work. Instead, the technical modeling focuses on monaural cues.

Figure 3: The sound path block model. [Biology: outer ear, ear canal, cochlea, brain. Engineering: room impulse response, microphone directivity, ADC, machine learning. Stages along the path: sound, sound distortion, signal conversion, decision making, location.]

3.3 The Technical Model

Although a normal microphone is not equipped with an ear-like structure on the outside, it can be characterized as a direction-selective filter, as most microphones do not provide the omnidirectional characteristic desired for high-class measuring microphones. Nevertheless, this filter's impact is small in contrast to the room impulse response, which provides the important sound distortion for technical applications. The amplitude and especially the temporal sequence of the early reflections in the RIR differ according to the positions of both the sound source and the microphone. This fact constitutes the decisive point when localizing sound sources technically, as each series of reflections is almost unique and can be allocated to a combination of sender and receiver positions in most cases. 'Almost unique' refers to a few sound source locations that still cause confusion, as described in the simulation section. For the technical model, the microphone's position stays constant while the localizable target is equipped with a sound-emitting device. Since the sound distortion is reciprocal, both emitter and receiver are suitable to be the moving part of the system. The inner ear is modeled by an analog-to-digital converter (ADC), which is not considered further here. The complex decision making performed by the human brain is the biggest task to deal with when modeling the process of sound localization in engineering. To obtain suitable, preprocessed data for the subsequent machine learning part, the received sound is shortened by cutting out the important early reflections. Hereby, all digital samples before the arrival of the direct sound are neglected and the remaining sound is trimmed to a specific number of samples to learn with. By applying a PCA, the data dimensionality is further reduced.

The resulting data set of N observations, each consisting of p principal components, is then used as input for the machine learning system, which consists of two SVMs at each stage of classification, for the horizontal (ad or bc) and vertical (ab or cd) distinction, as shown in Fig. 4. An n-th order localization system can distinguish 4^n areas of location. Given a higher number of total observations and more accurate positions of the sound-emitting device, a higher-order localization system can be learned.

Figure 4: Sequential quaternary classification. [Axes x and y span the room; the first stage yields quadrants a-d, and a second stage subdivides quadrant a into the areas aa-ad.]
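One plausible Python rendering of a first-order stage (our own interpretation of Fig. 4, with synthetic features and hypothetical helper names): one linear SVM decides the horizontal half, a second decides the vertical half, and combining both answers selects one of the 4 quadrants. A second-order system would repeat this within each quadrant, yielding 4^2 = 16 areas.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Stand-in features: in the real system these are the first p principal
# components of the recorded sound; here they are synthetically tied to
# the 2-D source position through a random linear map plus noise.
pos = rng.uniform(0.0, 1.0, (400, 2))            # known (x, y) positions
A = rng.normal(size=(2, 100))
feats = pos @ A + 0.05 * rng.normal(size=(400, 100))

room_w = room_h = 1.0

# Stage 1: one SVM per axis; labels say which half the source lies in.
svm_x = SVC(kernel="linear").fit(feats, pos[:, 0] > room_w / 2)
svm_y = SVC(kernel="linear").fit(feats, pos[:, 1] > room_h / 2)

def quadrant(f):
    """Map a feature vector to one of the 4 first-order areas (0..3)."""
    right = bool(svm_x.predict(f[None])[0])
    upper = bool(svm_y.predict(f[None])[0])
    return int(right) + 2 * int(upper)

truth = int(pos[0, 0] > 0.5) + 2 * int(pos[0, 1] > 0.5)
print("predicted:", quadrant(feats[0]), "true:", truth)
```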

4 SIMULATION ENVIRONMENT AND RESULTS

To investigate the influence of certain parameters on the performance of the localization system, simulations are performed in the numerical computing environment MATLAB. The pseudocode in Algorithm 1 shows the schematic information flow implemented in the simulation, with N_{i,j} ~ N(0, \sqrt{q}), H = (h_1, h_2, ..., h_{N+N/2})^T, h_n = (h_n(1), ..., h_n(c)), X = (x_1, x_2, ..., x_{N+N/2})^T and x_n = (x_n(1), ..., x_n(p)).

Note that performing the PCA means calculating the transformation matrix \Gamma solely from the training set and afterwards transforming the complete data set with this \Gamma, as we assume the test set not to be available in the learning process. For reasons of simplicity, this is not shown in the pseudocode. The code also contains the abstract functions cut() and reduce(): cut trims the signals contained in H to the specified number of samples c, starting with the first sample that reaches a certain threshold, while reduce simply returns the given matrix trimmed to the first p principal components. The default parameters for the number of training sounds N, the number of principal components p, the noise intensity q and the cutoff sample length c are N = 256, p = 100, q = 0.0001 and c = 1000. Apart from the currently investigated parameter, all other parameters are fixed at their default values.


Algorithm 1 Simulation Routine

    Define room properties and stimulation signal s(k)
    Define default parameters N, q, p, c
    for all values of the investigated parameter do
        Change the investigated parameter
        for r = 1 to 100 do
            Create N + N/2 random sender positions
            Calculate the N + N/2 room impulse responses g_n(k)
            for n = 1 to N + N/2 do
                Calculate the received signals h_n(k) = s(k) * g_n(k)
            end for
            Add noise according to q: H = H + N
            Cut the signals: H = cut(H, c)
            Perform PCA (12): X = Γ^T · H
            Reduce dimensions: X = reduce(X, p)
            Add the first N rows of X to the training set
            Add the remaining N/2 rows of X to the test set
            Train the system with the training set
            Calculate e_r by classifying the test set
        end for
        Calculate the mean error rate e (13) over all runs
    end for

To obtain the training data for a specific position of sender and receiver, the stimulating sound s(k), a rectangular pulse, is convolved with the room impulse response g(k) of this specific location, given the properties of the room. After adding white Gaussian noise and applying the previously described cutoff, a PCA reduces the digital sound to a specific number of principal components. Besides the three parameters noise intensity, number of samples and PCA dimensionality, the number of training sounds used for learning is investigated in the simulation, as the described routine calculates N + N/2 sounds in total for the training and testing sets.
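For readers who prefer runnable code over pseudocode, here is a compact Python sketch of one simulation run (our own approximation: the paper's MATLAB RIR toolbox [11] is replaced by the toy RIR from Section 2.2, and the threshold, signal shapes and helper names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q, c = 256, 100, 1e-4, 1000            # the paper's default parameters

def toy_rir(length=4000):
    """Stand-in for a simulated RIR: decaying random early reflections."""
    g = np.zeros(length)
    g[np.sort(rng.integers(1, length, 30))] = 0.9 ** np.arange(1, 31)
    g[0] = 1.0                               # direct sound
    return g

def cut(H, c, threshold=0.5):
    """Trim each row to c samples, starting at the first sample >= threshold."""
    out = np.zeros((len(H), c))
    for i, h in enumerate(H):
        start = int(np.argmax(np.abs(h) >= threshold))
        seg = h[start:start + c]
        out[i, :len(seg)] = seg
    return out

s = np.zeros(512); s[:32] = 1.0              # rectangular stimulation pulse
H = np.stack([np.convolve(s, toy_rir()) for _ in range(N + N // 2)])
H += np.sqrt(q) * rng.normal(size=H.shape)   # additive white Gaussian noise
H = cut(H, c)

# PCA: Gamma is computed from the training rows only, then applied to all.
mu = H[:N].mean(axis=0)
_, _, Vt = np.linalg.svd(H[:N] - mu, full_matrices=False)
X = (H - mu) @ Vt[:p].T                      # reduce to p principal components
X_train, X_test = X[:N], X[N:]               # first N rows train, N/2 test
```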

4.1 Preliminary Investigations

Investigations of the optimal microphone position reveal localization problems if the recording device is located on one of the symmetry axes of the room. In this case, several positions produce the same RIR and cannot be distinguished. Not only the right position of a single microphone but also the use of a second microphone can increase the localization accuracy. The simulation reveals the position of the second microphone to be optimal if both microphones are placed on an imaginary circle centered inside the room, with an angle of 90 degrees between them.

4.2 Simulation Routine

For the following simulations according to Algorithm 1, two microphones are placed as described above. The measured quantity for all investigations is the mean error rate e, as defined in (13), averaged over r_max = 100 runs for every parameter value:

e = \frac{1}{r_{\max}} \sum_{r=1}^{r_{\max}} \frac{\#\,\text{misclassified test sounds in run } r}{\#\,\text{total test sounds}}. \quad (13)

In addition, the range of one standard deviation \sigma is given to assess the spread over all runs of the simulation.

4.3 Investigation Results

Each simulation has been performed with a second-order system in a room with dimensions 12 m × 9 m × 3 m and reflection factor |R| = 0.4. The microphones are located at m_1 = (11.2 m, 4.5 m, 2 m) and m_2 = (6 m, 0.5 m, 2 m). The simulation results are shown in Fig. 5.

4.3.1 Parameter: Noise Intensity. In the simulation, the noise intensity has been varied by changing the variance of the discrete additive white Gaussian noise. However, the error rate's dependency is difficult to interpret if the graph's abscissa shows the plain noise intensity; moreover, results given for an absolute value of q would be tailored to the specific sound stimulation assumed in the simulation. Therefore, the error rate is evaluated over the expected mean signal-to-noise ratio (SNR) resulting from a specific q. As expected, the simulation reveals that high levels of environmental noise disturb the localization system, causing an increased localization error. Especially for industrial applications, it is necessary to adapt the signal level (sound volume) to the environment to ensure a sufficient SNR for localization. Conversely, the effort of increasing the signal level or avoiding noise at any cost is not worthwhile if a sufficient SNR is already present, as the error rate saturates at a certain point.

4.3.2 Parameter: Cut-Off Samples. The number of samples is another parameter that can be optimized with regard to the environmental conditions. As described in earlier sections, the early reflections constitute the important part of the room impulse response in the context of reflection-based localization. As the RIR's characteristic becomes more diffuse and stochastic with progressing time, the amount of location-specific information decreases. The additional samples then provide no additional value for the system; they merely increase the dimensionality the system has to cope with, causing the error rate to rise.

Since the early reflections last longer in bigger rooms, the number of samples recorded has to be adapted to the environmental properties in order to optimize the localization results.

A possible explanation for the error rate decreasing again when the number of samples is increased further could be better generalization due to the spread of the data points caused by the small reverberation noise.

4.3.3 Parameter: Principal Components. With the information content of every additional principal component decreasing, the error rate saturates once enough information is contained in the training set. To get an impression of how much information is contained in one principal component, its corresponding eigenvalue can be divided by the total sum of all principal components' eigenvalues. The cumulative sum of these values up to the p-th component indicates how much variance (corresponding to the information content) of the original data is concentrated in a training set consisting of the first p components. Assuming the default system parameters, the first 20 principal components contain about 63%, 40 PCs about 81% and 60 PCs about 87% of the original data's variance.
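This cumulative eigenvalue fraction is straightforward to compute; a short illustrative sketch follows (synthetic data, so the printed percentages will not match the paper's 63/81/87% figures):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic observations with decaying variance per dimension.
H = rng.normal(size=(256, 1000)) * np.linspace(3.0, 0.1, 1000)

# Covariance eigenvalues in descending order, via SVD of the centered data.
sv = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
eigvals = sv**2 / (len(H) - 1)

frac = np.cumsum(eigvals) / eigvals.sum()
for p in (20, 40, 60):
    print(f"first {p} PCs capture {100 * frac[p - 1]:.0f}% of the variance")
```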


Figure 5: Simulation Results - Dependency of the mean error rate e on different parameters. [Four panels, each showing the mean error rate with a 1-σ area: (a) noise intensity (signal-to-noise ratio, -10 to 50 dB); (b) number of samples (0 to 2000); (c) number of principal components (0 to 100); (d) number of training sounds (0 to 1400).]

4.3.4 Parameter: Number of Training Sounds. The number of recorded sounds for the total system to learn with is the only parameter for which the error rate does not saturate at a certain point. The ability to record a specific number of sounds for each position is often limited by the time that can be spent training the system. The only disadvantages of a high number of training sounds are the increased effort of recording those sounds and the increased computation time to learn the model, where the latter is very small compared to the recording time. In general, it is worthwhile to produce as much training data as possible.

5 FIELD TEST

The proposed localization system is tested to verify its functionality under realistic conditions. For practical feasibility, we localize the microphone with a fixed speaker position. Microphones are much smaller than speakers that are able to produce the sound pressure level needed for localization, and thus they are more suitable to be the moving part of the system. This adjustment does not affect the ability to localize, since the mandatory unique sound distortion depending on the microphone position is still given.

Fig. 6 shows the localization scenario, in which four positions of the microphone (gray circles) located above a desk are to be distinguished, with about 0.5 m to 1.5 m between them, in a fairly large room (≈ 200 m²). The recordings have been obtained using a beyerdynamic MM-1 [19] measuring microphone and a Focusrite Scarlett 2i2 [20] audio interface. The stimulating sound is produced by a Neumann KH120A [21] studio monitor, decoupled from the desk to avoid influencing the microphones through vibrations. Measurements obtained with this semi-professional audio setup are of high quality (24 bit / 44.1 kHz) with a very low noise level. In addition, training sounds in the presence of real disturbances have been recorded, as shown for a measurement at position 1 in Fig. 7.

Figure 6: Field Test - Localization Scenario. [A speaker and four numbered microphone positions (1-4) around a desk.]


Figure 7: Field test with disturbance

Figure 8: Training signals at microphone positions 1-4

Figure 9: Principal components of the signals. [Scatter plot of the recorded signals projected onto the first two principal components.]

Fig. 8 shows 20 recorded signals for each of the microphone positions 1-4. The graphs consist of the first 10000 samples taken after the largest peak (the threshold, see Algorithm 1) in each recorded signal.
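This field-test preprocessing can be sketched in a few lines (our own illustration; the peak-alignment rule follows the description above, and the helper name is hypothetical):

```python
import numpy as np

def trim_after_peak(h, n=10000):
    """Keep n samples starting at the largest peak of the recording."""
    start = int(np.argmax(np.abs(h)))
    seg = h[start:start + n]
    return np.pad(seg, (0, n - len(seg)))    # zero-pad a too-short tail
```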

The first peak and the early reflections, the acoustic fingerprint of each individual position, are clearly visible, and this is reflected in Fig. 9, where only the first two principal components obtained by the PCA are displayed. Robustness against disturbance and noise is crucial in the presence of perturbations or worse recording quality. In further research, a scenario with large disturbances, as typically present in industrial settings, will be investigated.

6 CONCLUSION

In this work, a localization framework based on sound reflections and distortions is proposed, inspired by the natural ability of humans to localize sound sources. Its properties and dependencies on several parameters have been investigated by simulation, leading to largely expected results. Based on these investigations, real signals have been recorded and separated in a localization scenario, with the conclusion that reflection-based sound localization is possible not only in theory but also in a practical application. While the field test shows good prospects for this area, robustification of the localization approach can be considered the crucial task for further research, especially regarding industrial applications.

ACKNOWLEDGMENT

This project has been funded by the Federal Ministry of Education and Research (BMBF) under grant 01IS17073. This work has also been supported in part by TUBITAK under grant 115E827.

REFERENCES

[1] C. Chen, Y. Chen, H.-Q. Lai, Y. Han, K. J. R. Liu; "High accuracy indoor localization: A WiFi-based approach" in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016).
[2] J.-M. Valin, F. Michaud, J. Rouat, D. Létourneau; "Robust Sound Source Localization Using a Microphone Array on a Mobile Robot" in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1228-1233 (2003); doi: 10.1109/IROS.2003.1248813
[3] P. M. Hofman, J. G. A. Van Riswick, A. J. Van Opstal; "Relearning sound localization with new ears" in Nature Neuroscience 1, 417-421 (1998); doi: 10.1038/1633
[4] A. Saxena, A. Y. Ng; "Learning Sound Location from a Single Microphone" in Proc. IEEE International Conference on Robotics and Automation (ICRA '09), pp. 1737-1742 (2009); doi: 10.1109/robot.2009.5152861
[5] A. J. R. Simpson, G. Roma, M. D. Plumbley; "Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network"; arXiv:1504.04658
[6] M. S. Brandstein, H. F. Silverman; "A practical methodology for speech source localization with microphone arrays" in Computer Speech & Language, Vol. 11, Issue 2, pp. 91-126 (1997); doi: 10.1006/csla.1996.0024
[7] D. V. Rabinkin, R. J. Renomeron, A. J. Dahl, J. C. French, J. L. Flanagan, et al.; Proc. SPIE 2846, Advanced Signal Processing Algorithms, Architectures, and Implementations VI, 88 (October 22, 1996); doi: 10.1117/12.255464
[8] J. Weng, K. Y. Guentchev; "Three-dimensional sound localization from a compact non-coplanar array of microphones using tree-based learning" in The Journal of the Acoustical Society of America 110, 310 (2001); doi: 10.1121/1.1377290
[9] R. Feynman; "The Feynman Lectures on Physics, Volume 1", Addison-Wesley (1969); ISBN: 978-0201021158
[10] H. Kuttruff; "Akustik: Eine Einführung", S. Hirzel (2004); ISBN: 978-3777612447
[11] S. McGovern; "Room Impulse Response Generator"; URL: https://de.mathworks.com/matlabcentral/fileexchange/5116-room-impulse-response-generator
[12] J. Blauert; "Sound localization in the median plane" in Acustica 22:205-213 (1969)
[13] Lord Rayleigh O.M. Pres. R.S.; "XII. On our perception of sound direction" in Philosophical Magazine, Vol. 13, Iss. 74 (1907); doi: 10.1080/14786440709463595
[14] I. P. Pavlov; "Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex", Martino Fine Books (Reprint, 2015); ISBN: 978-1614277989
[15] C. M. Bishop; "Pattern Recognition and Machine Learning", Springer (2007); ISBN: 978-0-387-31073-2
[16] Y. S. Abu-Mostafa; "Learning From Data", AMLBook (2012); ISBN: 978-1600490064
[17] V. Vapnik; "The Nature of Statistical Learning Theory", Springer (1995); ISBN: 978-1-4757-3264-1
[18] S. Boyd, L. Vandenberghe; "Convex Optimization", Cambridge University Press (2004); ISBN: 978-0521833783
[19] beyerdynamic GmbH & Co. KG, Heilbronn; URL: www.beyerdynamic.com
[20] Focusrite Plc.; URL: www.focusrite.com
[21] Georg Neumann GmbH, Berlin; URL: www.neumann.com

