Audio processing: lab sessions

Session 4: Binaural synthesis and 3D audio

Alexander Bertrand and Giuliano Bernardi¹

December 2014

• Goals: I) Reproducing a virtual sound source by means of binaural synthesis; II) Binaural synthesis with cross-talk cancellation

• Required files from course website: sim_environment.zip, speech1.wav, speech2.wav, HRTF.mat (download from http://homes.esat.kuleuven.be/~dspuser/dasp/ )

• Required files from previous sessions: create_micsigs.m

• Outcome: 2 m-files: binaural_synthesis.m, crosscancel.m

• Deliverables: GSC_VAD.m, crosscancel.m

Remark: The experiments in this session are computationally intensive. Therefore, set the sampling frequency to fs = 8000 Hz in all experiments.

Introduction

Consider the speech communication system depicted in Fig. 1. In a (noisy) recording room (left-hand side), the DOA of a target speech source is estimated by means of the MUSIC algorithm, after which the speech signal is recorded and denoised by means of a beamformer. The speech signal is then played back in a far-end listening room (right-hand side), where the original position of the target speech source is reproduced by means of a 3D audio effect using an array of loudspeakers. The system in the recording room has been developed in sessions 2 and 3. In this session, we will focus on the right-hand side of Fig. 1, i.e., generating a 3D audio effect by means of a loudspeaker array.

Exercise 4-1: Binaural synthesis

1. Write a new m-file binaural_synthesis.m to generate a binaural signal as follows (a minimal sketch is given after this list):

• Read out the wav file speech1.wav, and resample it to 8 kHz (see session 1). Truncate the signal to a length of 10 seconds and store it in the variable x.

• Make two copies of x and store both of them in the columns of a 2-column matrix variable binaural_sig1_1 = [x x], which will serve as a binaural signal where the first column is played back at the left ear, and the second column is played back at the right ear.

• Create a second binaural signal as binaural_sig2_1 = [x 0.5*x], i.e., decrease the amplitude of the right channel (second column) by 50%.

¹ For questions and remarks: [email protected]

Figure 1: Speech communication with 3D audio effects.

• Create a third binaural signal binaural_sig3_1, which is a copy of binaural_sig1_1, but with the right channel delayed by 3 samples.

• Create a fourth binaural signal binaural_sig4_1 by filtering the left and the right channel of binaural_sig1_1 with the filters in the first and second column of the matrix variable in HRTF.mat, respectively. The two filters in HRTF.mat are head-related transfer functions (HRTFs) for the left and the right ear.

• Repeat this procedure for the second wav file (speech2.wav) to generate four new binaural signals binaural_sig1_2, binaural_sig2_2, binaural_sig3_2, and binaural_sig4_2, but this time reverse the left and right channels, e.g., binaural_sig2_2 is then defined as binaural_sig2_2 = [0.5*x x].

• Generate binaural_sig1 = binaural_sig1_1 + binaural_sig1_2, and proceed similarly for binaural_sig2, binaural_sig3, and binaural_sig4.
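
The bullets above translate into a short script. Below is a minimal sketch of binaural_synthesis.m under a few assumptions: audioread/resample are used for reading and resampling (as in session 1), and the matrix variable inside HRTF.mat is assumed to be called HRTF, with the left-ear filter in column 1 and the right-ear filter in column 2. Adapt these names to the actual file contents.

    % binaural_synthesis.m -- minimal sketch (see the assumptions in the text above)
    fs = 8000;                                   % sampling frequency used throughout
    load('HRTF.mat');                            % assumed to contain a 2-column matrix HRTF
                                                 % (column 1 = left ear, column 2 = right ear)

    % -- speech1.wav: read, resample to 8 kHz, truncate to 10 s --
    [x1, fs1] = audioread('speech1.wav');        % wavread on older MATLAB releases
    x = resample(x1(:,1), fs, fs1);
    x = x(1:10*fs);                              % assumes the file is at least 10 s long

    % -- four binaural versions of speech1 --
    binaural_sig1_1 = [x x];                           % identical left/right channels
    binaural_sig2_1 = [x 0.5*x];                       % right channel attenuated by 50%
    binaural_sig3_1 = [x [zeros(3,1); x(1:end-3)]];    % right channel delayed by 3 samples
    binaural_sig4_1 = [filter(HRTF(:,1),1,x) filter(HRTF(:,2),1,x)];   % HRTF-filtered

    % -- speech2.wav: same four manipulations, with left/right reversed --
    [x2, fs2] = audioread('speech2.wav');
    x = resample(x2(:,1), fs, fs2);
    x = x(1:10*fs);
    binaural_sig1_2 = [x x];
    binaural_sig2_2 = [0.5*x x];
    binaural_sig3_2 = [[zeros(3,1); x(1:end-3)] x];
    binaural_sig4_2 = [filter(HRTF(:,2),1,x) filter(HRTF(:,1),1,x)];   % channels swapped

    % -- sum the contributions of both speech files per manipulation --
    binaural_sig1 = binaural_sig1_1 + binaural_sig1_2;
    binaural_sig2 = binaural_sig2_1 + binaural_sig2_2;
    binaural_sig3 = binaural_sig3_1 + binaural_sig3_2;
    binaural_sig4 = binaural_sig4_1 + binaural_sig4_2;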

2. If you have headphones or earphones²: listen to the created binaural signals binaural_sig1, binaural_sig2, binaural_sig3, and binaural_sig4 through the head/earphones. Make sure that the left channel is played back by the left loudspeaker and the right channel by the right loudspeaker (use the command soundsc). Which effect do you observe? Pay special attention to binaural_sig3 (note that, remarkably enough, there was no manipulation of the amplitude of the signal!).
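
As a reminder of the playback call (a one-line usage example): soundsc scales the signal automatically and plays a two-column matrix in stereo, first column left, second column right.

    fs = 8000;
    soundsc(binaural_sig1, fs);   % column 1 -> left channel, column 2 -> right channel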

3. Explain the effect of the different manipulations that have been used to create binaural_sig2, binaural_sig3, and binaural_sig4.

4. After listening to binaural_sig4: can you make a rough estimate of the DOA of the sound source that was used to measure the HRTFs? Was it measured in a reverberant room or in an anechoic room?

Exercise 4-2: 3D audio with crosstalk cancellation

The goal of this exercise is to design the FIR filters g_j[t] in Fig. 1 such that the listener has the impression that the signal played by the loudspeaker array is coming from a given target direction.

² Borrow a pair from the TA if you don't have any.

Let X_L(z) and X_R(z) denote the left and right HRTF corresponding to this target direction. We then design the filters G_j(z), j = 1, ..., M (in the z-domain) such that (explain!)

\[
\begin{bmatrix}
H_{1L}(z) & H_{2L}(z) & \cdots & H_{ML}(z) \\
H_{1R}(z) & H_{2R}(z) & \cdots & H_{MR}(z)
\end{bmatrix}
\begin{bmatrix}
G_1(z) \\ G_2(z) \\ \vdots \\ G_M(z)
\end{bmatrix}
=
\begin{bmatrix}
X_L(z) \\ X_R(z)
\end{bmatrix}
\tag{1}
\]

where H_{jL}(z) and H_{jR}(z) denote the acoustic transfer functions from loudspeaker j to the left and right ear of the listener, respectively.

In the sequel, we will assume that the RIRs corresponding to H_{jL}(z) and H_{jR}(z), j = 1, ..., M, are known, e.g., by using training sequences during a calibration phase (this is beyond the scope of this exercise).

1. Using pen and paper, formulate the discrete time-domain representation of (1), where it is assumed that all filters are FIR. Use a matrix description of the different convolution operations by replacing G_j(z) with a vector of length Lg, and H_{jL}(z) with an (Lh + Lg − 1) × Lg Toeplitz matrix, where Lh denotes the length of the RIR corresponding to H_{jL}(z), and where Lg denotes the length of the (unknown) FIR filter corresponding to G_j(z) (we assume that Lg and Lh are independent of j and of L/R). Keep the paper with the derivation with you during the milestone (a small numerical check of the Toeplitz description is sketched below, after the hint).

Hint: Refer to the part about MINT in the uploaded session4_2014_support_material.pdf.
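
As a numerical sanity check of this matrix description (not a replacement for the pen-and-paper derivation), the Signal Processing Toolbox function convmtx builds exactly such an (Lh + Lg − 1) × Lg convolution (Toeplitz) matrix; multiplying it with a filter g reproduces conv(h, g). The variables h, g, Lh, Lg below are purely illustrative:

    % convolution as a Toeplitz matrix-vector product (illustrative check)
    Lh = 6;  Lg = 4;
    h = randn(Lh, 1);             % stands in for one RIR h_jL
    g = randn(Lg, 1);             % stands in for one FIR synthesis filter g_j
    T = convmtx(h, Lg);           % (Lh+Lg-1) x Lg Toeplitz convolution matrix
    max(abs(T*g - conv(h, g)))    % should be (numerically) zero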

2. The time-domain representation of (1) results in a system of equations (SOE)

Hg = x (2)

in the unknowns g = [g_1^T ... g_M^T]^T, where g_j is a vector of length Lg. Derive a condition for the (minimum) value of Lg as a function of M and Lh, such that the SOE (2) can be solved exactly. What is the minimum number of loudspeakers M required to create a perfect 3D audio impression?

3. The HRTFs X_L(z) and X_R(z) are modeled as FIR filters. However, in the SOE (2), a delayed version of the HRTFs is typically included, i.e., z^{-Δ} X_L(z) and z^{-Δ} X_R(z), with Δ a pre-defined delay. Explain why this delay is added.

Hint: Pay special attention to the first equation and the (Lh + Lg)-th equation of the SOE and assume, e.g., that both HRTFs are equal to e_1 = [1 0 ... 0]^T, i.e., the target direction is 90°. If all the loudspeakers are closer to the left ear than to the right ear, a problem will occur in the equations. Another problem will occur if there is a loudspeaker that is closer to both ears of the listener compared to all other loudspeakers.

4. We will use the RIRs computed by mySA_GUI.m to generate the signals impinging at the left and right ear of the listener in Fig. 1. To this end, create a microphone array with two microphones with an inter-microphone distance of 15 cm (these microphones represent the two ears of the listener). Add M = 5 sound sources around the microphone array, and make sure that all of them are at a similar distance from the microphone array (these will represent the five loudspeakers). Set T60 = 0.5 s and set the sampling rate to 8 kHz. Store the resulting impulse responses, and use them in the sequel.

5. Write a new m-file crosscancel.m performing the following tasks (a sketch of the main steps is given after this list):

• Load Computed_RIRs.mat and set the length of your RIRs to 400 for now (this is to obtain a reasonable computation time when solving (2)).

• Create two vector variables xL and xR that represent the two HRTFs (for now, define both HRTFs as the one-tap filter [1], to create a target source in the frontal direction). In the sequel we assume that the length of the HRTFs is much smaller than Lh (note that this may require a truncation of the HRTFs if this assumption is not satisfied).

• Construct the matrix H and the vector x of the SOE (2). Do not forget to add the delay Δ (see part 3 of this exercise). Set this delay to, e.g.,

Delta = ceil(sqrt(room_dim(1)^2 + room_dim(2)^2)*fs_RIR/340)

Explain why this is a safe choice.

• Remove the all-zero rows in H, and the corresponding elements in x, and then solve the SOE (2) (hint: use g = H\x in MATLAB). Does the removal of the all-zero rows in H influence the solution? Why (not)?

• Create a figure that plots the result of H*g (in blue), together with the vector x (in red), and print the value synth_error = norm(H*g - x) in the MATLAB command window.

• Read out the wav file speech1.wav, and resample it to 8 kHz. Truncate the signal to a length of 10 seconds.

• Create a 2-column matrix variable binaural_sig which contains the two signals observed by the left and right ear of the listener in Fig. 1, when the array of 5 loudspeakers defined with mySA_GUI.m plays the speech signal speech1.wav filtered with the 5 synthesis filters stacked in g (do not forget to resample the speech signal to 8 kHz).
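
Putting these bullets together, a possible skeleton of crosscancel.m is sketched below. It is only an outline under several assumptions: Computed_RIRs.mat is assumed to contain a 3-D array RIR_sources (samples × 2 ears × M loudspeakers), the sampling rate fs_RIR (8000 Hz) and the room dimensions room_dim; convmtx requires the Signal Processing Toolbox (toeplitz can be used instead); and Lg = 300 is just one value that is large enough for Lh = 400 and M = 5, to be replaced by the condition you derive in part 2. Adapt names and values to your own setup.

    % crosscancel.m -- minimal sketch of the steps above.
    % The variable names loaded from Computed_RIRs.mat (RIR_sources, fs_RIR, room_dim)
    % are assumptions about the GUI output; adapt them to the actual file contents.
    load('Computed_RIRs.mat');

    Lh  = 400;                                % truncated RIR length (for now)
    RIR = RIR_sources(1:Lh, :, :);            % assumed size: samples x 2 ears x M loudspeakers
    M   = size(RIR, 3);
    Lg  = 300;                                % must satisfy the condition from part 2
                                              % (large enough here for Lh = 400, M = 5)

    xL = 1;  xR = 1;                          % one-tap HRTFs: frontal target source

    % safe synthesis delay (see part 3); fs_RIR is assumed to be 8000 Hz
    Delta = ceil(sqrt(room_dim(1)^2 + room_dim(2)^2)*fs_RIR/340);

    % build H by horizontally stacking one convolution (Toeplitz) matrix per
    % loudspeaker, with the left-ear block on top of the right-ear block
    HL = [];  HR = [];
    for j = 1:M
        HL = [HL convmtx(RIR(:,1,j), Lg)];    % loudspeaker j -> left ear
        HR = [HR convmtx(RIR(:,2,j), Lg)];    % loudspeaker j -> right ear
    end
    H = [HL; HR];

    % right-hand side: delayed HRTFs, zero-padded to Lh+Lg-1 samples per ear
    xvecL = zeros(Lh+Lg-1, 1);  xvecL(Delta+1 : Delta+length(xL)) = xL;
    xvecR = zeros(Lh+Lg-1, 1);  xvecR(Delta+1 : Delta+length(xR)) = xR;
    x = [xvecL; xvecR];

    % remove all-zero rows of H (and the corresponding entries of x), then solve (2)
    keep = any(H, 2);
    H = H(keep, :);  x = x(keep);
    g = H \ x;

    synth_error = norm(H*g - x)
    figure; plot(H*g, 'b'); hold on; plot(x, 'r'); legend('H*g', 'x');

    % filter the speech signal with the M synthesis filters and simulate the ear signals
    [s, fs_s] = audioread('speech1.wav');
    s = resample(s(:,1), 8000, fs_s);
    s = s(1:10*8000);
    G = reshape(g, Lg, M);                    % one synthesis filter g_j per column
    binaural_sig = zeros(length(s), 2);
    for j = 1:M
        lsig = filter(G(:,j), 1, s);          % signal played by loudspeaker j
        binaural_sig(:,1) = binaural_sig(:,1) + filter(RIR(:,1,j), 1, lsig);
        binaural_sig(:,2) = binaural_sig(:,2) + filter(RIR(:,2,j), 1, lsig);
    end
    soundsc(binaural_sig, 8000);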

6. Test your code on the scenario mentioned earlier (and listen to the binaural signal binaural_sig; it should not be distorted).

Only if synth_error is smaller than 10^-10 can you proceed to the next step.

7. Re-run the experiment, but set length_RIR in the GUI to 1500 taps (using larger values is at your own risk...). Note that the computation time may become quite large (up to 1-2 minutes).

Only if synth_error is smaller than 10^-10 can you proceed to the next step.

8. Manipulate the HRTFs xL and xR in the same way as in exercise 4-1 (to create 4 different binaural signals), and listen to the results. Do you hear a similar effect? Is there still an effect from the crosstalk?

9. Can you create a target source coming from the left-hand side of the listener if all the loudspeakers are on the right-hand side of the listener (as in the setup of Fig. 1)? Why (not)?

10. Is the sweet spot, i.e., the region in which the listener can move while still hearing the correct 3D audio effect, larger or smaller than 5 cm? You can measure this by using a 5-microphone array with an inter-microphone distance of 5 cm. Compute the synthesis filters based on microphones 1 and 4, and listen to microphone signals 2 and 5.

Remark: Note that this is not a realistic experiment to measure the actual sweet spot. In practice, one will need a more accurate model of the acoustic transfer functions (i.e., at a much higher sampling rate; see also session 1).

11. Add white noise to the elements in H with a standard deviation of 5% of the standard deviation of the elements in the first column of H. This noise models estimation errors during the calibration phase when identifying the RIRs. Re-run the experiment with these noisy RIRs, and analyze the outcome (i.e., design g based on the noisy RIRs and analyze the binaural signals with the exact RIRs). What does this tell you about the sensitivity of the crosstalk cancellation procedure with respect to modeling errors?
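
One possible reading of this perturbation (the noise standard deviation is taken as 5% of the standard deviation of the entries of the first column of H, and the exact H is kept for evaluation) is the following minimal sketch:

    % perturb H with white Gaussian noise (std = 5% of the std of H's first column)
    sigma   = 0.05 * std(H(:,1));
    H_noisy = H + sigma * randn(size(H));

    g_noisy = H_noisy \ x;        % design the filters from the noisy RIRs ...
    norm(H*g_noisy - x)           % ... but evaluate the synthesis error with the exact H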

Remark: Note that there exist more advanced techniques to make the method more robust to modeling errors. However, these are beyond the scope of this session.

Milestone Demo 2

When you are ready with the above exercises, e-mail the two m-files GSC_VAD.m (ex. 3-4) and crosscancel.m (ex. 4-2) in one zip package named

Milestone2_S<1,2>_group0<1,8>.zip,

where the first brackets contain your slot number (1 for Wednesday labs, 2 for Thursday labs) and the second brackets contain your group number, to

[email protected],

and add the group number and all the names of the group members in the source code, as well as in the e-mail. Do not send additional files.

If you finish during the session, call the supervising TA and give a demo. Otherwise, you give the demo at the beginning of the next session. In this demo, you briefly demonstrate the two m-files listed above on scenarios that are defined on the spot by the TA. The TA may ask additional questions to assess your insights (some of these questions are already listed throughout the task descriptions).
