
Multichannel speech intelligibility and talker recognition using monaural, binaural, and three-dimensional auditory presentation

Rob Drullman a) and Adelbert W. Bronkhorst
TNO Human Factors Research Institute, Department of Perception, P.O. Box 23, 3769 ZG Soesterberg, The Netherlands

(Received 13 January 1999; revised 19 July 1999; accepted 17 December 1999)

In a 3D auditory display, sounds are presented over headphones in a way that they seem to originate from virtual sources in a space around the listener. This paper describes a study on the possible merits of such a display for bandlimited speech with respect to intelligibility and talker recognition against a background of competing voices. Different conditions were investigated: speech material (words/sentences), presentation mode (monaural/binaural/3D), number of competing talkers (1–4), and virtual position of the talkers (in 45° steps around the front horizontal plane). Average results for 12 listeners show an increase of speech intelligibility for 3D presentation for two or more competing talkers compared to conventional binaural presentation. The ability to recognize a talker is slightly better and the time required for recognition is significantly shorter for 3D presentation in the presence of two or three competing talkers. Although absolute localization of a talker is rather poor, spatial separation appears to have a significant effect on communication. For either speech intelligibility, talker recognition, or localization, no difference is found between the use of an individualized 3D auditory display and a general display. © 2000 Acoustical Society of America. [S0001-4966(00)01104-3]

PACS numbers: 43.66.Pn, 43.66.Qp, 43.72.Kb [DWG]


INTRODUCTION

In various communication systems, such as those used for teleconferencing, emergency telephone systems, aeronautics, and (military) command centers, there may be a need to monitor several channels simultaneously. Conventional systems present speech over one or two channels, which may lead to reduced intelligibility in critical situations, i.e., when more than two talkers are talking at the same time. Alternatively, the signals can be presented by means of a 3D auditory display, where sounds presented over headphones are filtered binaurally in such a way that they seem to originate from virtual sources in a space around the listener. As in normal (nonheadphone) listening, the capacities of the human auditory system are used much better with such a 3D system, particularly with respect to sound localization and spatial separation. Spatial separation of the voices improves speech perception ("cocktail party effect," cf. Cherry, 1953) and may also facilitate the identification of the talkers.

Spatialized or 3D audio over headphones is obtained by filtering an incoming signal according to head-related transfer functions (HRTFs). These transfer functions are an essential part of a 3D auditory display, because they simulate the acoustic properties of the head and ears of the listener, on which spatial hearing is based. HRTFs are essentially a set of filter pairs that contain the directional information of the sound source as it reaches the listener's eardrums. When listening over headphones, substituting the transfer from headphone to eardrums by the HRTFs results in the perception of a virtual sound outside the head of the listener.

a) Electronic mail: [email protected]

2224 J. Acoust. Soc. Am. 107 (4), April 2000 0001-4966/2000/107(4)/2224/12/$17.00 © 2000 Acoustical Society of America


Thus, an external sound source can be simulated for any direction for which the HRTFs exist.
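As an illustration of the filtering principle described above, the sketch below convolves a mono signal with a left/right pair of head-related impulse responses (HRIRs, the time-domain counterpart of HRTFs). The two toy "HRIR" filters are placeholders, not measured data; a real display would use filters measured at the listener's ears.

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Render a mono signal at a virtual position by convolving it with
    the HRIR pair for that direction; returns an (n, 2) left/right array."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    n = max(left.size, right.size)
    left = np.pad(left, (0, n - left.size))
    right = np.pad(right, (0, n - right.size))
    return np.column_stack([left, right])

# Toy HRIR pair: the right-ear filter is delayed and attenuated,
# crudely mimicking a source on the listener's left.
hrir_l = np.array([1.0, 0.3])
hrir_r = np.array([0.0, 0.0, 0.6, 0.2])

signal = np.random.default_rng(0).standard_normal(1000)
binaural = spatialize(signal, hrir_l, hrir_r)
print(binaural.shape)   # (1003, 2)
```

The interaural delay and level difference introduced by the filter pair are exactly the cues the auditory system uses for lateralization; measured HRIRs add the direction-dependent spectral shaping of the pinnae on top of this.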

Several studies on the efficacy of 3D auditory displays for speech communication have shown positive results. These results were obtained by both HRTF processing and more generic binaural listening techniques. Bronkhorst and Plomp (1992) used artificial-head (KEMAR) recordings of short sentences in frontal position and temporally modulated speech noise (simulating competing talkers) at various other azimuths in the horizontal plane. They evaluated intelligibility in terms of the speech-reception threshold (SRT), i.e., the speech-to-noise ratio needed for 50% intelligibility. For normal-hearing listeners, the gain occurring when one to six noise maskers were moved from the front to positions around the listener varied from 1.5 to 8 dB, depending on the number of maskers and on their positions.

Begault and Erbe (1994) used nonindividualized HRTF filtering for spatializing four-letter words ("call signs") against a background of diotic multitalker babble. With naive listeners, an advantage of up to 6 dB in SRT was found for 3D presentation (60 and 90° azimuths) compared with diotic presentation. In a subsequent study, Begault (1995) used words against diotic speech noise. Again, at ±90° azimuth an average advantage of 6 dB in SRT re diotic presentation was found. Ricard and Meirs (1994) used nonindividualized HRTFs to measure the SRT of synthetic speech in the horizontal plane against a bandlimited white-noise masker in frontal position. They found an average maximum threshold decrease of 5 dB when the speech source was shifted to ±90° azimuth.

The question of how intelligibility is affected when multiple talkers are presented at different (virtual) positions, in a



way that each talker could be understood per se (a situation different from, and presumably more difficult than, using a noise masker) was recently studied by a number of authors. Crispien and Eherenberg (1995) used HRTF filtering for four concurrent talkers, each at a different azimuth and elevation, pronouncing short sentences. While listeners knew the position of the desired talker and the same stimuli were presented three times, the intelligibility scores for words (not entire sentences) were on average 51%. In a simulated cocktail-party situation, Yost et al. (1996) used bandlimited speech (words) uttered by up to three simultaneous talkers. Speech was presented over seven possible loudspeakers in a semicircle around either a human listener, a single microphone, or a KEMAR. With three concurrent talkers, average word intelligibility for all utterances together was similar (about 40%) for live listening and listening to KEMAR recordings, whereas monaural listening scored only 18%. Peissig and Kollmeier (1997) measured subjective SRTs with HRTF filtering for sentences at 0°, masked by maximally three concurrent talkers at various azimuths. For normal-hearing listeners the maximum gains relative to presenting all talkers at 0° were 8 and 5 dB, for conditions with two and three competing talkers, respectively. Ericson and McKinley (1997) measured sentence intelligibility for two and four concurrent talkers in pink noise (65 and 105 dB SPL, respectively), using a speech-to-noise ratio of 5–10 dB. Subjects had to reproduce sentences that contained a certain call sign (i.e., the talker to monitor was not fixed). Diotic, dichotic, and directional presentations (KEMAR recordings in the horizontal plane) were compared. With two talkers, scores were more than 90% for both dichotic and directional presentation when the two talkers were separated by at least 90°. With four talkers and low-level noise, the advantage of directional over diotic presentation with a mixed group of male and female talkers was maximally about 30% (90° separation of the talkers). Finally, Hawley et al. (1999)
used up to four concurrent sentences from one talker presented over either loudspeakers or headphones (KEMAR recordings) in seven azimuths in the front horizontal plane. The azimuth configurations varied by taking different minimum angles between target and nearest competing speech. When all sentences originated from different positions, keyword scores for three competing sentences reached about 90% when the nearest competing sentence was 120° or more from the target, and were lower when the nearest competing sentence was 30–90° from the target.

In the present study, two types of speech material were used (words, sentences), spoken by up to five concurrent talkers, and the virtual positions of the talkers in the horizontal plane were varied systematically. The listener's task was to attend to a single target talker in the presence of one or more competing talkers. Presentation of the speech signals was done monaurally, binaurally, or via 3D audio. In addition, two modes of 3D presentation were used: one with individualized HRTFs and one with general HRTFs. Individualized HRTFs were adapted to the specific acoustic characteristics of the individual listener and had to be measured for each person, whereas general HRTFs were obtained from just one person and were used by all listeners. Several studies (Wightman and Kistler, 1989b; Wenzel et al., 1993;



Bronkhorst, 1995) have demonstrated that individualized HRTFs, as well as individualized headphone calibration (Pralong and Carlile, 1996), are important for correct localization of virtual sound sources, but their use may be less relevant for speech intelligibility (cf. Begault and Wenzel, 1993).

Apart from speech intelligibility, two more aspects were investigated in this study, viz., talker recognition and talker localization. These points are relevant within certain contexts (e.g., teleconferencing, military communication), as it is not only important to know what is being said, but also who and where the talker is. Except for Pollack et al. (1954), who used different combinations of two concurrent talkers, there has not to our knowledge been any study on the effect of multiple talkers on talker identification or verification, either with humans or machines.

In summary, the present study extends the approaches taken by previous studies with multiple talkers in a number of ways: (1) two types of speech material were employed to measure intelligibility; (2) testing was done with both individualized and general HRTFs (most of the earlier studies used KEMAR); (3) monaural (monotic) and binaural (dichotic) conditions were studied in addition to 3D conditions; and (4) talker recognition and talker localization were measured in addition to speech intelligibility.

Many of the applications employed for 3D audio, as mentioned in the first paragraph, make use of radio or telephone communication. Hence, listeners hear the speech signals through a limited bandwidth. Begault (1995) has shown positive results of 3D audio for both full (44.1-kHz sampling) and low (8-kHz sampling) bandwidth systems. Yost et al. (1996) used utterances low-pass filtered at 4 kHz. In order to obtain a reliable estimate of the performance of a 3D auditory display in critical situations, all speech signals in the present experiments were bandlimited to 4 kHz. This was the only restriction to the signals; no extra deteriorating effects such as speech coding were used.

I. HEAD-RELATED TRANSFER FUNCTIONS

A. Measurement of individual HRTFs

Prior to the speech intelligibility and talker recognition experiments, the HRTFs of each individual subject were measured. The HRTF measurement setup is situated in an anechoic room and consists of a chair for the subject and a rotatable arc with a movable trolley on which the sound source is mounted. The source is a Philips AD 2110/SQ8 midrange tweeter with a frequency transfer of 0.25–20 kHz. The distance from the source to the center of the arc is 1.14 m. The stimulus is a computer-generated time-stretched pulse (Aoshima, 1981), equalized to compensate for the nonlinear response of the tweeter.

The sound reaching the subject's ears was recorded by two miniature microphones (Sennheiser KE-4-211-2) mounted in a foam earplug and inserted into the ear canals. This blocked-ear-canal method is different from measurements with miniature probe microphones (open-ear-canal measurements), where the sound is recorded close to the eardrum. Generally, either method can give good localization performance



(Wightman and Kistler, 1989a; Pralong and Carlile, 1996; Bronkhorst, 1995; Pösselt et al., 1986; Møller et al., 1995; Hartung, 1995). An advantage of the blocked-ear-canal method is the stability of the microphone position during the entire measuring process (including the subsequent headphone measurement), and a better signal-to-noise ratio for high frequencies compared to using probe microphones.

A subject was seated on the chair in the center of the arc (reference position). In order to assure the stability of the positions measured, head movements were monitored with a head tracker (Polhemus ISOTRAK). If the deviation from the reference position was more than 1 cm or 5°, the subject was instructed (by means of auditory feedback) to adjust his/her position. Angular deviations smaller than 5° were compensated for by adjusting the position of the sound source.

A total of 965 positions were measured (more than strictly needed for the present experiments), evenly divided over 360° azimuth and all elevations above −60° (60° below the horizontal plane), with a resolution of 5–6°. The test signal was generated and recorded at a sampling rate of 50 kHz (antialias filter at 18.5 kHz, 24 dB/oct roll-off) by a PC board with two-channel AD/DA and a DSP32C floating-point processor. The signal had a duration of 10.2 ms, and a time window was used in order to eliminate reflections. The level of the signals at the reference position was 70 dBA. The average of 50 signals per direction (played with 25-ms intervals in one measurement) was adopted as the HRTF of each ear.

After the free-field measurements the transfer function of the sound presented through headphones (Sennheiser HD 530) was determined. As the transfer from headphone to ear is somewhat dependent on the specific placement of the headphone on the subject's head, ten headphone measurements were performed and the average (calculated in the dB domain) was adopted as the headphone-transfer function.
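The dB-domain averaging of the ten headphone measurements can be illustrated as follows. Averaging in dB corresponds to a per-bin geometric mean of the linear magnitudes, which is less sensitive to a single aberrant placement than a linear average would be; the array shapes and spectra here are only illustrative.

```python
import numpy as np

def average_transfer_db(magnitudes):
    """Average several measured magnitude transfer functions in the
    dB domain. `magnitudes` is an (n_measurements, n_bins) array of
    linear magnitude spectra; returns the dB-domain mean as a linear
    magnitude spectrum."""
    db = 20.0 * np.log10(magnitudes)
    return 10.0 ** (db.mean(axis=0) / 20.0)

# Two toy "placements" of the headphone differing by a gain factor:
m = np.array([[1.0, 2.0, 4.0],
              [4.0, 8.0, 16.0]])
avg = average_transfer_db(m)
print(avg)   # per-bin geometric mean of the two measurements
```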

For the implementation in the 3D auditory display, the free-field HRTFs for each subject were deconvolved by his/her headphone transfer function and adapted to the sampling rate of 12.5 kHz to be used in the listening experiments. Each HRTF was defined as a 128-tap convolution finite impulse response (FIR) filter.
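A minimal sketch of this deconvolution step, assuming it is done by regularized spectral division (the paper does not spell out the numerical method; the FFT length and regularization constant below are assumptions):

```python
import numpy as np

def equalized_hrir(hrir, headphone_ir, n_taps=128, eps=1e-6):
    """Divide a free-field HRTF by the headphone transfer function in
    the frequency domain and return an n_taps FIR approximation.
    A small eps regularizes bins where the headphone response is weak.
    This sketches the kind of processing described in the text, not the
    authors' exact procedure."""
    n = 512  # FFT length (assumption)
    H = np.fft.rfft(hrir, n)
    P = np.fft.rfft(headphone_ir, n)
    corrected = H * np.conj(P) / (np.abs(P) ** 2 + eps)
    ir = np.fft.irfft(corrected, n)
    return ir[:n_taps]

# Sanity check: deconvolving a response by itself gives ~a unit impulse.
ir = np.array([1.0, 0.5, 0.25, 0.125])
eq = equalized_hrir(ir, ir)
print(np.round(eq[:4], 3))
```

Truncating to 128 taps keeps the filter cheap enough for real-time convolution on the DSP hardware of the period while retaining the directionally relevant part of the response.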

B. General HRTFs

One of the goals of the present study was to assess the effect of the use of individualized HRTFs versus nonindividualized (general) HRTFs for speech intelligibility and talker recognition. Therefore, prior to the formal perception experiments described below, a pilot experiment was carried out in order to find the best general HRTF set among a set of eight (i.e., HRTFs from eight different persons). The selection was based on a relatively difficult localization task, presenting four directions on the left and four directions on the right that lie approximately on the cone of confusion, at a virtual distance of 1.14 m (the same task was used by Bronkhorst, 1995). Using computer-generated pink-noise stimuli (50-kHz sampling, 18-kHz bandwidth, 500-ms duration), eight subjects had to indicate the direction of the stimuli by pressing labeled buttons on a hand-held box. The



subjects were different from the persons whose HRTFs were used, and also different from those who participated in the listening experiments described in Secs. II and III. The results (in terms of absolute scores) showed the best set to have on average 53% correct localization. This was not significantly higher than the other seven, but it did have a significantly lower rate of front–back confusions. As a consequence, these HRTFs were selected as the general HRTFs to be used in the subsequent experiments. As with the individualized HRTFs, the general set was implemented as FIR filters with 128 taps at 12.5-kHz sampling frequency.
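Front–back confusions, the criterion used to select the general HRTF set, can be scored by checking whether a response falls near the target's mirror image across the interaural axis. A sketch, with an assumed tolerance:

```python
def is_front_back_confusion(target_az, response_az, tol=15.0):
    """Check whether a response is a front-back confusion: the response
    lies near the target's mirror image across the interaural axis
    (azimuth a degrees maps to 180 - a). Azimuths in degrees, 0 = front,
    positive to the right; the tolerance is an illustrative assumption,
    not a value from the paper."""
    mirror = 180.0 - target_az
    diff = (response_az - mirror + 180.0) % 360.0 - 180.0  # wrap to ±180
    return abs(diff) <= tol

print(is_front_back_confusion(45, 135))   # True: mirrored across the ears
print(is_front_back_confusion(45, 40))    # False: ordinary localization error
```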

Figure 1 gives examples of the individual HRTFs of the 12 subjects that were used in the listening experiments (Secs. II and III) and of the general HRTF. The two panels refer to two different azimuths in the horizontal plane. Particularly up to 4 kHz (the upper frequency in the experiments), all HRTFs are quite similar.

II. SPEECH INTELLIGIBILITY IN A MULTITALKER ENVIRONMENT

A. Speech material

The experiment on the effect of presentation mode on speech intelligibility consisted of two parts: one with monosyllabic words and one with short sentences. In this way, the influence of redundancy (absence or presence of a meaningful context) could be assessed.

1. Words

The word material consisted of 192 meaningful Dutch consonant–vowel–consonant (CVC) syllables (Bosman, 1989). Recordings of the CVC syllables were made with four male talkers and one female talker (25–45 years old). The recordings were made on digital audiotape (DAT) in an anechoic room. One of the male talkers was used as target talker throughout the experiment, i.e., it was the task of the subjects to understand what he said. The other talkers were

FIG. 1. Examples of 12 individual HRTFs (thin lines) and the general HRTF (heavy line) for the right ear at two different azimuth angles. The dotted vertical line marks the upper frequency of 4 kHz employed in the present study.



competing talkers. Target and competing speech were always selected from the same set of CVC syllables.

The digital output from the DAT was stored via an Ariel DAT link into separate computer files (48-kHz sampling rate, 16-bit resolution). In order to equalize the levels of the different talkers, the words were grouped in 16 lists of 12, and the A-weighted speech level of each list was determined (i.e., speech parts that were more than 14 dB below peak level were discarded, cf. Steeneken and Houtgast, 1986). As this method is not applicable for single words, the levels were calculated for 12 concatenated words. The single words were then rescaled so as to have the desired speech level. Any remaining variation in level between the words is to be attributed to the normal variation found in everyday speech. Finally, the words were downsampled to 12.5 kHz (using standard MATLAB software, including appropriate low-pass filtering), and digitally low-pass filtered at 4 kHz.¹
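The level-equalization step can be sketched as follows. This is a simplified stand-in for the cited Steeneken and Houtgast (1986) procedure: frames more than 14 dB below the strongest frame are discarded, and the remainder defines the speech level. The 10-ms frame length and the plain (unweighted) RMS measure are assumptions made for the sketch.

```python
import numpy as np

def speech_level_db(x, fs, floor_db=14.0, frame_ms=10.0):
    """Estimate the speech level of a recording while discarding frames
    more than floor_db below the strongest frame."""
    frame = int(fs * frame_ms / 1000.0)
    n = (len(x) // frame) * frame
    frames = x[:n].reshape(-1, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    db = 20.0 * np.log10(rms + 1e-12)
    keep = db >= db.max() - floor_db          # drop pauses / weak frames
    return 20.0 * np.log10(np.sqrt((frames[keep] ** 2).mean()))

def rescale_to(x, fs, target_db):
    """Scale a signal so its measured speech level hits target_db."""
    gain = 10.0 ** ((target_db - speech_level_db(x, fs)) / 20.0)
    return x * gain

fs = 12500
rng = np.random.default_rng(1)
word = np.concatenate([rng.standard_normal(fs // 4),          # "speech"
                       0.001 * rng.standard_normal(fs // 4)]) # near-silence
scaled = rescale_to(word, fs, target_db=-20.0)
print(round(speech_level_db(scaled, fs), 1))   # -20.0
```

Excluding the pauses matters: a word followed by silence would otherwise measure several dB lower than its audible portion, and the equalization would overboost it.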

2. Sentences

The sentence material for the target talker consisted of 540 everyday sentences of eight to nine syllables. They were read by a trained male talker in a soundproof room, recorded directly onto computer hard disk, using a sampling rate of 44.1 kHz, with a 16-kHz antialiasing low-pass filter, and 16-bit resolution.

The sentences for the competing talkers consisted of a set of similar sentences (Plomp and Mimpen, 1979), read by three male and one female talker. Recordings of these talkers were made on DAT in a soundproof room; the single sentences were stored in separate files via DAT link. For both the target talker and the competing talkers (ages 30–45), all sentences were individually equalized with respect to speech level. Subsequently, all sentences were downsampled to 12.5 kHz and low-pass filtered at 4 kHz.

B. Experimental design

The listening experiment consisted of two identical tests, viz. one with the words and one with the sentences as stimuli. Either listening test was set up in order to assess the following aspects:

(1) Presentation mode, i.e., monaural, binaural, and 3D with individual or general HRTFs;

(2) Number of competing talkers;

(3) In case of 3D presentation, the positions of the target talker and the competing talker(s).

Figure 2 shows a survey of all experimental conditions. The number of competing talkers varied from one to four. In the monaural presentation, this led to four conditions. In the binaural condition a selection of six conditions was made, with two varieties in the cases of three and four competing talkers. In the 3D presentations, the possible positions of the talkers were five directions in the horizontal plane: −90°, −45°, 0°, 45°, or 90° azimuth, where a negative sign refers to the left-hand side and a positive sign to the right-hand side. Thus, the talkers were placed in a virtual space on the front horizontal semicircle around the listener. The talkers' utterances were mixed for presentation to the listeners; in the case of 3D, mixing was done after filtering each source sepa-



rately with the appropriate HRTF. While varying the position of the target talker, the positions of the competing talkers were chosen in such a way that the angle from target to nearest competing talker was either 45°, 90°, or 135° (in one case, 180°). Figure 2 only displays conditions where the target talker is in the right quadrant.
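The filter-then-mix order described above can be sketched as follows, with toy HRIR pairs standing in for measured filters:

```python
import numpy as np

def mix_talkers(signals, hrirs):
    """Mix several talkers for 3D presentation: each source is convolved
    with the HRIR pair for its own virtual position and the binaural
    results are summed, mirroring the filter-first-then-mix order
    described in the text. `hrirs` holds one (left, right) pair per talker."""
    n_out = max(len(s) + max(len(l), len(r)) - 1
                for s, (l, r) in zip(signals, hrirs))
    out = np.zeros((n_out, 2))
    for s, (l, r) in zip(signals, hrirs):
        for ch, h in enumerate((l, r)):
            y = np.convolve(s, h)
            out[: len(y), ch] += y
    return out

rng = np.random.default_rng(2)
talkers = [rng.standard_normal(500), rng.standard_normal(500)]
pairs = [(np.array([1.0]), np.array([0.0, 0.5])),   # toy pair: source at left
         (np.array([0.0, 0.5]), np.array([1.0]))]   # toy pair: source at right
scene = mix_talkers(talkers, pairs)
print(scene.shape)   # (501, 2)
```

Mixing after filtering is essential: each talker must carry the directional cues of its own position, which would be lost if the sources were summed first and filtered once.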

In total, 4 (monaural) + 6 (binaural) + 22 (3D, individualized HRTFs) + 22 (3D, general HRTFs) = 54 conditions were considered per listening test. In each condition, 10 words/sentences were presented. There were not sufficient words to have a different stimulus per condition; most words were presented twice and some three times in order to meet the required 540 CVC stimuli. Single words have a low redundancy (and thus a low chance of being remembered), which is different for the sentences, of which 540 separate stimuli existed.
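The condition count works out as follows:

```python
# Condition count from the text: 4 monaural + 6 binaural
# + 22 (3D, individualized HRTFs) + 22 (3D, general HRTFs).
conditions = {"mon": 4, "bin": 6, "3D-individual": 22, "3D-general": 22}
total = sum(conditions.values())
stimuli_per_condition = 10
print(total, total * stimuli_per_condition)   # 54 conditions, 540 stimuli
```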

C. Procedure

A total of 12 students from the University of Utrecht, with ages ranging from 20 to 26 years, participated as subjects.

FIG. 2. Survey of the conditions for monaural, binaural, and 3D presentation (top view of subject looking ahead), with T as position of the target talker and 1–4 as positions of the competing talkers. Only conditions with the target talker in the frontal position or in the right quadrant are shown; conditions for the left quadrant were created by mirroring across the median axis.



FIG. 3. Mean scores for words (panel A) and sentences (panel B) as a function of number of competing talkers, with presentation mode as a parameter. The curves for 3D give the averages of the different talker configurations; the hatched area around the 3D curves indicates the range of scores for the talker configurations (based on the pooled data for 3D with general and individual HRTFs).


They did not report any hearing deficits and were not recently exposed to loud noises. The subjects were paid for their services.

Subjects were tested in a soundproof room in two sessions, one for the words and one for the sentences. Half of the subjects started with the words, the other half with the sentences. For each session and each subject a different sequence was made for the mode of presentation and talker configuration (Fig. 2). The order in which the stimuli were presented was fixed; the sequence of the presentation modes was varied according to a 4×4 Latin square, to avoid possible order and learning effects. Within each presentation mode the talker configurations were pseudorandomized, in the sense that trials with a fixed position for the target talker were presented in blocks and the number of competing talkers was assigned at random. The number of trials per block varied from 60 to 90, depending on the 3D condition (cf. Fig. 2, with ten words/sentences per condition). Competing talkers 1, 3, and 4 were men; competing talker 2 was a woman. This numbering corresponds to the order in which competing talkers were added. The words/sentences they pronounced were randomly selected in such a way that no two competing talkers pronounced the same word/sentence, which in turn were always different from the one pronounced by the target talker.
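A 4×4 Latin square over the presentation modes guarantees that each mode occurs once in every serial position across four subjects or sessions. The cyclic construction below is one illustrative way to build such a square; the paper does not state which particular square was used:

```python
def latin_square(items):
    """Cyclic Latin square: row r is the item list rotated r steps, so
    every item occurs exactly once in every row and every column."""
    n = len(items)
    return [[items[(r + c) % n] for c in range(n)] for r in range(n)]

modes = ["mon", "bin", "3D-ind", "3D-gen"]
square = latin_square(modes)
for row in square:
    print(row)
```

A balanced (Williams) square, which additionally controls first-order carry-over effects, is another common choice for this kind of counterbalancing.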

At the beginning of a block with a new position for the target talker, subjects would first hear three words/sentences from the target talker alone. This was done to make the subject aware of the target talker's position, so that he/she would focus on that voice and that position during the following trials in a block. These three words/sentences were always the same and did not occur as test trials.

Subjects received the target talker in the right quadrant in the word test and in the left quadrant in the sentence test, or vice versa. In 3D, the competing talkers could come from all directions, as shown in Fig. 2. In terms of Fig. 2, subjects got either the conditions shown there or conditions mirrored across the median axis.

The stimuli were generated by two PC sound boards, each with a DSP32C processor and a two-channel DA convertor. The outputs from the sound boards were mixed (separately for left and right), led through a 4.5-kHz antialiasing low-pass filter (Krohn-Hite 3342) and presented to each sub-



ject through Sennheiser HD530 headphones. The level of presentation was approximately 65 dBA, as verified by measurements with an artificial ear (Brüel & Kjær 4152).

Each stimulus was presented only once, and the task of the subject was to reproduce the word or the sentence of the target talker without a single error. Target talker and competing talkers started speaking at the same moment. Depending on the particular set of stimuli, the offset synchrony would vary up to a few hundred ms. No feedback as to the correctness of the subject's responses was given. Before the actual test, a practice round with 18 stimuli (from the same target talker, but with items that did not occur in the test) was presented in order to familiarize the subjects with the procedure. The practice round consisted of binaural presentations and 3D presentations with general HRTFs.

D. Results

For each condition the percentage of correctly received words/sentences was scored. Subsequent analysis of the results was done by means of an analysis of variance for repeated measures. Unless specified otherwise, tests were performed at the 5% significance level. For the sake of convenience, we will abbreviate the different conditions for monaural, binaural, and 3D presentation as mon, bin, and 3D, respectively, and write the number of competing talkers and the talker configuration in parentheses, with reference to Fig. 2. For example, mon(3) stands for monaural with 3 competing talkers; bin(4B) stands for binaural with 4 competing talkers, configuration B; 3D(1E) stands for 3D presentation, 1 competing talker, configuration E.

First of all, for both words and sentences there were no significant differences between the 3D conditions for individualized and general HRTFs. In view of the similarity of the HRTFs (cf. Fig. 1), this was not surprising. For the subsequent statistical analysis, the results of the two 3D presentations per condition were averaged.

Figure 3 shows the mean percentage of words (panel A) and sentences (panel B) correct for the different presentation modes as a function of the number of competing talkers. The scores decrease as the number of competing talkers increases, as expected. For binaural presentation, the results for three and four competing talkers only refer to conditions bin(3B) and bin(4B), respectively. It appeared that bin(3A) and bin(4A) showed very high scores (at least 92% correct for words), similar to bin(1A). Hence, extra competing talkers in the ear opposite the target talker have no negative effect on intelligibility.

The data for 3D have been averaged over the different talker configurations. These average scores are represented by the black filled circles (individual HRTFs) or squares (general HRTFs), which are connected by a solid line. The hatched area around the 3D curves indicates the range of scores (averaged over individual and general) for the different talker configurations. On average, 3D presentation scores for sentences are 83% and 51% in case of two and three competing talkers, respectively, compared to 43% and 1…% for binaural presentation.

The effects of presentation mode, number of competing talkers, and their interaction are significant. The trends for words and sentences are quite comparable: significantly higher intelligibility scores for 3D presentation, particularly with two and three competing talkers. Even with four competing talkers the benefit of 3D is there, although intelligibility remains rather low. As illustrated by the hatched areas in Fig. 3, the results with 3D presentation differ for the various talker configurations. A discussion of these results can be found in the Appendix.

III. TALKER RECOGNITION IN A MULTITALKER ENVIRONMENT

A. Speech material

The speech material for the talker-recognition experiment consisted of ten sentences which had been read out from Dutch newspaper articles by 12 male and 4 female talkers. The sentences were relatively long fragments, varying from 3.9 to 7.1 s with an average of 6.3 s. They had been recorded directly onto the computer hard disk via an Ariel Proport 565, sampled at 16 kHz with 16-bit resolution. The sentences were digitally low-pass filtered at 4 kHz and downsampled to 12.5 kHz. In the same way as for the intelligibility experiment, all sentences were equalized individually with respect to their A-weighted speech levels.

From the 12 male talkers, two were selected to be the target talkers. One of them had served as target talker of the CVC syllables in the previous intelligibility experiment. As the performance of talker recognition is expected to depend on the particular talker, the second talker was selected to have different objective (long-term average spectrum) and subjective characteristics (vocal effort, monotony, speaking rate). The target talkers read out ten extra sentences that were used to familiarize the subjects with the particular talker ('talker-familiarization material'). These sentences were processed in the same way as the test material.

B. Experimental design

The experimental conditions were identical to those used in the intelligibility experiment. All ten sentences were used once in each condition. With respect to the relevant experimental factors employed (presentation mode, number of competing talkers, talker configuration in 3D), the question was whether the subjects could recognize the target talker and, if present, localize him among a number of concurrent talkers. The identification (or rather verification: target talker present or not) is actually no more than a simple yes/no detection task. Over all experimental conditions, six out of the ten sentences per condition were pronounced by the target talker and four by a different talker.

Apart from recognizing and localizing the target talker, the time listeners needed to come to their decision was recorded as well. In this way an extra differentiation could be made between the conditions, making it possible to test the hypothesis that 3D presentation would demand less time for recognition.

C. Procedure

The same 12 subjects as in the intelligibility experiment participated in this experiment. The order of the stimuli was randomized for each subject. The sequence of the presentation modes was varied according to 4×4 Latin squares, which were different for the two target talkers. Within each presentation mode the talker configuration and left/right presentation were randomized. As in the intelligibility experiment, competing talkers 1, 3, and 4 were men and competing talker 2 was a woman. For each trial, the male competing talkers were randomly selected from ten, the female talker from four. In the case that the target talker was not presented, a substitute was also selected out of the ten male talkers. It was not excluded that more than one talker (either target or competing talker) would pronounce the same sentence. On each trial all talkers started speaking at virtually the same time; mutual delays were within 20 ms.
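The 4×4 Latin-square counterbalancing of the four presentation modes described above can be sketched as follows. The cyclic construction below is one common way to build such a square; it is only an illustration under that assumption, not necessarily the squares used in the study, and the mode labels are ours.

```python
def latin_square(items):
    """Build a cyclic Latin square: row i is the item list rotated by i,
    so every item appears exactly once in each row and each column."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

# Hypothetical labels for the four presentation modes.
orders = latin_square(["mon", "bin", "3D-ind", "3D-gen"])
```

Each row can then serve as the mode order for one subject, so every mode occurs equally often in every serial position.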

The setup and equipment for generating the stimuli were the same as for the intelligibility experiment. The subjects' responses were registered by means of a button box connected to a Tucker Davis PI2 module. This box had five 'target buttons' for the five different directions and one button designated as 'not present.' The task of the subjects was to decide whether the target talker was presented or not, and to press the appropriate button as soon as they were confident. Thus, the response/direction and the reaction time were registered for each condition. Measurement of the reaction time started at the onset of the target sentence.

The subjects were not informed about the presentation mode, nor was any feedback given as to the correctness of their response. The experiment was run in two sessions, one session per target talker. Half of the subjects started with the first target talker, the other half with the second one. Before the actual session, subjects had to listen carefully to the target talker, in order to ensure recognition of the correct talker. A voice-familiarization list of ten sentences (not included in the test) was presented twice, monaurally and via 3D with general HRTFs. After that, subjects were given a practice round of 48 sentences with competing talkers, presented binaurally and in 3D. During this practice round, feedback was given.

D. Results

As it was not the aim to draw conclusions on the recognizability of one particular talker, the data of both target talkers were pooled before any further analysis of the results. This also has the advantage that more data points per condition were available, which increases the validity of the subsequent data processing. From the 12 subjects, one subject lost track of the target talker during the second test and one subject had been pressing continuously on one of the buttons during the first test. Their data were discarded.

1. Recognition

The examination of the recognition scores was done after the raw responses in each condition were transformed to an unbiased percentage of correct responses. This measure uses the theory of signal detection (Macmillan and Creelman, 1991; Gescheider, 1997) for estimating the true recognition scores, i.e., corrected for guessing. The unbiased scores were used for the subsequent analyses of variance for repeated measures.
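The unbiased-score computation is not spelled out in the text. One standard implementation of the Macmillan and Creelman correction for a yes/no task converts hit and false-alarm rates to d' and reports the percent correct of an unbiased observer with the same sensitivity, Phi(d'/2). The sketch below illustrates this in Python; the function name and the log-linear correction for extreme rates are our choices, not the paper's.

```python
from statistics import NormalDist

def unbiased_percent_correct(hits, misses, false_alarms, correct_rejections):
    """Estimate bias-free percent correct for a yes/no detection task.

    Computes d' = z(H) - z(F) and returns 100 * Phi(d'/2), the maximum
    proportion correct an unbiased observer with the same sensitivity
    would achieve (cf. Macmillan & Creelman, 1991).
    """
    nd = NormalDist()
    # Log-linear correction keeps the z-scores finite at rates of 0 or 1.
    h = (hits + 0.5) / (hits + misses + 1.0)
    f = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    d_prime = nd.inv_cdf(h) - nd.inv_cdf(f)
    return 100.0 * nd.cdf(d_prime / 2.0)
```

With this correction, equal hit and false-alarm rates yield exactly 50% (chance), while perfect detection approaches 100% as the trial counts grow.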

Figure 4 shows the recognition scores (direction not necessarily correct) for the four presentation modes as a function of the number of competing talkers. Again, as in the intelligibility experiment, the results for individualized and general HRTFs in the 3D conditions did not show a significant difference. The effects of presentation mode and number of competing talkers are significant, but their interaction is not. Compared to intelligibility (Fig. 3), the effect of competing talkers is (far) less pronounced for talker recognition. The slopes of the four curves are virtually identical. Going from one competing talker to four, the average decrease in performance is from about 88% to 69%.

Because of the virtually identical scores for the two 3D presentation modes, the results for general and individual HRTFs were pooled and the unbiased percent-correct recognition per condition was recomputed prior to the statistical analysis. Subsequent post hoc testing (Tukey HSD) of both factors revealed a slightly better performance for 3D than for monaural and binaural presentation (81% vs 74/75%), and a gradual, significant decrease when going from one to two competing talkers (86% to 80%) and on to three or four competing talkers (72% or 68%). Overall, the configuration of the talkers in 3D presentation has no effect on the listeners' performance (see the Appendix).

FIG. 4. Mean recognition scores for four presentation modes as a function of the number of competing talkers. The curves for 3D give the average over the different talker configurations; the hatched area around the 3D curves indicates the range of scores for the talker configurations (based on the pooled data for 3D with general and individual HRTFs).

2. Localization

In order to analyze the localization scores, only those responses were considered in which the target talker was indeed presented and recognized by the subject. Also for localization, no significant difference was found between the 3D conditions with individualized and general HRTFs, and the averages of the two were used for further analysis.

Figure 5 shows the localization performance for the 3D presentation modes as a function of the number of competing talkers. The main effect is significant, and post hoc analysis (Tukey HSD) revealed significantly decreasing steps going from two to four competing talkers (57% down to 43%). As with recognition (Fig. 4), the decrease in localization performance is quite gradual. Overall, the localization scores are rather poor, with relatively best performance when the target talker is at 0° azimuth (see the Appendix).

3. Reaction times

Analysis of the reaction times was performed on those responses where the target talker was indeed presented and recognized by the subject, but not necessarily correctly localized. Before going into the statistical analysis of the results, there is one aspect of the experimental design that should be mentioned first. The task given to the subjects consisted in fact of two subtasks which were executed more or less simultaneously. The subtasks were (1) recognizing the target talker and, if recognized, (2) determining his location. Pressing a button could only be done as soon as the second subtask was accomplished. In case of monaural presentation, this boils down to a simple yes/no task, since determining the location is trivial (all talkers are presented to one ear). There is a random factor of presenting the signals to the left or right ear, which would make it slightly less trivial. With binaural presentation there are three buttons to be considered (−90°, 90°, and not present), and with 3D presentation six buttons (five directions and not present). It is known that reaction time increases as the number of response alternatives increases (cf. Wickens, 1984). This means that the reaction times of 3D will a priori be longer than those for monaural and binaural presentation. Therefore, no direct comparison between the three presentation modes is possible.

FIG. 5. Average localization scores for 3D presentation modes as a function of the number of competing talkers. The hatched area around the curves indicates the range of scores for the different talker configurations (based on the pooled data for 3D with general and individual HRTFs).

In order to get an estimate of the effect of having to make a choice out of two, three, or six buttons on the reaction times, a subset of the conditions was presented to new subjects. This time the question was simply: Is the target talker present or not? The button box contained only 2 buttons, assigned 'yes' and 'no.' Like in the original recognition test, subjects listened to the voice-familiarization sentences and got a practice round in order to familiarize themselves with the procedure. The conditions tested consisted of the following subset: mon(2), mon(4), bin(1A), bin(3B), bin(4B), 3D(1A), 3D(2D), 3D(2F), 3D(3B), 3D(4A), 3D(4B), and 3D(4C). For the 3D conditions, only those with general HRTFs were used.

The results of this experiment were compared with the results of the subset in the original experiment. It turned out that the differences in reaction times were on average 660 ms for monaural, 720 ms for binaural, and 1360 ms for 3D. These differences are merely a consequence of simplifying the task of the listener. It appears that this even applies to monaural presentation. A separate analysis of variance on the reaction-time differences showed that monaural and binaural were the same, but 3D was significantly higher. On the basis of these results, it was concluded that the a priori longer reaction time for 3D presentation in the original test was at least 600 ms. Therefore, 600 ms was subtracted from the original 3D results, so that the reaction time of mere recognition was obtained, and a comparison between the presentation modes could be made. The correction of 600 ms was fixed for all 3D conditions, as there was no statistical evidence that it depends on the number of competing talkers.

Figure 6 gives the corrected mean reaction times for the four presentation modes as a function of the number of competing talkers. There are significant effects of presentation mode, number of competing talkers, and the interaction between them. An analysis of the main effects per presentation mode revealed that the reaction times of binaural and 3D increase significantly as the number of competing talkers increases. Further analysis (planned comparisons) showed 3D to have the shortest reaction times for up to three competing talkers. Compared to binaural presentation with two to four competing talkers, 3D presentation gives on average 840-ms shorter reaction times. There is hardly an effect of 3D talker configuration on the reaction times (see the Appendix).

FIG. 6. Mean reaction times (with 600-ms correction for 3D) as a function of the number of competing talkers. The curves for 3D give the average over the different talker configurations; the hatched area around the curves indicates the range of scores for the configurations (based on the pooled data for 3D with general and individual HRTFs).

IV. DISCUSSION

The effects of monaural, binaural, and 3D presentation of bandlimited speech were investigated with respect to intelligibility and talker recognition in situations with multiple concurrent talkers. The main question in this paper was to what extent a 3D auditory display can give benefits over monaural and binaural presentation in realistic, critical conditions. In summary, results show that for two or more competing talkers, 3D presentation yields better intelligibility than monaural and binaural presentation, and somewhat better and much faster talker recognition. Another important result is that there is no significant difference in performance between the use of individualized and nonindividualized HRTFs. This is true for all experimental findings in this paper: intelligibility, talker recognition, and talker localization. The absence of a difference is probably due to the bandlimited nature of the speech materials and, consequently, the use of low-pass filtered HRTFs. In addition, for spatial information in the horizontal plane, a detailed definition of the HRTFs is less critical for intelligibility and talker recognition. Hawley et al. (1999) also did not find differences between intelligibility scores of sentences presented in the front horizontal plane by means of loudspeakers or KEMAR recordings, the latter in fact being a kind of nonindividualized HRTFs. But, for localization they did find better absolute scores in the actual sound field than with virtual presentation.

A. Intelligibility

In the monaural condition with only one competing (male) talker, relatively low scores for both words and sentences are found. With an average speech-to-masker ratio of 0 dB, a score for sentences of only 36% (all words correct) indicates that subjects had great difficulty separating the two male voices. In a similar monaural test, Stubbs and Summerfield (1990) found a score of 57% for two concurrent male talkers. However, this score refers to the percentage of keywords reproduced correctly, not entire sentences, for which the score would clearly be lower. A keyword score of about 45% was reported by Hawley et al. (1999) for monaural virtual listening, i.e., monaural listening to KEMAR recordings of two concurrent sentences produced by one male talker at ±90°. Ericson and McKinley (1997) found a relatively high sentence-intelligibility score of 70%–75% for diotic presentation in pink noise [5–10 dB signal-to-noise ratio (SNR)]. Festen and Plomp (1990) found an SRT for sentences of about +1 dB in the case of a male talker masked by his own voice, viz., time-reversed speech. This implies a score below 50% for a speech-to-masker ratio of 0 dB. It should be noted that much lower SRTs are normally obtained when target and competing talker are of opposite sex or when modulated noise is used (Festen and Plomp, 1990; Hygge et al., 1992; Peters et al., 1998). The monaural results for two and more competing talkers are consistent with the steepness of the intelligibility curve near threshold (10%–15%/dB). That is, with two competing talkers, the speech-to-masker ratio is −3 dB and the sentence score drops to 5%, eventually reducing to 0% for additional competing talkers.

The binaural curves for three and four competing talkers in Fig. 3 are based on the configurations bin(3B) and bin(4B), i.e., with two competing talkers at the same ear as the target talker. There is no significant difference between binaural scores for three and four competing talkers, indicating that two competing talkers instead of one at the other ear does not decrease performance. One can even state that the presence or absence of competing speech in the other ear does not influence intelligibility at all, in view of the results for the monaural conditions: comparing mon(1) with bin(2B) and mon(2) with bin(3B) or bin(4B) justifies this conclusion.

We see that 3D presentation gives the highest intelligibility for more than one competing talker. For one competing talker, similar results can be obtained with binaural presentation. The findings for one and two competing speakers are essentially in agreement with results on spatial separation found by Yost et al. (1996) for three and two sources, respectively. Although there are situations in which binaural presentation with three or four competing talkers is superior (viz., when only the target talker is presented to one ear and all competing talkers to the other ear), one can in general conclude, when drawing horizontal iso-intelligibility lines in Fig. 3, that binaural presentation with two or three competing talkers yields about the same intelligibility as 3D presentation with three or four competing talkers, respectively. So 3D presentation allows for an extra competing talker compared to binaural presentation and for two more talkers compared to monaural presentation. Moreover, 3D presentation makes it possible to follow any of the talkers more easily (although certain azimuths have a slight advantage, see the Appendix).

In the present experimental design, the target talker and each of the competing talkers were given equal speech levels. This means that with two, three, and four competing talkers, the speech-to-masker ratio decreases, on average, by 3, 4.8, and 6 dB, respectively. Apart from the increase in background level, a second aspect plays a significant role, viz., the change in spectro-temporal properties from a single voice to voice babble. Both factors increase the speech-reception threshold (cf. Festen and Plomp, 1990). A number of 3D configurations of our experiment can be compared with the results for KEMAR recordings that Bronkhorst and Plomp (1992) obtained from normal-hearing subjects, i.e., their conditions with sentences presented at 0° azimuth in fluctuating speech noise at ±30° and ±90°. We used different sentence material (and a different talker) for the target speech and HRTFs instead of artificial-head filtering, but we may get some insight as to the effect of using real instead of simulated (modulated noise) speech. Correcting for the total level of competing speech, as mentioned above, Bronkhorst and Plomp found SRTs of −20.0 and −11.2 dB for configurations 3D(1G) and 3D(2G), respectively. In view of the psychometric curve of speech-to-noise ratio vs intelligibility, which is also valid for fluctuating noise (Festen and Plomp, 1990), these SRTs imply sentence-intelligibility scores near 100% for equally loud target and competing speech. The scores found in the present study amount to 95% and 72%, respectively. Hence, one may tentatively conclude that having two talkers instead of noises [configuration 3D(2G)] makes the task more difficult. This finding is corroborated when taking configuration 3D(4G), for which Bronkhorst and Plomp (with two maskers at ±30° instead of ±45° in the present case) come to a level-corrected SRT of −4.0 dB, corresponding to a sentence intelligibility of about 65%.² This is much higher than the score of 16% we find with four competing talkers.
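The speech-to-masker ratios quoted above for equally loud talkers follow from simple power addition of the maskers: with n independent competing talkers at the target level, the ratio is −10·log10(n) dB, i.e., about −3, −4.8, and −6 dB for n = 2, 3, and 4. A minimal check in Python (the function name is our own):

```python
import math

def speech_to_masker_ratio_db(n_competing):
    """Speech-to-masker ratio when the target talker and each of
    n equally loud, independent competing talkers have the same
    long-term level: the masker powers add, giving -10*log10(n) dB."""
    return -10.0 * math.log10(n_competing)

ratios = [speech_to_masker_ratio_db(n) for n in (2, 3, 4)]
```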

If we consider the gain of 3D compared to monaural presentation in the above three conditions, we get difference scores (for sentences) as shown in the third column of Table I. The same data for the fluctuating noise maskers of Bronkhorst and Plomp (1992) are shown relative to their conditions of frontal presentation, that is, the conditions of KEMAR recordings of all sources recorded at 0° azimuth, presented binaurally to the listeners. The difference-SRT values are compensated for an assumed 1-dB gain for diotic vs monotic presentation. From these difference-SRTs, an estimate is made of the expected intelligibility score (right column), under the assumption of a slope of 10%–15% per dB. Given the differences between the two studies, variance in the data, and the assumptions made, we may conclude that the scores are more or less in line. One has to bear in mind, though, that the use of multiple fluctuating noise maskers is not always a good predictor for modeling multiple talkers, particularly when they are of the same sex. It does not account for segregation of voices and for the distraction that may occur due to simultaneous intelligibility of two or more talkers (cf. Festen and Plomp, 1990; but see Duquesnoy, 1983).
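Given the 10%–15%/dB slope assumption stated above, the estimated score differences in the right-hand column of Table I can be approximated directly from the difference-SRTs. The helper below is our own illustration, not code from the study; its brackets come out close to, though not exactly matching, the rounded ranges printed in the table.

```python
def estimated_score_range(diff_srt_db, slope_low=10.0, slope_high=15.0):
    """Bracket the expected intelligibility-score difference (%) for a
    given SRT difference (dB), assuming a locally linear psychometric
    function with a slope of 10-15% per dB, capped at 100%."""
    return (min(100.0, diff_srt_db * slope_low),
            min(100.0, diff_srt_db * slope_high))

# Difference-SRTs from Table I: 7.0, 3.6, and 0.9 dB give roughly
# (70, 100), (36, 54), and (9, 13.5), in line with the tabulated
# >70%, 40%-55%, and 10%-15%.
```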

B. Talker recognition

For all presentation modes, the recognition scores depend less on the number of competing talkers than intelligibility scores. The recognition scores we found for 3D presentation are slightly higher than for monaural or binaural presentation. They vary from 89% with one to 71% with four competing talkers (Fig. 4). These relatively high values indicate that talker recognition is an easier task than word or sentence reception.

TABLE I. Comparison of the differences in scores between monaural and 3D presentation for selected conditions in the present study with the difference in estimated scores (based on differences in SRT) between frontal and spatialized presentation of KEMAR recordings in a study by Bronkhorst and Plomp (1992). Maskers refer to competing talkers or modulated noise, respectively.

# Maskers   Condition   Present study:   Bronkhorst and Plomp(a):
                        Diff. score      Diff. SRT (dB)   Est. diff. score
1           3D(1G)      59%              7.0              >70%
2           3D(2G)      67%              3.6              40%–55%
4           3D(4G)      16%              0.9              10%–15%

(a) Differences in SRT including a 1-dB threshold increase for diotic instead of monaural presentation.

Recognizing the target talker may be described as waiting for the moment that spectral and/or temporal dips occur in the competing voices. In a way this process is similar to understanding a message, but for talker recognition it may end at an earlier stage, as one does not need to capture the entire utterance of the target talker. As the number of competing talkers increases, the background level and the spectro-temporal 'filling' increase, making recognition gradually more difficult.

While the effect of recognition per se (target talker present or not) appears not to be very dependent on the number of competing talkers, a different picture arises for the time needed for actual recognition (Fig. 6). For both binaural and 3D presentation, reaction times increase significantly when the number of competing talkers increases. Subjects apparently have to wait longer for the moment to 'catch a glimpse' of the target talker. In other words, listeners still have a good chance to be sure they heard the target talker, but they take their time. The results on reaction times show there is a clear release from masking when three or more voices are spatially separated, yielding an average difference of over 800 ms compared with binaural presentation. With only one competing talker, 3D presentation with ultimate spatial separation, 3D(1A), does not do better than binaural presentation, bin(1A). The difference in reaction time of about 150 ms, as found in the original test (after 600-ms correction) and in the retest, appeared not to be statistically significant. In general, for one competing talker there is no significant difference between bin(1A) and the average of the 3D talker configurations (as plotted in Fig. 6).

For the monaural presentation, an asymptote in reaction time is already reached with two competing talkers. At first glance it looks like subjects probably had great difficulty recognizing the talker in these conditions and more or less 'gave up' by pressing a button. But this explanation is contradicted by the monaural recognition scores themselves (66% or more correct, Fig. 4), which are not significantly different from the binaural scores. An alternative explanation may be that recognition with monaural presentation involves a relatively easy task of selective attention, whereas with binaural or 3D presentation the listener uses divided attention (cf. Yost et al., 1996). It should be noted that reaction times for 3D presentation found in this study underestimate those that will occur in realistic situations, since the location of the talker will then generally be known in advance.

C. Localization

Absolute localization of the target talker seems generally poor and becomes gradually more difficult as the number of competing talkers increases (Fig. 5). Having a closed-response set with five alternatives, the localization scores are on average around 50%. The finding that even for localization the use of either individualized or nonindividualized HRTFs does not yield significantly different results is not really surprising in view of the relatively low intersubject variability in HRTFs for frequencies below 4 kHz (cf. Fig. 1; see also Shaw, 1974; Møller et al., 1995). The bandlimited nature of the speech signals and the consequent use of low-pass filtered HRTFs eliminates certain perceptual cues. These cues are primarily associated with the perception of elevation and externalization (cf. Bronkhorst, 1995). The former aspect is not particularly relevant in the present experiments, as all sources were presented on the horizontal plane. The latter point may however have played a role in the relatively poor localization scores, as some subjects mentioned hearing the voices close to their head and not at a distance.

Begault and Wenzel (1993) studied horizontal-plane localization of wideband speech stimuli of a single talker presented to inexperienced listeners, using nonindividualized HRTFs and open-set responses. They report that up to 46% of the stimuli were heard inside the head and found an average error angle of 28°. They conclude that most listeners can obtain useful azimuth information, the results being comparable with those for broadband noise stimuli. This conclusion was also drawn by Ricard and Meirs (1994) for synthesized, bandlimited speech, and by Gilkey and Anderson (1995), who used real sources and compared performance for speech and click stimuli. The latter study only reports left-right errors for four subjects, which were on average 16°.

In studies with simultaneous speech sources in the virtual horizontal plane, Hawley et al. (1999) found on average 72% correct for 2–4 concurrent sentences, having 7 response alternatives. In those experiments the listeners knew the sentence in advance and only had to pinpoint its location. In another study with bandlimited speech in the horizontal plane, Yost et al. (1996) had subjects localize all two or three words that were presented simultaneously. With a closed-response set of seven azimuths, scores were over 80% for two talkers and about 60% for three talkers. These are relatively high scores, but one has to keep in mind that the subjects could listen to the presentation as often as they liked. Our score of 51% (three competing talkers) seems comparatively low, but the task for our listeners was more complex, i.e., first detecting the target talker and subsequently determining his location.

In summary, localizing a target talker among a number of competing talkers when signals are bandlimited does yield relatively poor scores, but not to a dramatic extent in the light of the results reported in the literature. Besides, spatial separation as obtained in a 3D auditory display is more important for intelligibility and talker recognition than accurate localization.

V. CONCLUSIONS

From the results of the experiments in this paper, using bandlimited (4-kHz) speech signals and truncated HRTFs, the following conclusions can be drawn:

(1) There is no difference in performance between a 3D auditory display based on individualized HRTFs and general HRTFs. This conclusion applies to all scores assessed for speech intelligibility, talker recognition (including the time required for recognition), and talker localization. This means that no individual adaptation of a bandlimited (4-kHz) communication system is needed in a practical application of an auditory display with many users.

(2) Compared to conventional monaural and binaural presentation, 3D presentation yields better speech intelligibility with two or more competing talkers, in particular for sentence intelligibility. Equivalent performance is achieved with 3D presentation compared to binaural presentation when one talker is added, and compared to monaural presentation when two or three talkers are added. However, in specific conditions (all competing talkers on the side opposite the target talker) binaural presentation may be superior to 3D. Within the 3D configurations examined, intelligibility is highest when the target talker is at −45° or 45° azimuth.

~3! Talker-recognition scores are higher for 3D than for monaural and binaural presentation, but the differences are small. Recognition scores depend less on the number of competing talkers than intelligibility scores do. The virtual positions of the talkers in 3D are not a relevant factor.

~4! For binaural and 3D presentation, the time required to correctly recognize a talker increases with the number of competing talkers. For two or more competing talkers, 3D presentation requires significantly less time than binaural presentation.

~5! Absolute localization of a talker is relatively poor and becomes gradually more difficult as the number of competing talkers increases.

ACKNOWLEDGMENTS

This work was supported by the Royal Netherlands Navy. We wish to thank Dr. Niek Versfeld and colleagues of the Experimental Audiology Group at the ENT Department of the Free University Hospital in Amsterdam for providing part of the sentence material. The comments and suggestions of the three anonymous reviewers on earlier versions of this paper are greatly appreciated.

APPENDIX: EFFECT OF TALKER CONFIGURATION IN 3D PRESENTATION

In both the intelligibility and the recognition experiments, the configuration of the talkers in the case of 3D presentation was a factor in the design. In this Appendix we will discuss in more detail two situations with respect to the minimum angle between the target talker and the nearest competing talker, viz., 90° or 45°. These conditions will be referred to as 90°-tca and 45°-tca, respectively ~tca = target-competing talker angle!. The second aspect in the talker configuration is the azimuth of the target talker, which could have an absolute value of 0°, 45°, or 90°. Planned comparisons at a 5% significance level were performed for the statistical analyses. Plots are shown only for the cases for which relatively substantial differences between configurations were found.

Figure A1 displays the mean scores for words and sentences, in 90°-tca and 45°-tca, as a function of the azimuth of the target talker. The 90°-tca conditions only occurred with one or two competing talkers, and for one azimuth ~90°! with three competing talkers ~filled single diamond!. Both words and sentences show the same trend regarding the azimuth of the target talker. In the 90°-tca conditions with only one competing talker, the target talker's position is not relevant; with two competing talkers, a significantly worse score is found if the target talker is placed at 0°. The best target azimuth is 45°. In the 45°-tca conditions highest scores are found at 45° or 0° with one competing talker, at 45° with two competing talkers, at 45° or 0° with three competing talkers, and at any azimuth with four competing talkers. Hence, it would be best to place the target talker at 45°. Comparison of the results of 90°-tca and 45°-tca for one and two competing talkers generally shows that the scores are equal or higher for 90°-tca, depending on speech material and the azimuth of the target talker. Although there is no clear uniform result, enlarging the minimum angle between target talker and nearest competing talker can have a positive effect on speech intelligibility ~cf. Peissig and Kollmeier, 1997!.

For talker recognition and reaction times, no or minor effects of configuration are found, for which no plots are shown here. Only a small difference in recognition is found between 3D(3A) and 3D(3B). That is, when the target talker is at 90°, there is a slight benefit to having the nearest of the three competing talkers 90° apart. Comparison between the 90°-tca and 45°-tca conditions for one or two competing talkers shows no significant differences. Hence, talker recognition is not critically dependent on the spatial segregation of target talker and competing talkers. As to the reaction times, only for 45°-tca with four competing talkers is a significantly shorter reaction time found, for a target position of 90°. Comparison between the 90°-tca and 45°-tca conditions for one or two competing talkers shows no significant differences.

Figure A2 shows the localization scores in conditions 90°-tca and 45°-tca as a function of the azimuth of the target talker. For both 90°-tca and 45°-tca there is a significant effect of target position, with 0° giving the best performance.

FIG. A1. Mean intelligibility scores for words and sentences as a function of the azimuth of the target talker, with number of competing talkers and 45°- or 90°-tca as parameters.

The distribution of the localization errors did not show any response bias toward 0°. Comparison between the 90°-tca and 45°-tca conditions for one or two competing talkers shows no significant differences.

1 Downsampling from 44.1 to 12.5 kHz was done by resampling with a ratio of 2/7. This actually results in a sampling rate of 12.6 kHz. Playing it at a rate of 12.5 kHz results in a slight, negligible mismatch ~0.8%!.
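The arithmetic behind this footnote can be verified in a few lines; the sketch below is ours (variable names are not from the paper) and simply reproduces the rate and mismatch calculation with exact rational numbers:

```python
# Check of the resampling arithmetic in footnote 1: downsampling a
# 44.1-kHz signal by a rational factor of 2/7, then playing it at 12.5 kHz.
from fractions import Fraction

original_rate = 44100                          # Hz
actual_rate = original_rate * Fraction(2, 7)   # exact result: 12600 Hz
playback_rate = 12500                          # Hz
mismatch = actual_rate / playback_rate - 1     # relative playback-rate error
print(actual_rate, float(mismatch))            # 12600 0.008, i.e., 0.8%
```

The mismatch of 1/125 = 0.8% uniformly shifts pitch and tempo, which the footnote treats as negligible.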

2 We assume an SRT for sentences in noise of −5 dB and a slope of the psychometric function of 15% per dB around the 50% point ~Plomp and Mimpen, 1979!. Hence, an SRT of −4 dB results in an intelligibility score of 65%.
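The step from an SRT of −4 dB to a 65% score follows from a linear approximation of the psychometric function near its 50% midpoint; the short sketch below (our notation, not the paper's) makes the calculation explicit:

```python
# Worked arithmetic behind footnote 2, assuming the psychometric function
# is linear near its 50% point (slope from Plomp and Mimpen, 1979).
srt_ref = -5.0       # reference SRT for sentences in noise, dB (50% point)
slope = 15.0         # slope of the psychometric function, % per dB
srt_measured = -4.0  # SRT obtained in the condition under discussion, dB

# Score on the reference function at the measured SRT:
score = 50.0 + slope * (srt_measured - srt_ref)
print(f"{score:.0f}%")  # 65%
```

Each dB of SRT elevation above the reference thus costs 15 percentage points of intelligibility on this linearized function.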

Aoshima, N. ~1981!. ``Computer-generated pulse signal applied for sound measurement,'' J. Acoust. Soc. Am. 69, 1484–1488.

Begault, D. R. ~1995!. ``Virtual acoustic displays for teleconferencing: Intelligibility advantage for telephone grade audio,'' Audio Engineering Society 98th Convention Preprint 4008 ~AES, New York!.

Begault, D. R., and Wenzel, E. M. ~1993!. ``Headphone localization of speech,'' Hum. Factors 35, 361–376.

Begault, D. R., and Erbe, T. ~1994!. ``Multichannel spatial auditory display for speech communication,'' J. Audio Eng. Soc. 42, 819–826.

Bosman, A. J. ~1989!. ``Speech perception by the hearing impaired,'' Ph.D. dissertation, University of Utrecht.

Bronkhorst, A. W., and Plomp, R. ~1992!. ``Effect of multiple speechlike maskers on binaural speech recognition in normal and impaired hearing,'' J. Acoust. Soc. Am. 92, 3132–3139.

Bronkhorst, A. W. ~1995!. ``Localization of real and virtual sound sources,'' J. Acoust. Soc. Am. 98, 2542–2553.

Cherry, E. C. ~1953!. ``Some experiments on the recognition of speech, with one and with two ears,'' J. Acoust. Soc. Am. 25, 975–979.

Crispien, K., and Ehrenberg, T. ~1995!. ``Evaluation of the cocktail-party effect for multiple speech stimuli within a spatial auditory display,'' J. Audio Eng. Soc. 43, 932–941.

Duquesnoy, A. J. ~1983!. ``Effect of a single interfering noise or speech source on the binaural sentence intelligibility of aged persons,'' J. Acoust. Soc. Am. 74, 739–743.

Ericson, M. A., and McKinley, R. L. ~1997!. ``The intelligibility of multiple talkers separated spatially in noise,'' in Binaural and Spatial Hearing in Real and Virtual Environments, edited by R. H. Gilkey and T. R. Anderson ~Erlbaum, Mahwah, NJ!, Chap. 32, pp. 701–724.

FIG. A2. Mean localization scores as a function of the azimuth of the target talker, with number of competing talkers and 45°- or 90°-tca as parameters.

Festen, J. M., and Plomp, R. ~1990!. ``Effects of fluctuating noise and interfering speech on the speech reception threshold for impaired and normal hearing,'' J. Acoust. Soc. Am. 88, 1725–1736.

Gescheider, G. A. ~1997!. Psychophysics: The Fundamentals ~Erlbaum, Mahwah, NJ!.

Gilkey, R. H., and Anderson, T. R. ~1995!. ``The accuracy of absolute localization judgments for speech stimuli,'' J. Vestib. Res. 5, 487–497.

Hartung, K. ~1995!. ``Messung, Verifikation und Analyse von Aussenohrübertragungsfunktionen,'' in Fortschritte der Akustik - DAGA '95 ~DPG-GmbH, Bad Honnef, Germany!, pp. 755–758.

Hawley, M. L., Litovsky, R. Y., and Colburn, H. S. ~1999!. ``Speech intelligibility and localization in a multi-source environment,'' J. Acoust. Soc. Am. 105, 3436–3448.

Hygge, S., Rönnberg, J., Larsby, B., and Arlinger, S. ~1992!. ``Normal-hearing and hearing-impaired subjects' ability to just follow conversation in competing speech, reversed speech, and noise backgrounds,'' J. Speech Hear. Res. 35, 208–215.

MacMillan, N. A., and Creelman, C. D. ~1991!. Detection Theory: A User's Guide ~Cambridge University Press, Cambridge!.

Møller, H., Sørensen, M. F., Hammershøi, D., and Jensen, C. B. ~1995!. ``Head-related transfer functions of human subjects,'' J. Audio Eng. Soc. 43, 300–320.

Peissig, J., and Kollmeier, B. ~1997!. ``Directivity of binaural noise reduction in spatial multiple noise-source arrangements for normal and impaired listeners,'' J. Acoust. Soc. Am. 101, 1660–1670.

Peters, R. W., Moore, B. C. J., and Baer, T. ~1998!. ``Speech reception thresholds in noise with and without spectral and temporal dips for hearing-impaired and normally hearing people,'' J. Acoust. Soc. Am. 103, 577–587.

Plomp, R., and Mimpen, A. M. ~1979!. ``Improving the reliability of testing the speech reception threshold,'' Audiology 18, 43–52.

Pollack, I., Pickett, J. M., and Sumby, W. H. ~1954!. ``On the identification of talkers by voice,'' J. Acoust. Soc. Am. 26, 403–406.

Posselt, C., Schröter, J., Opitz, M., Divenyi, P. L., and Blauert, J. ~1986!. ``Generation of binaural signals for research and home entertainment,'' in Proceedings of the 12th International Congress on Acoustics ~Toronto, Canada!, Vol. 1, B1–6 ~Beauregard Press, Canada!.

Pralong, D., and Carlile, S. ~1994!. ``Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature in-ear recording system,'' J. Acoust. Soc. Am. 95, 3435–3444.

Pralong, D., and Carlile, S. ~1996!. ``The role of individualized headphone calibration for the generation of high fidelity virtual auditory space,'' J. Acoust. Soc. Am. 100, 3785–3793.

Ricard, G. L., and Meirs, S. L. ~1994!. ``Intelligibility and localization of speech from virtual directions,'' Hum. Factors 36, 120–128.

Shaw, E. A. G. ~1974!. ``Transformation of sound pressure level from the free field to the eardrum in the horizontal plane,'' J. Acoust. Soc. Am. 56, 1848–1861.

Steeneken, H. J. M., and Houtgast, T. ~1986!. ``Comparison of some methods for measuring speech levels,'' Report IZF 1986-20, TNO Institute for Perception, Soesterberg, The Netherlands.

Stubbs, R. J., and Summerfield, Q. ~1990!. ``Algorithms for separating the speech of interfering talkers: Evaluations with voiced sentences, and normal-hearing and hearing-impaired listeners,'' J. Acoust. Soc. Am. 87, 359–372.

Wenzel, E. M., Arruda, M., Kistler, D. J., and Wightman, F. L. ~1993!. ``Localization using nonindividualized head-related transfer functions,'' J. Acoust. Soc. Am. 94, 111–123.

Wickens, C. D. ~1984!. Engineering Psychology and Human Performance ~Merrill, Columbus, OH!, Chap. 10, pp. 337–376.

Wightman, F. L., and Kistler, D. J. ~1989a!. ``Headphone simulation of free-field listening: I. Stimulus synthesis,'' J. Acoust. Soc. Am. 85, 858–867.

Wightman, F. L., and Kistler, D. J. ~1989b!. ``Headphone simulation of free-field listening: II. Psychophysical validation,'' J. Acoust. Soc. Am. 85, 868–878.

Yost, W. A., Dye, R. H., and Sheft, S. ~1996!. ``A simulated cocktail party with up to three sound sources,'' Percept. Psychophys. 58, 1026–1036.
