8/3/2019 Wa Ms Thesis
Investigation of Voice Stage Support: Subjective Preference Test
Using an Auralization System for Self-Voice
by
Cheuk Wa Yuen
A Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the degree of
MASTER OF SCIENCE IN BUILDING SCIENCES,
CONCENTRATION IN ARCHITECTURAL ACOUSTICS
Approved:
_________________________________________Professor Paul T. Calamia, Thesis Adviser
_________________________________________Professor Ning Xiang, Ph.D.
Rensselaer Polytechnic Institute
Troy, New York
June 2007 (For Graduation August 2007)
Copyright 2007
by
Cheuk Wa Yuen
All Rights Reserved
CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENT
ABSTRACT
1. Introduction
1.1 Aim of the Thesis
1.2 Historical Review
1.2.1 Stage Acoustics and Support
1.2.2 Previous Research on Subjective Preferences in Stage Acoustics
1.2.3 Self-Voice Perception
1.2.4 Experimental Setup in Previous Self-voice Auralization and Related Sound Field Simulation
1.3 Thesis Outline
2. Self-voice Auralization System: Design and Implementation
2.1 Experimental Design Concept
2.2 Measurement Setup
2.3 Binaural Real-time Auralization System
2.3.1 System Overview
2.3.2 Implementation of Direct Air Conduction Modeling
2.3.3 Implementation of Indirect Air Conduction Modeling
2.3.4 Implementation of Headphone Equalization
2.4 BRIR Acquisition System
2.5 Experimental Procedures in Subjective Test
2.5.1 Test Subject Conditioning
2.5.2 Use of Dramatic Text in Study of Actors
2.5.3 Verifying the Consistency of Self-voice Stimuli by Monitoring the Pace of Speech
2.6 HATS Verification Tests
2.6.1 Binaural Microphones
2.6.2 Artificial Mouth
2.7 Subjective Test on Naturalness of the Auralization System
2.7.1 Evaluation on Naturalness of CIL Filter Delay Time
2.7.2 Evaluation on Naturalness of CIL Filter Level
2.8 Discussion
3. Subjective Preference Tests on Stage Acoustic Conditions for Actors
3.1 Introduction
3.2 Impulse Response Acquisition
3.3 Subjective Test Design
3.3.1 Preference ratings from paired comparison
3.3.2 Test procedures
3.4 Paired Comparison Test on Stage Locations
3.4.1 Preference study of stage locations when head orientation is look center
3.4.2 Preference study of stage locations when head orientation is look left
3.4.3 Preference study of stage locations when head orientation is look right
3.4.4 Discussion on Stage Location Preference
4. DISCUSSIONS
4.1 Reflections on subjective preferences on stage acoustic conditions
4.2 Accuracy in subjective testing
4.3 Potential ways of improving voice stage support in proscenium theaters
LIST OF TABLES
Table 1. Direct AC filter settings (Rane PE-15)
Table 2. CIL filter settings (digital parametric equalizer on 02R)
Table 3. Headphone compensation filter settings (02R master output)
Table 4. Delay times in the evaluation test of naturalness of the Direct AC insertion loss compensation (CIL) filter. *The current system has a processing delay of 0.14 ms with a setting of 0.01 ms on the DN716. **Pörschmann's tested delay times were based on taps of a 48 kHz Tucker-Davis DSP system, and are represented here in milliseconds for convenience of comparison.
LIST OF FIGURES
Figure 1. Components of perception of self-voice
Figure 2. Earthworks M30 omni-directional microphone
Figure 3. Countryman B3 miniature omni-directional microphone
Figure 4. Transfer function from Earthworks M30 to Countryman B3
Figure 5. Binaural self-voice auralization system block diagram
Figure 6. Test subject with microphone and pop filter
Figure 7. MRP-to-ERP transfer function & Direct AC filter using PE-15 parametric equalizer (1/3-octave smoothing)
Figure 8. Setup for measuring the headphone's insertion loss using an isolation tube
Figure 9. Isolation tube used in insertion loss measurement
Figure 10. Waves IR1 Convolution Reverb, loaded with a 3-second unit sample sequence
Figure 11. Impulse response trimming before importing to IR-1
Figure 12. Headphone response and compensation filter (02R master output)
Figure 13. Frontal plane section of HATS, showing binaural microphones and the related fittings (adapted from manufacturer's manual)
Figure 14. Median plane section of HATS, showing the artificial mouth
Figure 15. Binaural room impulse response acquisition system block diagram, showing how the binaural ears and artificial mouth of the HATS are connected
Figure 16. Example plot of effective duration of the running autocorrelation function of two recordings of the same text at different paces, showing the use of (τe)min as a temporal reference for monitoring subjective testing
Figure 17. Verification test of HATS in anechoic chamber at General Electric Laboratory (NY)
Figure 18. Frequency response comparison between HATS binaural microphones
Figure 19. Voice directivity of HATS: (a) horizontal plane, (b) vertical plane, in 4 octave bands (250 Hz, 500 Hz, 1 kHz & 2 kHz)
Figure 20. Comparison of voice directivity, in 3 octave bands (500 Hz, 1 kHz & 2 kHz)
Figure 21. On-axis frequency response of HATS artificial mouth. Overlays of no smoothing and 1/6-octave smoothing
Figure 22. Frequency response of B&K Artificial Mouth Type 4128C (adapted from manufacturer's datasheet)
Figure 23. MRP-to-ERP transfer function of HATS (1/3-octave smoothing)
Figure 24. Averaged frequency response of MRP-to-ERP (Direct AC) of 18 human subjects. The grey area marks the standard deviation. (Adapted from Pörschmann [2000])
Figure 25. Subjective evaluation of naturalness of delay time in Direct AC auralization. Mean score and error rate of all subjects (95% confidence)
Figure 26. Architectural plan of the main space at the RPI Playhouse. Dimensions in inches. Blue lines are dimensional guides. (CAD drawing courtesy of RPI Building Management)
Figure 27. Stage locations where BRIRs were measured. Dashed line labeled "CL" is the center line of the stage across the proscenium
Figure 28. Top view of the HATS showing 3 different head orientations in binaural room impulse response acquisition
Figure 29. Preference scores of different stage locations when head orientation is "look center" (Conditions - A: DSC, B: DSR, C: CSC, D: CSR). Normalized scores of 13 individual subjects A-M (blue bar graph) and overall average score of all subjects (red bar graph)
Figure 30. Interaural cross-correlation functions in 100 ms intervals for conditions A-D when head orientation is "look center"
Figure 31. VSS plot of binaural ears in four conditions (A: DSC, B: DSR, C: CSC, D: CSR) when head orientation is "look center"
Figure 32. Preference scores of different stage locations when head orientation is "look left" (Conditions - A: DSC, B: DSR, C: CSC, D: CSR). Normalized scores of 13 individual subjects A-M (blue bar graph) and overall average score of all subjects (red bar graph)
Figure 33. Interaural cross-correlation functions in 100 ms intervals for conditions A-D when head orientation is "look left"
Figure 34. VSS plot of binaural ears in four conditions (A: DSC, B: DSR, C: CSC, D: CSR) when head orientation is "look left"
ACKNOWLEDGMENT
I am grateful to Professor Paul Calamia for his willingness to share his knowledge and wisdom. I would also like to thank Dr. Ning Xiang for his insights and meticulous training in laboratory work and research, and Dr. Jonas Braasch for an enjoyable class in psychoacoustics.
This research would not have been possible without the help of Mr. David Larson and his generosity in lending me the Brüel & Kjær head and torso simulator. My gratitude also goes to Mr. Bob Hedeen at General Electric Laboratory (NY) for letting me use the anechoic chamber for numerous acoustical measurements.
Thunderous applause goes to all participating actors in this research. Speaking in an isolated environment without interaction with other actors or an audience was the most difficult experience for the artists. Your patience and concentration were professional. Without your support, there would be no study of voice stage support.
My study in the United States was made possible by the support of the prestigious Sir Edward Youde Memorial Fund Fellowship in Hong Kong. I hereby send my dedication to the late Sir Edward Youde. His wife Pamela Youde's continuous encouragement means a lot to me. Respect also goes to all officers of the fellowship council, especially Ms. Carnelia Fung.
I sincerely thank all my mentors and the incredible educators from whom I learned so much throughout the years at Rensselaer Polytechnic Institute, the California Institute of the Arts and the Hong Kong Academy for Performing Arts.
Last but not least, I thank my family for their continuous support. This thesis is dedicated to my parents, my late grandmother Fung Yau Hau, and my brother Cheuk Chi Yuen, who is recovering from a speech disorder after a stroke in summer 2006. His speech therapy sessions of repetitive reading are an irony in light of my research.
Not everything that can be counted counts, and not everything that counts
can be counted. [4]
Albert Einstein
ABSTRACT
The human voice plays an integral role in dramatic art. The performance of singers and actors, who perceive their voice through their ears as well as through bone conduction, is highly related to the acoustic condition they are in. Due to the proximity of the sound source and the spectral difference of transmission through the skull as compared to air, a support condition different from that for musical instrumentalists is needed. This thesis aims at initiating a standardization of methodology in subjective preference testing for voice stage support in order to collect more data for statistical analysis. A proposal of an acquisition/auralization system for self-voice and a set of subjective test procedures are presented. The subjective evaluation of the system is compared to previous designs reported in the literature, and the implementation is validated. A small playhouse has been measured and auralized using the system described, and subjective-preference tests have been conducted with 13 professionally trained actors. Their preferred stage-acoustic conditions (in relation to locations on stage and head orientations) are reported. The results show potential directions for further investigations and identify the necessary concerns in developing an objective parameter for voice stage support.
1. Introduction

In the course of theater history, from classical Greek drama to Shakespearean plays to Ibsen's naturalistic plays to 20th-century Broadway rock musicals, the human voice has always been an integral part of the dramatic art. The success of this art largely relies on how well the audience understands the words voiced by the actors. This rule has not changed for more than 2,300 years since the days of the Lycurgan Theater of Dionysus in Greece (the first great permanent theater in recorded history) [1].
While most contemporary architects and acousticians focus on auditorium acoustics in the design of performance spaces, the special acoustical needs of musicians, singers and actors are often less emphasized. Although acoustic shells have been developed for concert halls and have achieved a certain degree of success, opera houses and theaters do not receive the same attention. Stage performers are left to adapt to the acoustics of the space as best they can. [2] In many cases, performers find it difficult to hear themselves or each other intelligibly and thus fail to achieve their best tonality; in extreme cases, they fail to attain pitch accuracy and coherency, individually or in ensemble. This may result in a less-than-satisfactory performance. Performer-audience communication is then not achieved, which ultimately affects how the audience rates the performance and possibly the acoustics of the venue. It is strongly suggested that stage acoustics demands as much attention as auditorium acoustics deserves. It is logical that optimal stage acoustics is fundamental to a good overall rating of the acoustics in a performance space. (Visual appeal also plays a role in the audience's rating of the acoustics, but it is outside the scope of this thesis.)
There is currently no parameter in international standards quantifying stage acoustics. Among all acoustical parameters widely used in the industry, only one is generally accepted as a means of quantifying the ease of listening and performing on stage: Support (ST1, ST2), first proposed by A.C. Gade in 1989, which is intended to measure the contribution of early reflections to the sound from the musician's own instrument. [3]
Gade's proposal is, however, limited to instrumentalists. For singers and actors (or "voice performers," the consistent terminology for the rest of the thesis), whose instrument is the human voice, Support cannot be applied directly because of the influence of bone conduction in the perception of self-voice. Moreover, Support fails to address the frequency spectrum and orientations of early reflections, and the directions of late reverberation, which might be determining factors as well. The practicality of using a single parameter to represent voice performers' preferred stage acoustic condition remains uncertain.
But one thing is clear: whether the goal is to propose a new acoustical parameter or to validate the effectiveness of a stage acoustics design, subjective preference testing is the only viable means of solving the problem. Every human being is unique, and our preferences for a certain acoustic condition remain highly subjective and may vary enormously. The preferred stage acoustic condition depends on one's own voice quality and auditory behavior.
A new study in stage acoustics, called Voice Stage Support (VSS), is proposed to investigate auditory feedback on stage for professional voice performers. It thus excludes normal speech communication among the general population. It is not the objective of this thesis to comprehensively define VSS or to devise a new parameter experimentally. It is rather an initiation of groundwork in promoting the study of this uniquely different field in opera house and theater acoustics, which involves acoustical design, psychoacoustics and performance psychology.
One may argue that generalizing auditory preferences for the entire human population is tremendously difficult, if not impossible; it remains a challenge for acousticians.
From error to error, one discovers the entire truth.
- Sigmund Freud
1.1 Aim of the Thesis

As discussed above, in order to study Voice Stage Support (VSS), subjective preference tests are inevitable. There are some prominent difficulties in conducting such tests. In statistical analysis, the key to success is a large number of samples in the population. Thus, it takes a long time for any single researcher to acquire enough data for analysis and arrive at a convincing conclusion. This is particularly difficult for VSS because professional voice performers constitute only a very small portion of the human population. It will take the efforts of numerous studies before any comprehensive theory of VSS can be accomplished. The more subjective data collected, the better the development of the study.
Unlike most auditory experiments, which involve external stimuli (sound sources outside one's body), the perception of self-voice strictly requires one's own vocalization to generate the sound stimuli for the test. This demands real-time auralization of auditory scenes and precludes the use of pre-recorded and pre-processed test stimuli.
The demand for real-time auralization implies a system with low propagation delay (or processing delay). Similar previous acoustical studies often required specialized digital signal processing (DSP) equipment, which means very few facilities are equipped to repeat such tests and generate compatible results. However, there are now alternatives, since digital audio signal processing has become more widely available and affordable in the professional audio industry.
The argument here is that an easily obtainable, reproducible and repeatable setup for real-time auralization of self-voice would greatly promote this field of study by enabling more researchers who have access to professional singers and actors to conduct such tests, thus enlarging the sample base in the aggregation of compatible data for long-term statistical analysis.
The first objective of this thesis is to verify the reliability of a more accessible real-time auralization setup, compared to previous experimental systems found in the literature. It is also a step toward standardizing psychoacoustical experimental procedures involving real-time self-voice auralization and the respective data acquisition, both in terms of hardware setup and subject conditioning, for the purpose of voice stage support study. This thesis includes a subjective evaluation of the stage in a 200-seat playhouse. Various stage locations and head orientations are compared using the proposed auralization setup and procedures.
1.2 Historical Review

This section briefly covers the issues related to the thesis. It first summarizes the field of stage acoustics and support for performers, followed by the differences and difficulties of support for voice performers as compared to musicians (sections 1.2.1 & 1.2.2). The psychophysics of self-voice perception is introduced in section 1.2.3. Previous subjective preference tests and their auralization setups are reviewed in section 1.2.4.
1.2.1 Stage Acoustics and Support

Stage acoustics can be defined as the study of the acoustic conditions where the performers are located in a performance. On many occasions, performers are located in a stage house or a stage volume which is spatially distinct (yet not isolated) from the physical volume of the auditorium. This is particularly obvious in proscenium theaters and opera houses. In other settings, such as theater-in-the-round or multi-functional/modular theaters, the separation between stage and auditorium acoustic space is less distinct, and the two may overlap. Whatever the setting, performers demand a certain condition so that they can perform comfortably.
Stage support usually refers to the amount of auditory feedback from one's own instrument, which enables performers to hear themselves with ease so that they do not need to force the instrument to develop the tone. In A.C. Gade's pioneering work [5], it is translated into an objective parameter, SUPPORT, which includes three measures of energy ratios (ST1, ST2 and STlate) in the sound field:
ST1 = 10 log10 [ E(20, 100 ms) / E(0, 10 ms) ]

ST2 = 10 log10 [ E(20, 200 ms) / E(0, 10 ms) ]

STlate = 10 log10 [ E(100, ∞ ms) / E(0, 10 ms) ]
After several applications and analyses, these were later revised by Gade [6] as:

STearly = 10 log10 [ E(20, 100 ms) / E(0, 10 ms) ]

STtotal = 10 log10 [ E(20, 1000 ms) / E(0, 10 ms) ]

STlate = 10 log10 [ E(100, 1000 ms) / E(0, 10 ms) ]

where E(t1, t2) stands for the time integral of the squared pressure signal of an impulse response between the time limits given in the parentheses. In the above definitions, t = 0 is the arrival time of the direct sound. Units are in dB.
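As an illustration, Gade's revised measures can be computed from a measured impulse response along the following lines. This is a minimal sketch, not part of Gade's specification: the function name is mine, and locating the direct-sound arrival by the peak of the impulse response is a simplifying assumption.

```python
import numpy as np

def support_measures(ir, fs, t0=None):
    """Compute Gade's revised Support measures from an impulse response.
    ir: pressure impulse response sampled at fs (Hz), measured 1 m from
    the source; t0: sample index of the direct-sound arrival (defaults
    to the peak of |ir|, an assumption for this sketch)."""
    if t0 is None:
        t0 = int(np.argmax(np.abs(ir)))

    def E(a_ms, b_ms):
        # time integral of the squared pressure between a_ms and b_ms after t0
        a = t0 + int(a_ms * fs / 1000)
        b = t0 + int(b_ms * fs / 1000)
        return np.sum(ir[a:b] ** 2)

    ref = E(0, 10)                                   # direct sound, 0-10 ms
    st_early = 10 * np.log10(E(20, 100) / ref)
    st_total = 10 * np.log10(E(20, 1000) / ref)
    st_late = 10 * np.log10(E(100, 1000) / ref)
    return st_early, st_total, st_late
```

Since the integration intervals of STearly and STlate are subsets of that of STtotal, the sketch necessarily yields STearly ≤ STtotal and STlate ≤ STtotal for any impulse response.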
SUPPORT has been applied in various studies of acoustics for performers [7][8][9], and generally agrees with performers' subjective preferences. Nevertheless, a few points attract our attention. Firstly, Gade's setup specifies that the measurement microphone be positioned one meter (roughly the maximum distance between the player's ears and his instrument) in front of the sound source. Secondly, a single microphone is used in the measurement.
Gade reported that STearly unexpectedly succeeded in describing the ease of hearing other musicians rather than serving its intended purpose [6]. Although its reliability has yet to be ascertained over a longer period of time, some acoustical consultants have been using it as a parametric guideline.
However closely related to performers' support, it is not applicable in the case of singers and actors, because the instrument concerned, the human voice, is in close proximity to the ears, and there exists a fundamental difference between the perception of self-voice and that of any other musical instrument.
1.2.2 Previous Research on Subjective Preferences in Stage Acoustics

All previous research indicated musicians' preference (including instrumentalists and singers) for early reflections in support of their performance. Marshall and Meyer, in 1985, reported that singers prefer a strong presence of reverberation, while early reflections were only weakly preferred. [10]
Noson, later in 2000, reported that singers preferred longer reflection delay times than musicians due to the masking effect of bone-conducted sound inside the head [11]. He also discovered that the melisma singing style (non-plosive, non-fricative syllables) resulted in a shift in the preferred delay time of reflections [12]. This indicated that subjective preference depends on the content of the sound source. Noson's work also showed that singers' subjective preference for the delay time of a single reflection is proportional to the minimum effective duration (τe)min of the running autocorrelation function (ACF) of the sound source. This is in direct agreement with Ando's previous research on audiences' subjective preferences in concert halls [13]. Ando's proposal is thus believed to be applicable to musicians and singers as well.
Noson's work strongly supports the view that the unique nature of self-voice perception is the most significant factor contributing to a different preference pattern for voice performers as compared to instrumentalists.
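The effective duration τe used by Noson and Ando can be estimated from a signal frame roughly as sketched below. This is a simplified illustration, not the exact procedure from the literature: the function name and threshold handling are mine, and the ACF decay envelope is approximated by a running maximum over later lags rather than the regression-line fit to the envelope peaks that the literature uses.

```python
import numpy as np

def effective_duration(frame, fs, max_lag_ms=200.0):
    """Approximate tau_e: the lag (in ms) at which the envelope of the
    normalized autocorrelation function decays below 0.1 (-10 dB).
    Simplification: the envelope is taken as the running maximum of |ACF|
    over all later lags, instead of a regression line fitted to the peaks."""
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    acf = acf / acf[0]                             # normalize so phi(0) = 1
    n = int(fs * max_lag_ms / 1000.0)
    mag = np.abs(acf[:n])
    env = np.maximum.accumulate(mag[::-1])[::-1]   # monotone decay envelope
    below = np.nonzero(env < 0.1)[0]
    return below[0] * 1000.0 / fs if below.size else max_lag_ms
```

Intuitively, a noisy or fast-changing signal (rapid speech) decorrelates quickly and yields a short τe, while a sustained periodic signal (a held vowel or melisma) stays self-similar and yields a long τe, which is why pace of speech can be monitored through this measure.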
1.2.3 Self-Voice Perception

Perception of self-voice is constituted by air conduction (sound waves propagating from mouth to ear) and bone conduction (vibrations from the voice organs to the ear inside the human head).
The air conduction path consists mainly of the diffraction of sound coming out of the mouth opening, across the surface of the head and into the ear canal. It also includes all transmissions of vibrations of the vocal tissues from the surface of the head into the air,
and back to the ear canal. However, this latter component is believed to make a negligible contribution to our hearing [14].
The role of bone conduction was not well understood until Georg von Békésy [15], in 1949, identified bone conduction and air conduction as the sound paths pertinent to perceiving one's self-voice. Estimations derived from his various investigations show that the perceived loudness of bone conduction is of the same order of magnitude as that of air conduction. According to a more contemporary study of bone conduction by Stenfelt and Goode [16], the bone conduction path can be divided into four components: (1) sound radiation into the ear canal, (2) inertial motion of the middle ear ossicles, (3) inertial motion of fluid in the cochlea, and (4) compression and expansion of the bone encapsulating the cochlea.
On most occasions, the natural human voice is heard within an acoustic environment. With the inclusion of the acoustic space, the air conduction path can be further divided into two: direct air conduction (mouth-to-ear) and indirect air conduction (specular reflections from boundaries in the acoustic environment).
Hence, the paths constituting the perception of self-voice can be summarized as:
Direct Air Conduction (Direct AC) - from mouth to ear
Bone Conduction (BC) - through the skull
Indirect Air Conduction (Indirect AC) - reflections of the voice off room boundaries
*Direct AC, BC and Indirect AC are used throughout this thesis to denote the above auditory paths.
Their relationship is represented graphically, in a simplified fashion, in Figure 1.
Figure 1. Components of perception of self-voice
The spectral characteristics of the above pathways can be identified with human subjects. For Direct AC, this is usually obtained by measuring the transfer function between the sound pressure at microphones placed at the mouth reference point (MRP) and the ear reference point (ERP) in an anechoic environment, in which human subjects recite a selection of words effectively covering the vocal frequency range, as demonstrated by Pörschmann [17] as well as Williams and Barnes [18]. For BC, direct measurement cannot be applied. It is determined by measuring the masked threshold of pure tones or narrow-band noise while the air-conducted sound is removed (or highly attenuated) [17]. In general, the threshold increases as frequency rises. Nevertheless, in Stenfelt's research [19], it was found that sensitivity in loudness perception is higher in bone conduction than in air conduction, and this trend becomes progressively more drastic as listening level increases, suggesting that the loudness contour of bone conduction is different from that of air conduction. (The air conduction loudness contour refers to the Fletcher and Munson curves of 1933. [20]) To determine Indirect AC, a method similar to that for Direct AC can be used with human subjects. Binaural receivers can be fitted in the subjects' ear canals. By
measuring the impulse response between the source at the MRP and the binaural receivers in the ears, the transfer function of the room can be determined. The Indirect AC can then be isolated by properly removing the direct sound from the impulse response.
For Direct AC and BC, an average result can be collected from a group of subjects in the laboratory. For Indirect AC, however, this requires bringing a large number of subjects to each acoustic condition being examined (i.e., different concert halls and different stage positions), which is impractical in most cases. An alternative approach is discussed in Section 2.1 of this thesis.
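The direct-sound removal step mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the guard-window length are mine, the direct sound is located by the peak of the impulse response, and a hard cut is used where a practical implementation would apply a short fade to avoid a discontinuity.

```python
import numpy as np

def isolate_indirect_ac(ir, fs, guard_ms=5.0):
    """Remove the direct sound from a mouth-to-ear room impulse response,
    leaving only the room reflections (the Indirect AC component).
    guard_ms: window after the direct-sound peak that is zeroed out; an
    assumed value, which must end before the first reflection arrives."""
    out = ir.astype(float).copy()
    t0 = int(np.argmax(np.abs(out)))               # direct-sound arrival
    out[: t0 + int(guard_ms * fs / 1000.0)] = 0.0  # zero direct sound
    return out
```

The choice of guard window is geometry-dependent: it must be long enough to cover the direct sound (and any microphone-body scattering around it) but shorter than the earliest boundary reflection for the measured position.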
1.2.4 Experimental Setup in Previous Self-voice Auralization and Related SoundField Simulation
In the subjective evaluation of a sound field for singers or actors, owing to the use
of self-voice as the sound stimulus, one must create an auralization setup capable of
reproducing (1) the direct mouth-to-ear air conduction (if that path is obstructed by
the reproduction system) and (2) the convolution product of the live signal and the
impulse response of the sound field under test, all in real time.
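These two required components can be sketched offline as the sum of two signal paths. The following is only a minimal illustration of the idea, not the setup of any of the cited studies; the function and variable names are assumptions, and a real system must perform both operations in real time with low latency.

```python
import numpy as np

def auralize_self_voice(voice, fs, direct_ir, room_ir, direct_delay_s):
    """Offline sketch of the two paths a self-voice auralization system
    must reproduce: (1) the direct mouth-to-ear path, modeled here as a
    short propagation delay plus a filter, and (2) the room response,
    i.e. the live voice convolved with a room impulse response."""
    delay = int(round(direct_delay_s * fs))
    direct = np.convolve(voice, direct_ir)              # mouth-to-ear filter
    direct = np.concatenate([np.zeros(delay), direct])  # propagation delay
    room = np.convolve(voice, room_ir)                  # reverberant path
    out = np.zeros(max(len(direct), len(room)))
    out[:len(direct)] += direct
    out[:len(room)] += room
    return out
```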
At the source pickup end, there is no consistency among previous experiments. A
microphone is usually placed in front of the mouth, but the microphone type and
microphone-to-mouth distance vary greatly between studies. In Marshall and Meyer's
setup [10], a cardioid microphone is placed at 0.5 m, pointing directly at the mouth;
in Noson's setup [11], a small headset microphone (no polar pattern specified) is
located 10 cm in front of and 5 cm below the mouth; and in Pörschmann's experiment [21],
a Sennheiser KE4 omni-directional miniature microphone is positioned precisely at the
mouth reference point (MRP), 40 mm in front of the lips, with a holding device attached
to the headphone's harness.
At the sound reproduction end, there are generally two different approaches: (1)
spatially distributed loudspeakers for reproduction of delayed reflections, as found in
Marshall and Noson; and (2) open-back circumaural headphones with compensation filters
for binaural sound-field simulation, as found in Pörschmann. The two reproduction
approaches have their pros and cons. Using loudspeakers inherently creates a possible
acoustic feedback path: a delayed reflection is picked up by the microphone and further
reflections are then regenerated through the system. This leads to unintended stimuli
and ultimately affects the accuracy of the subjective test. The advantage is that
subjects are free of any body attachment. However, a loudspeaker system requires
comparatively more space and is usually not portable.
The headphone system, on the contrary, is less demanding of laboratory space and is
fairly portable and easy to set up. The disadvantage of a headphone system is the
necessity of implementing a compensation filter for the direct mouth-to-ear air
conduction path because, even when open-back circumaural headphones are used, the
headphone enclosures inevitably impose an insertion loss between the mouth and the
ears; high frequencies are usually attenuated. Moreover, the compensated (filtered)
signal must be delayed before reaching the headphones so that it is in sync with the
natural air conduction, to avoid a comb-filtering effect. Pörschmann [21] has shown
that such a reproduction scheme can achieve a certain degree of naturalness in a
virtual auditory environment. Another issue with headphones is the occlusion effect
and the return of radiation by the human head. The occlusion effect refers to the
accentuation of low-frequency sensitivity when the ear canal is obstructed; details
can be found in the literature by Tonndorf [22] and Dean [23]. Open-back headphones
can minimize this effect and have been accepted and used in experiments, provided
that there is enough padding between the headphone hardware and the test subject.
1.3 Thesis Outline
In Chapter 2, the proposed self-voice auralization system dedicated to the investigation
of voice stage support is introduced. It includes the binaural impulse response
acquisition system, the binaural auralization system and the subjective preference test
procedures. A validation test with human subjects, obtaining subjective ratings of the
naturalness of the system, is also reported; the results were compared with evaluations
of the setups used in previous studies. The proposed system was then used to investigate
the stage acoustic conditions of a playhouse. Chapter 3 presents the results and
analysis of actors' subjective preferences for various stage locations and head
orientations. Chapter 4 discusses the experimental results, followed by suggestions
for future work in the field of voice stage support.
2. Self-voice Auralization System: Design and Implementation
2.1 Experimental Design Concept

As discussed in section 1.2.3, using human subjects to obtain averaged Indirect AC data
is impractical when many acoustic spaces and conditions have to be examined. Portability
and repeatability were therefore the first criteria of the current design. To achieve
them, a dummy head was proposed to substitute for the human head in the measurement and
acquisition process, with an artificial mouth representing the human voice source. Dunn
and Farnsworth [24] showed that a person's own voice can be modeled by a source at the
opening of the mouth; a similar approach has been taken and examined by Bozzoli,
Viktorovitch and Farina [25].
The design consists of three basic components:
- Binaural Room Impulse Response (BRIR) Acquisition
- Binaural Real-time Auralization
- Experimental Procedures for Test Subjects
Since the experimental design was logically driven by the implementation of auralizing
the conduction paths, the design of the real-time auralization is described first
(section 2.3), followed by the BRIR acquisition (section 2.4) and the experimental
procedures (section 2.5).
2.2 Measurement Setup
In this thesis, the acoustical measurement system was Electronic & Acoustic System
Evaluation & Response Analysis (EASERA) v1.0.60 software running on a Pentium M based
PC, with a Sound Devices USB Pre (USB-powered audio interface) for audio input/output.
The sampling rate was set at 48 kHz and the bit depth at 16 bits. The excitation signal
was a pink-weighted sine sweep with 1 pre-send and 3 averages, customized in EASERA,
unless otherwise stated.
Two measurement microphones were used: an Earthwork M30 and a Countryman B3, both
omni-directional (Figure 2 & Figure 3). The M30 was chosen for its high sound-pressure
capability, whereas the B3 was chosen for its compact size, for measurement positions
the M30 cannot reach. Their transfer functions were first measured in order to
compensate for their difference in frequency characteristics when they were used
simultaneously. A pink sine sweep was produced by a Yamaha MSP5 2-way studio monitor
loudspeaker at a distance of 1 m in front of the microphones. The microphones' outputs
were recorded using the EASERA setup above, and the transfer function between them was
then obtained for use as an equalization function in subsequent calculations. Figure 4
shows the microphones' transfer function.
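The equalization function can be understood as a regularized spectral ratio between simultaneous recordings of the two microphones. The sketch below illustrates that idea only; it is not EASERA's internal procedure, and the function name and regularization constant are assumptions.

```python
import numpy as np

def relative_transfer_function(ref, test, fs, eps=1e-10):
    """Estimate the transfer function from a reference microphone (e.g.
    the M30) to a second microphone (e.g. the B3) recorded simultaneously
    from the same source, as a frequency-domain spectral ratio.

    ref, test : equal-length time-domain recordings at sampling rate fs.
    Returns (frequencies in Hz, magnitude in dB).
    """
    n = len(ref)
    REF = np.fft.rfft(ref, n)
    TEST = np.fft.rfft(test, n)
    # Spectral division with a small regularization term so bins where
    # the reference spectrum is near zero do not blow up.
    H = TEST * np.conj(REF) / (np.abs(REF) ** 2 + eps)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    mag_db = 20.0 * np.log10(np.abs(H) + eps)
    return freqs, mag_db
```

In subsequent measurements the resulting magnitude curve would simply be added (in dB) to the B3's readings to refer them back to the M30.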
Figure 2 Earthwork M30 omni-directional microphone
Figure 3 Countryman B3 miniature omni-directional microphone
Unless otherwise stated, the loudspeaker used as the excitation source is an artificial
mouth in a dummy head; the detailed structure of the dummy head is described in section
2.4. All measurements were conducted in a hemi-anechoic chamber unless otherwise stated.
Figure 4 Transfer function from Earthwork M30 to Countryman B3
2.3 Binaural Real-time Auralization System

2.3.1 System Overview
In this research, only Direct AC and Indirect AC need to be implemented: since the human
subject's own voice is used as the sound stimulus in real-time auralization, the
bone-conduction component is produced naturally inside the subject's head. The auralization
system used a topology of two separate paths to model the direct air conduction (Direct
AC) and indirect air conduction (Indirect AC). All auralizations were conducted in a
hemi-anechoic chamber.
Figure 5 Binaural self-voice auralization system block diagram
Figure 5 shows the system block diagram. The setup used an Earthwork M30
omni-directional microphone to pick up the subject's voice. It was positioned at the
mouth reference point MRP (80 mm from the lips), separated from the subject's mouth by
a metal-grille pop filter mounted 40 mm from the diaphragm so as to eliminate microphone
diaphragm excursion caused by plosive sounds. Figure 6 shows the relationship between
the subject's mouth, the pop filter and the microphone. The microphone signal was split
into two and connected to input channels 1 & 2 (Ch1 & Ch2) on a Yamaha 02R digital
mixing console, with identical and repeatable gain settings using the step-gain control
on the pre-amplifiers. The gain setting was optimized to achieve a peak at -10 dBFS
using a microphone calibrator producing a 1 kHz sine tone at 105 dBA. The resulting
line-level signals were routed to two paths, Path 1 & Path 2, modeling the Direct AC
and Indirect AC respectively.
Figure 6 Test subject with microphone and pop filter.
(Path 1) Through the channel-insert-send before the A/D stage on the Ch1, the pre-
amplified signal was connected to a Klark Teknik DN-716 single-channel digital delay
unit (with built-in 16-bit A/D & D/A conversion) cascaded with a Rane PE-15 4-band
analog parametric equalizer. The analog output was returned to the channel-insert-return
of Ch1, going into the A/D conversion stage on the 02R. The delay unit and parametric
equalizer were used to model the mouth-to-ear propagation delay and transfer function
respectively. Their implementations are further described in section 2.3.2. The 02R on-
board digital equalizer on Ch1 was used as the compensation filter for the insertion loss
introduced by the auralization headphones. Details are described in section 2.3.4.
(Path 2) Through Ch2, the signal was A/D converted and digitally routed, via an optical
connection (TOSLINK) to a Digidesign TDM MIX digital audio workstation (Motorola
DSP-based PCI mixing engine running in a Macintosh dual processor 500MHz G4 com-
puter) using a Digidesign ADAT Bridge digital interface. The workstation was running
ProTools audio software with Waves IR1 (dual-channel convolution reverberation plug-
in) through which a BRIR can be loaded and convolved with the incoming signal. The
output (convolved) signal was returned to the ADAT IN (TAPE IN 1) on the 02R mixer
via the ADAT Bridge digitally. The setup described was used to model the Indirect AC,
also called the room response. The convolution implementation is further described in
section 2.3.3.
Both returns from Path 1 and Path 2 were internally routed to the 02R's main stereo
output in the digital domain.
The stereo output of the 02R was connected to a Samson HP-5 headphone amplifier
driving a pair of Audio-Technica ATH-A700 open-back headphones. A compensation
filter was implemented using the on-board equalizer on the 02R stereo output channel to
remedy the frequency anomalies induced by the headphones; it is described in section
2.3.4.
The A/D & D/A conversions in the 02R and DN-716 are all 16-bit, 48 kHz, and each
conversion stage introduces a processing delay of 0.02 ms. The processing delay of
Path 1 measured 0.14 ms (A/D & D/A conversion and the filter network in the DN-716,
plus the conversion stages in the 02R) with the DN-716 at its lowest setting of 0.01 ms,
whereas the processing delay of Path 2 measured 11.74 ms (A/D & D/A conversion plus the
11.6 ms latency of IR1) with IR1 engaged and loaded with a 3-second-long unit-sample
sequence. All levels were set at unity gain during the delay measurement.
2.3.2 Implementation of Direct Air Conduction Modeling
In Path 1, which is designed to model the Direct AC, the MRP-to-ERP transfer function
(measured in section 2.6.2.3) is approximated using the PE-15 four-band parametric
equalizer, the DN-716 digital delay and the internal equalizer on the 02R.
2.3.2.1 Determining the PE-15 filter setting

The frequency response of the MRP-to-ERP impulse response was approximated using an
analog parametric filter. The precise filter settings were determined by overlaying the
transfer function of the PE-15 against the MRP-to-ERP magnitude-spectrum plot. Using the
Live mode in EASERA, the MRP-to-ERP plot was pre-loaded; pink noise was fed to the PE-15
at line level and its output was connected directly back to EASERA to obtain a live
magnitude-spectrum plot while adjusting the PE-15 settings. Figure 7 shows an overlay
magnitude-spectrum plot of the MRP-to-ERP impulse response and the determined PE-15
filter settings (see Table 1). The plot was generated in Matlab.
Figure 7 MRP-to-ERP Transfer Function & Direct AC Filter using PE-15 parametric equalizer (1/3
octave smoothing)
Table 1. Direct AC Filter Settings (Rane PE-15)
Direct AC Filter Band 1 Band 2 Band 3 Output Level
Gain (dB) +4.0 -5.5 -8.0 -18.0
Frequency (Hz) 90 800 7k -
Q 1.2 0.26 0.45 -
2.3.2.2 Determining the DN-716 delay time

The initial arrival time of the MRP-to-ERP impulse response was implemented using a
digital delay line. The mean MRP-to-ERP propagation delay in humans is 300 μs (0.3 ms),
as reported by Pörschmann [17]. Thus, subtracting the 0.14 ms processing delay, the
delay time to be inserted is 0.16 ms (corresponding to a panel display of 0.17 ms on
the DN-716). The MRP-to-ERP transfer function measurement is described in section
2.6.2.3, and various delay times are evaluated in section 2.7.1.
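The subtraction above can be captured in a small helper (hypothetical, for illustration only): the delay dialed into the unit is the target mouth-to-ear propagation delay minus the electrical path's own processing latency.

```python
def direct_ac_delay_ms(propagation_ms=0.30, processing_ms=0.14):
    """Delay to insert in the Direct AC path so that the total electrical
    delay matches the mean MRP-to-ERP propagation delay. Defaults are the
    values from the text: 0.30 ms mean propagation delay minus the
    0.14 ms measured processing delay of Path 1."""
    if processing_ms > propagation_ms:
        # The electrical path is already too slow; no inserted delay can fix it.
        raise ValueError("processing latency exceeds the target delay")
    return propagation_ms - processing_ms
```

With the default values, `round(direct_ac_delay_ms(), 2)` gives 0.16, matching the 0.16 ms inserted into the DN-716.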
2.3.2.3 Determining the 02R parametric equalizer setting
The headphones used in auralization introduced an insertion loss in the Direct AC path.
As a result, Path 1 essentially functions as Direct AC modeling and compensation of in-
sertion loss (CIL) induced by the headphones. The CIL filter was implemented using the
digital parametric equalizer on Channel 1 in the 02R mixer.
Two microphones, the M30 and B3, were first calibrated for identical gain and then used
to measure the insertion loss of the headphones, as shown in Figure 8.
Figure 8 Setup for measuring the headphone's insertion loss using an isolation tube.
Figure 9 Isolation tube used in insertion loss measurement
An isolation tube (see Figure 9) was built to measure the insertion loss of the
headphones. The tube was 300 mm in length and 250 mm in diameter. It had a 50 mm thick
soft fiberglass outer shell with a thin layer of cotton lining the inner wall. The
headphone was carefully mounted to the tube opening and sealed with rubber to close any
air gaps. A Yamaha MSP5 2-way loudspeaker was used to generate the measurement signal,
while an M30 microphone was positioned close to the headphone enclosure outside the
tube and a B3 microphone was mounted 10 mm from the headphone transducer inside the
tube. The transfer function was recorded using EASERA. The inverted magnitude-spectrum
plot represents the compensation filter.
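The inversion step can be sketched as follows. The gain clamp is an added practical safeguard (an assumption, not part of the thesis procedure) so that deep measurement notches do not demand extreme equalizer boost:

```python
import numpy as np

def compensation_filter_db(measured_db, max_boost_db=12.0):
    """Invert a measured magnitude response (in dB) to obtain a
    compensation-filter target: 'the inverted magnitude-spectrum plot
    represents the compensation filter'. The clamp limits how much
    boost or cut the equalizer is asked to provide."""
    comp = -np.asarray(measured_db, dtype=float)
    return np.clip(comp, -max_boost_db, max_boost_db)
```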
The internal digital parametric equalizer in the 02R was used to approximate the
compensation filter response using an adjustment method similar to that described above
for the PE-15 (section 2.3.2.1). To ensure unity gain through the 02R during filter
adjustment, a sine tone was fed from EASERA and split into two: one branch was routed
back to EASERA channel 1, and the other was passed through the 02R and returned to
EASERA channel 2. The CIL filter implemented in the 02R is shown in Table 2.
Table 2. CIL filter settings (digital parametric equalizer on 02R)
CIL filter Band 1 Band 2 Band 3 Band 4
Gain (dB) +2.0 +3.0 - -
Frequency (Hz) 4k 10k - -
Q 0.2 0.1 - -
2.3.3 Implementation of Indirect Air Conduction Modeling
In Path 2, which modeled Indirect AC, a real-time binaural convolution was applied
using the Waves IR1 Convolution Reverb plug-in (see Figure 10). In order to time-align
correctly, the room impulse response to be convolved was first trimmed to eliminate the
direct sound. The length of the trim was determined by the propagation delay in Path 2
which measured 11.74ms (see Figure 11). A Hann window was applied to the trimmed
impulse response before importing to IR-1. A shortcoming resulting from this latency is
the inability to reproduce the portion of the room response between the Direct AC and
the early sound field up to 11.74 ms (approximately 12 feet of travel distance, on the
order of the height of a 6-foot-tall person); this portion may include diffraction from
the subject's own body and the first back-scattered sound from the floor or other
nearby boundaries. Nevertheless, the focus of the current research is stage acoustics,
which seldom involves boundaries in close proximity (at least not in the case of this
thesis). Also, there is no direct specular reflection path between the mouth and the
floor, and the back-scattered rays from the floor were assumed to have minimal
influence on the perception of self-voice.
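The trimming step can be sketched as below. The 11.74 ms trim length is the measured Path 2 latency from the text; the length and placement of the window are assumptions, since the thesis does not state exactly how the Hann window was applied (a rising half-Hann fade-in at the cut point is used here).

```python
import numpy as np

def trim_brir(brir, fs=48000, latency_ms=11.74, fade_ms=1.0):
    """Trim a measured BRIR so that the portion reproduced by the
    convolver begins where the real-time path's processing latency ends,
    discarding the direct sound. A short half-Hann fade-in smooths the
    truncation point to avoid an audible click."""
    start = int(round(latency_ms * 1e-3 * fs))
    trimmed = np.array(brir[start:], dtype=float)
    n_fade = min(int(round(fade_ms * 1e-3 * fs)), len(trimmed))
    # Rising half of a Hann window, from 0 up toward 1.
    fade = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_fade) / max(n_fade, 1)))
    trimmed[:n_fade] *= fade
    return trimmed
```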
Figure 10 Waves IR1 Convolution Reverb, loaded with a 3-second unit sample sequence
Figure 11 Impulse response trimming before importing to IR-1
2.3.4 Implementation of Headphone Equalization
The frequency response of the headphones was measured in a hemi-anechoic chamber with
an M30 microphone positioned 10 mm in front of the headphone's transducer, and the
impulse response was recorded using EASERA. The internal digital parametric equalizer
on the 02R was used to approximate the headphone compensation filter using the method
described in section 2.3.2.1. The headphone's response is plotted against the inverted
compensation filter in Figure 12, and the filter settings are shown in Table 3.
Figure 12 Headphone response and compensation filter (02R master output)
Table 3. Headphone compensation filter settings (02R master output)
Headphone
compensation
Band 1 Band 2 Band 3 Band 4
Gain (dB) +8.0 +2.5 -4.0 -2.5
Frequency (Hz) 40 185 1.7k 9.1k
Q 0.1 0.3 1.0 1.2
2.4 BRIR Acquisition System
In order to achieve repeatability in binaural acquisition, a head simulator (sometimes
called a dummy head) was used. For the particular interest of this thesis, the sound
source and receivers correspond to the human mouth and ears; thus a loudspeaker and
microphones were installed inside the dummy head. The heart of the design was a Brüel &
Kjær Type 5930 head and torso simulator (HATS). The head geometry theoretically
represents an average of human head physical features, in compliance with ITU-T Rec.
P.58, IEC 60959 and ANSI S3.36-1985. It was retrofitted with a 50 mm diameter
loudspeaker unit inside the mouth cavity as an artificial mouth. The microphones mounted
inside the HATS were Brüel & Kjær Type 4010 omni-directional transducers, with the
grilles of the capsules aligned with the openings of the ear canals to act as binaural
receivers (see Appendix for the free-field microphone specifications). The structure of
the HATS is shown in Figure 13 and the position of the artificial mouth is illustrated
in Figure 14. Detailed dimensional information on the HATS can be obtained from the
Brüel & Kjær website (www.bksv.com).
Figure 13 Frontal plane section of HATS, showing binaural microphones and the related
fittings. (Adapted from manufacturer's manual)
Figure 14 Median plane section of HATS, showing the artificial mouth
To validate the representativeness of the HATS, a series of verification tests was
conducted to examine the binaural microphone characteristics, the artificial mouth
frequency response and the MRP-to-ERP transfer function (see section 2.6).
In BRIR acquisition, the HATS was supported by a microphone stand such that its height
measured 5 ft 7 in (67 inches, approximately 1.7 m), which is about the mean height of
men and women aged 20 to 74 as reported by the U.S. Department of Health and Human
Services in 2004 [26]. The binaural microphones inside the HATS were connected to the
EASERA measurement system during acquisition. Before data acquisition, their gain
settings were optimized to -10 dBFS using a microphone calibrator producing a 1 kHz
sine tone at 105 dBA. Since the binaural microphones cannot easily be removed from the
HATS for calibration, and any such repetitive preparation could introduce microphone
position misalignment, a compromise approach was adopted: the calibrator was positioned
as close to each binaural microphone as possible while remaining on axis.
The artificial mouth was driven by a Samson Servo 170 power amplifier which has
a published linear frequency response between 20Hz and 20kHz. Figure 15 shows the
BRIR acquisition system block diagram.
Figure 15 Binaural room impulse response acquisition system block diagram, showing how the bin-
aural ears and artificial mouth of the HATS are connected
As described in section 2.3.3, the binaural room impulse response was trimmed such
that the direct sound (including any contribution from internal (bone) conduction and
direct air conduction) in the BRIR was not included in the convolution.
2.5 Experimental Procedures in Subjective Test
2.5.1 Test Subject Conditioning

In the current research it was inevitable to use the test subjects' own voices in
real-time auralization; the sound stimuli are thus highly variable and may lead to
erroneous results owing to variance in the subjects' conditions, both physiological
(e.g., vocal fatigue) and psychological (e.g., personal emotion). In order to minimize
these experimental errors, a set of experimental procedures for human subjects was
adapted and modified from the method used by Jónsdóttir et al. [27]. In each
measurement, subjects were asked to read a piece of text at least twice before
subjective scoring. For each subject, the same set of tests was repeated 6 times within
a period of 21 days; three of the trials were held in the morning/midday, and the other
three in the late afternoon/early evening. Before each experimental session, subjects
were asked to warm up their voices to performing condition (which takes about 10-20
minutes), and subjects confirmed they were under no influence of drugs or alcohol.
These procedures were expected to lessen the impact of subjects' individual conditions
over a period of time.
2.5.2 Use of Dramatic Text in the Study of Actors

The objective of the thesis is to investigate voice support in theaters, and so the
subjects were professionally trained actors. Instead of sentences from the Harvard
Psychoacoustic Sentence Lists (often recommended for psychoacoustic research on
speech), a short edited excerpt from Shakespeare's play Hamlet was chosen for its
well-known dramatic expression and its inclusion of most vowels and consonants in
English.
To be, or not to be; that is the question;
To die, to sleep, no more;
and by a sleep to say we end the heart-ache
and the thousand natural shocks.
The reason for using a dramatic text was that actors seldom read sentences without
literal meaning; the Harvard sentences are far from reality and were considered
unrepresentative of acting in theater. The second argument, particular to this thesis,
was that actors always know what they are going to say (as they have rehearsed before
the performance), and because the current research concerns self-perception, there is
no issue of intelligibility of unexpected words or vowel sounds from an unknown sound
source.
Subjects were given a sample recording of the text prior to each test session to get
accustomed to the rhythm and speed of the speech, and the entire test was recorded and
analyzed for pace afterwards.
2.5.3 Verifying the Consistency of Self-voice Stimuli by Monitoring the Pace of Speech

A method of pace analysis was developed with reference to Ando's work on the
relationship between subjective preference and objective parameters. Ando found that
the minimum effective duration (τe)min of the running autocorrelation function (ACF)
of the sound source is proportional to the most preferred delay of a single reflection
[28]. Ando's results suggest that the faster the tempo of the stimuli, the lower the
resulting (τe)min and thus the shorter the preferred delay of the first reflection. In
the current thesis, this parameter was used as a reference for the pace of speech.
The ACF is defined by:

    Φp(τ) = lim(T→∞) (1/2T) ∫ from −T to +T of p′(t) p′(t + τ) dt

where p′(t) = p(t) * s(t), and s(t) corresponds to ear sensitivity, chosen as the
impulse response of an A-weighting filter, as suggested by Ando.

The normalized ACF is expressed as:

    φp(τ) = Φp(τ) / Φp(0)

The effective duration of the envelope of the normalized ACF is denoted τe, defined by
the initial 10 dB of the ACF decay (also called the ten-percent decay), obtained by
linear regression.

τe is obtained at 100 ms intervals of the running source signal, and the minimum value
is denoted (τe)min. Figure 16 shows the values of τe against elapsed time for two
speech recordings. The (τe)min for samples 1 and 2 is 0.39 s and 0.37 s respectively,
indicating that a slight variation in the pace of speech does not change the overall
temporal characteristics.
Figure 16 Example plot of the effective duration of the running autocorrelation
function for two recordings of the same text at different paces, showing the use of
(τe)min as a temporal reference for monitoring subjective testing.
From Figure 16, a slight shift of the peaks can be observed, which indicates the
different paces of speech in the two samples (red is faster and blue is slower).
Through a number of trials, it was shown that (τe)min can be used as a robust
quantifier for verifying the speech pace in subjective testing.

By calculating (τe)min on each trial, the pace was controlled by maintaining a
deviation of less than 5% from the (τe)min value of the sample recording. This
procedure ensured that all subjects gave their preference scores under the same
conditions, within a fixed tolerance.
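A simplified computation of τe and (τe)min following the description above can be sketched as follows. The A-weighting pre-filter used in the thesis is omitted for brevity, and the analysis frame length is an assumption (the thesis states only the 100 ms interval), so the numbers this sketch produces are illustrative only.

```python
import numpy as np

def effective_duration(frame, fs, max_lag_ms=200.0):
    """Effective duration tau_e of one signal frame: fit a straight line
    (linear regression) to 10*log10|normalized ACF| over its initial
    decay, and return the lag in seconds at which the fit reaches -10 dB."""
    frame = np.asarray(frame, dtype=float) - np.mean(frame)
    n = len(frame)
    # FFT-based autocorrelation, zero-padded to avoid circular wrap-around.
    nfft = 1 << (2 * n - 1).bit_length()
    spec = np.fft.rfft(frame, nfft)
    acf = np.fft.irfft(spec * np.conj(spec))[:n]
    acf = acf / acf[0]                                # phi(0) = 1
    max_lag = min(int(max_lag_ms * 1e-3 * fs), n - 1)
    lags = np.arange(1, max_lag) / fs
    level_db = 10.0 * np.log10(np.abs(acf[1:max_lag]) + 1e-12)
    decay = level_db >= -10.0                         # initial decay only
    if decay.sum() < 2:
        return lags[0]
    slope, intercept = np.polyfit(lags[decay], level_db[decay], 1)
    return (-10.0 - intercept) / slope                # fit crosses -10 dB

def tau_e_min(signal, fs, frame_s=2.0, hop_s=0.1):
    """Minimum tau_e over a running analysis (100 ms hop as in the text;
    the 2 s frame length is an assumption)."""
    hop, frame_n = int(hop_s * fs), int(frame_s * fs)
    starts = range(0, len(signal) - frame_n + 1, hop)
    return min(effective_duration(signal[s:s + frame_n], fs) for s in starts)
```

In use, (τe)min of each trial recording would be compared against the (τe)min of the sample recording and accepted if it deviates by less than 5%.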
A total of 15 subjects participated in this project, but 2 of them were dropped because
they did not manage to attend all required tests. The final 13 subjects included 6
Caucasians, 2 Asians, 3 Black Americans and 2 Latin Americans.
All subjective tests in this thesis followed the above scheme unless otherwise stated.
2.6 HATS Verification Tests

All evaluation measurements of the HATS were conducted in an anechoic chamber at the
General Electric Laboratory, Niskayuna, NY (see Figure 17). The volume of the chamber
was 5100 cu. ft. and the background noise was rated at under 20 dBA. Room temperature
and humidity were measured before and after each experimental session and showed
negligible variation.
Figure 17 Verification Test of HATS in anechoic chamber at General Electric Laboratory (NY).
2.6.1 Binaural Microphones
The microphones at the two ears of the HATS were first calibrated using a sine-tone
calibrator (1 kHz at 105 dBA) to achieve identical gain. A dodecahedron loudspeaker was
positioned 1 meter away from the HATS, directly in front of the mouth opening. The
frequency responses were averaged over 3 measurements taken with different orientations
of the dodecahedron loudspeaker, to minimize any potential error caused by its
directional characteristics in the near field.
Figure 18 shows a frequency response overlay of the two binaural microphones.
Figure 18 Frequency response comparison between HATS binaural microphones
The result showed acceptable differences between the two binaural microphones; the
peaks and notches are due to the head-related transfer function and the pinnae of the
HATS.
2.6.2 Artificial Mouth
In BRIR acquisition, the loudspeaker unit inside the HATS was used to generate
excitation signals in the HATS mouth cavity; it is thus called the artificial mouth.
Since no specifications of this retrofitted artificial mouth were readily available,
its directivity, on-axis response, and MRP-to-ERP transfer function were measured.
2.6.2.1 Directivity

The directivity measurements were conducted at 15-degree resolution in the horizontal
plane (full 360 degrees) and the vertical plane (from -45 to 135 degrees). The HATS
remained stationary throughout the measurement session while the microphone position
was manually adjusted for each measurement.
Figure 19 Voice directivity of HATS (a) horizontal plane (b) vertical plane, in 4
octave bands (250 Hz, 500 Hz, 1 kHz & 2 kHz)
The directivity was plotted using Matlab and compared with data from previous
artificial heads and with mean values of human voices, as shown in Figure 20.
Figure 20 Comparison of voice directivity, in 3 octave bands (500 Hz, 1 kHz & 2 kHz)
The current HATS was found to be similar to the mean human values reported in the
literature. Although this does not directly imply higher reliability of the
experimental results, it does suggest that the HATS is a good representation of the
human voice source.
2.6.2.2 On-axis Frequency Response
The on-axis frequency response of the HATS was measured at 1m directly in front
of the artificial mouth. Figure 21 shows the frequency response. It was rather erratic,
which may have resulted from the construction of the HATS: the HATS itself has inherent
resonance characteristics, and the head cavity is not damped by any material.
The prominent peaks in the high-mid frequency range were believed to be related to the
resonant frequencies of the HATS.
Figure 21 On-axis frequency response of HATS artificial mouth. Overlays of no-smoothing and
1/6-octave smoothing.
Two resonant peaks were observed at 6.6 kHz and 7.6 kHz, roughly corresponding to
wavelengths of 0.05 m and 0.045 m. They were believed to be related to the position of
the loudspeaker unit with respect to the HATS internal cavity.
The current artificial mouth response was compared to another commercially available
model, B&K Type 4128C artificial mouth (see Figure 22).
Figure 22 Frequency response of B&K Artificial Mouth Type 4128C (adapted from
manufacturer's datasheet)
The frequency response of a loudspeaker unit coupled to the HATS is expected to display
anomalies due to interaction with the physical geometry; a flat frequency response is
hardly achievable in such a device. The assumption here is that, given the similar
voice directivity analyzed in section 2.6.2.1, the artificial mouth is acceptable for
the current study.
2.6.2.3 MRP-to-ERP Transfer Function

Two microphones were used in this measurement: the Earthwork M30 omni-directional
microphone and the Countryman B3 miniature omni-directional microphone. The two
microphones were first calibrated by measuring their transfer function (see section
2.2), which was used as an equalization function in the measurement.

The M30 and B3 were placed at the MRP and ERP respectively. The B3 capsule was
suspended so that there was no physical contact with the HATS, to eliminate any
internal vibration transmission from the artificial mouth to the microphone via the
HATS surface. The transfer function was measured in EASERA and compensated with the
equalization described above. The result is shown in Figure 23.
Figure 23 MRP-to-ERP transfer function of HATS (1/3-octave smoothing)
The above plot was smoothed and zoomed in order to compare with the averaged results
for human subjects in Pörschmann's study, as shown in Figure 24.
Figure 24 Averaged frequency response of MRP-to-ERP (Direct AC) for 18 human subjects.
The grey area marks the standard deviation. (Adapted from Pörschmann [2000])
The HATS frequency response was considered to fall roughly within the average human
values, except that it was slightly lower in the frequency range between 300 Hz and
2 kHz.
2.7 Subjective Test on Naturalness of the Auralization System

Considering the variation among human heads, the validity of the auralization setup
can be proven by subjective tests. The aim of this evaluation test is to find the
optimal delay time and filter implementation for Direct AC, as explained in sections
2.3.2.1 and 2.3.2.2. In the evaluation of the system's naturalness, Indirect AC (as
stated in section 1.2.3) was ignored.
2.7.1 Evaluation on Naturalness of CIL Filter Delay Time
The delay time implemented for Direct AC represents the propagation delay from the mouth opening to the ear canal entrance. It is critical in the auralization process because the delayed voice reproduced by the headphones is acoustically combined with the direct sound traveling from the mouth through the open-back headphones before entering the subject's ear canal. An improper delay time would induce perceivable echoes or comb-filtering effects. Since everyone's head geometry and facial features differ, the appropriate delay time may vary. It was assumed that the comb filters are inaudible if the separation in arrival time is short enough that the lowest null frequency of the comb filter lies beyond the audible spectrum. For instance, a time delay of 20 microseconds would produce comb filtering starting at a frequency of 25 kHz.
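The relationship between the delay and the first comb-filter null follows from destructive interference at half a period. A minimal sketch of this arithmetic (illustrative Python, not part of the original test software):

```python
def first_null_hz(delay_s: float) -> float:
    """First null of the comb filter formed when a signal is summed
    with a copy of itself delayed by delay_s seconds: destructive
    interference first occurs when the delay equals half a period,
    so f_null = 1 / (2 * delay)."""
    return 1.0 / (2.0 * delay_s)

# A 20-microsecond delay pushes the first null to about 25 kHz,
# beyond the audible range; the 0.14 ms system delay puts it near
# 3.6 kHz, well within it.
print(first_null_hz(20e-6))    # ~25000 Hz
print(first_null_hz(0.14e-3))  # ~3571 Hz
```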
This test aimed to find the most natural and representative delay time to use in auralization without compromising the perceptual naturalness of self-voice. The tests were conducted with 13 subjects following the procedures stated in section 2.5, using the auralization setup described in section 2.1.
During the test, subjects were exposed to a random sequence of delay times comprising 3 repetitions of the 8 delay settings. The random sequence was generated in MATLAB individually for each test set. For each setting, subjects were asked to read the given text in full twice, once with headphones and once without. Subjects were allowed to repeat reading and listening until they were ready to compare the sound of their own voice with headphones against that without headphones (the reference). The subjects then rated the degree of similarity on a 7-point category scale from 1 to 7 (1 being very dissimilar and 7 being very similar) for the given setting and notified the experimenter via the microphone before changing to the next delay setting.
The aims of this evaluation test were to find the optimal delay time for the CIL filter for auralization and to validate the effectiveness of the current system by comparison with results of previous experiments. Thus, the delay times evaluated were
specifically chosen to match those found in the literature in order to make a direct comparison (see Table 4).
Table 4. Delay times in the evaluation test of naturalness of the Direct AC insertion loss compensation filter (CIL filter). *The current system has a processing delay of 0.14 ms with a setting of 0.01 ms on the DN-716. **Pörschmann's tested delay times were based on taps of a 48 kHz Tucker-Davis DSP system; they are expressed here in milliseconds for ease of comparison.

Direct AC Delay Time        0.14ms   0.3ms     0.47ms    0.63ms     0.97ms     1.63ms     2.30ms     2.97ms
DN-716 Delay Setting*       0.01ms   0.17ms    0.34ms    0.50ms     0.84ms     1.50ms     2.17ms     2.84ms
Pörschmann's Test [2001]**  -        0.3ms     0.47ms    0.63ms     0.97ms     1.63ms     2.30ms     2.97ms
                                     (0 taps)  (8 taps)  (16 taps)  (32 taps)  (64 taps)  (96 taps)  (128 taps)
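The Pörschmann row of Table 4 is consistent with a fixed 0.3 ms baseline (the nominal MRP-to-ERP propagation delay) plus the tap delay of a 48 kHz DSP. A quick sketch of that conversion (hypothetical helper written for illustration, assuming exactly this interpretation of the taps):

```python
def taps_to_delay_ms(taps: int, fs_hz: int = 48000, base_ms: float = 0.3) -> float:
    """Total Direct AC delay in milliseconds: an assumed 0.3 ms
    baseline plus taps/fs of added digital delay."""
    return base_ms + 1000.0 * taps / fs_hz

# Reproduces the bottom row of Table 4 when rounded to 2 decimals.
for taps in (0, 8, 16, 32, 64, 96, 128):
    print(taps, round(taps_to_delay_ms(taps), 2))
```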
The same test was repeated 6 times for each subject, yielding 144 data points per subject. All results were combined and compared with Pörschmann's published results, as shown in Figure 25.
Figure 25 Subjective evaluation of naturalness of delay time in Direct AC auralization. Mean scores of all subjects with 95% confidence intervals.
The comparison shows that the current evaluation exhibits a preference trend similar to that observed in Pörschmann's results: subjects tended to prefer lower delay times. The current experiment included one additional delay setting (0.14 ms), which is less than the nominal MRP-to-ERP propagation delay of 0.3 ms. Interestingly, the most preferred delay in the current setup was 0.14 ms. This might result from the difference in the definition of MRP in the current research (MRP80, 80 mm from the lips) as compared to Pörschmann's (MRP40, 40 mm from the lips).
2.7.2 Evaluation on Naturalness of CIL Filter Level
This test aimed to find the sound level of the CIL-filtered signal at the headphones that achieves the best realism. The tests were conducted with 13 subjects following the procedures stated in section 2.5, using the auralization setup described in section 2.1.
Six levels were under test: -Inf, -6 dB, -3 dB, 0 dB, +3 dB and +6 dB. Here 0 dB indicates unity gain of the CIL filter, and -Inf refers to the absence of the filter compensation path in the auralization. Positive dB values represent a gain in the filtered compensation, whereas negative values represent an attenuation.
During the test, subjects were exposed to a random sequence of sound levels comprising 3 repetitions of the 6 settings. The random sequence was generated in MATLAB individually for each test set. For each setting, subjects were asked to read the given text in full twice, once with headphones and once without. Subjects were allowed to repeat reading and listening until they were ready to compare the sound of their own voice with headphones against that without headphones (the reference). The subjects then rated the degree of similarity on a 7-point category scale from 1 to 7 (1 being very dissimilar and 7 being very similar) for the given setting and notified the experimenter via the microphone before changing to the next CIL-filter level setting.
The same test was repeated 6 times for each subject, yielding 108 data points per subject. All results were combined and compared with Pörschmann's published results, as shown in Figure 26.
The results show that the current evaluation has an overall lower score in all settings but displays a trend similar to that observed in Pörschmann's setup. Subjects agreed that the nominal level setting for the CIL filter is most natural, confirming the successful implementation.
Figure 26 Subjective evaluation of naturalness of CIL filter level in Direct AC auralization. Mean scores of all subjects with 95% confidence intervals.
2.8 Discussion
Overall, the current auralization setup gave satisfactory results in the evaluation of Direct AC naturalness. The CIL filter level was determined to be at unity gain, and the delay setting was chosen to be 0.01 ms on the DN-716 (resulting in 0.14 ms of total delay).
3. Subjective Preference Tests on Stage Acoustic Conditions for Actors
3.1 Introduction
The focus of the current study is voice stage support in proscenium theaters. This type of theater design creates a special acoustical situation: a proscenium theater is characterized by the separation of a stage house from the main auditorium by a large opening called the proscenium. The two acoustic volumes (stage and auditorium) can have very different acoustical properties depending on the design. In general, the stage house has a high ceiling for counterweight fly systems, and thick curtains hang above and beside the stage area to mask off-stage areas, lighting instruments and unused scenery pieces. The stage is often not specifically designed for acoustical purposes, whereas the auditorium is usually designed to optimize the audience's aural experience.
The interaction of these two volumes is known in architectural acoustics as the coupled-space phenomenon. Considering actors' locations and movements on stage, they are mostly within the stage volume. As they move closer to the audience and reach the proscenium or the forestage, they enter the aperture of the coupled spaces where the stage volume and the auditorium meet. Unlike an audience member, actors are constantly moving and turning on stage while acting, so their experience of the sound field changes drastically from moment to moment. As a first step of research in this field, the current study investigates acoustic conditions on stage under the assumption that the actor is not moving. Actors sometimes speak from a fairly fixed location for a sustained period, for instance when delivering monologues. The most common positions are the center of the forestage and the central area of the main stage. These two positions were studied in this research, and two more positions off to the side were chosen as well. Ideally, more positions would be studied, but the limiting factor was the time required for each subject to go through all acoustic conditions without vocal fatigue.
Figure 27 Architectural plan of the main space at the RPI Playhouse. Dimensions in inches; blue lines are dimensional guides. (CAD drawing courtesy of RPI Building Management)
3.2 Impulse Response Acquisition

Binaural room impulse responses were collected when the playhouse was unoccupied. The auditorium of the 200-seat playhouse has an area of approximately 2650 sq. ft. (246.1 sq. m.) and a volume of around 39,750 cu. ft. (1125.5 cu. m.). The stage is located opposite the playhouse entrance, separated by a proscenium measuring 31.6 ft by 14 ft (9.6 m by 4.2 m). The stage level is 5 feet above the auditorium floor; the stage has an area of roughly 1260 sq. ft. (117 sq. m.) and a volume of 25,200 cu. ft. (713.6 cu. m.). Figure 27 shows the detailed dimensions. All seats were removed and the stage was cleared during the measurement sessions. The stage was set to a standard configuration of masking flats and borders hung above the stage.
Measurements were taken at four stage locations using the setup described in section 2.4. The relative positions of the stage locations are shown in Figure 28. The acronyms used stand for down stage center (DSC), down stage right (DSR), center stage center (CSC) and center stage right (CSR). It should be noted that "stage right" is defined from the actor's perspective when he or she is facing the audience. The four stage locations were 3 meters apart from each other.
Figure 28 Stage locations where BRIRs were measured. The dashed line labeled "CL" is the center line of the stage across the proscenium.
At each stage position, the HATS was adjusted manually to three different head orientations with the aid of a rotation-angle guide marked on the top of the HATS. The relative angles of the head rotations are shown in Figure 29. The height of the HATS was
maintained at 5 feet 7 inches (about 1.7 meters) above the stage floor, as described in section 2.4.
Figure 29 Top view of the HATS showing 3 different head orientations in binaural room impulse
response acquisition.
Because of the background noise level at the playhouse, a 21.8-second pink swept-sine excitation signal was used to achieve a better signal-to-noise ratio (SNR). The average SNR across all measurements was 52 dB. Three averages were taken for each measurement.
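Synchronous averaging of repeated measurements improves SNR by 10·log10(N) dB when the noise is uncorrelated between repeats; with the three averages used here that is roughly 4.8 dB on top of the sweep's own noise rejection. A small check of this figure (illustrative Python; the uncorrelated-noise assumption is stated, not measured):

```python
import math

def averaging_snr_gain_db(n_averages: int) -> float:
    """SNR gain from synchronously averaging n measurements with
    uncorrelated noise: the signal sums coherently while the noise
    power sums incoherently, for a net 10*log10(n) dB improvement."""
    return 10.0 * math.log10(n_averages)

print(round(averaging_snr_gain_db(3), 1))  # about 4.8 dB
```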
3.3 Subjective Test Design

3.3.1 Preference ratings from paired comparison

In section 2.7, a 7-category rating method was chosen in order to generate results compatible with previously published data. Such absolute ratings, however, suffer from judgment uncertainty between subjects. Paired comparison helps reduce both absolute and relative judgment errors. In this chapter, different combinations of acoustic conditions were presented in pairs to the subjects in randomized order. In each of the paired comparison tests in this chapter, there were four acoustic conditions, giving six pair combinations: A-B, A-C, A-D, B-C, B-D and C-D. Each pair was presented twice, once in forward order and once in reversed order, giving a total of 12 randomized pair-conditions for each subject in each test.
(The author first proposed 4 repetitions of each pair, resulting in 24 test conditions, but test subjects reported vocal fatigue and loss of concentration after a certain period of time (usually 30 minutes). Thus, in order to balance the reliability of the test results against the subjects' comfort, the count was reduced to 12 pair-conditions in the stage preference testing.)
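The randomized sequence just described (every unordered pair twice, once per presentation order, shuffled per test set) was generated in MATLAB; an equivalent sketch in Python, for illustration only:

```python
import itertools
import random

def paired_comparison_sequence(conditions, seed=None):
    """All unordered pairs of conditions, each appearing once in
    forward and once in reversed presentation order, shuffled."""
    pairs = []
    for a, b in itertools.combinations(conditions, 2):
        pairs.append((a, b))  # forward order, e.g. A-B
        pairs.append((b, a))  # reversed order, e.g. B-A
    random.Random(seed).shuffle(pairs)
    return pairs

seq = paired_comparison_sequence(["A", "B", "C", "D"], seed=1)
print(len(seq))  # 12 pair-conditions, matching the test design
```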
3.3.2 Test procedures
In each comparison, the pair of (Indirect AC) conditions was pre-loaded into preset A and preset B in the IR-1 (see Figure 10). The subjects could compare the pair by asking the experimenter to A/B-swap the presets, and they were allowed to spend as much time as they wanted in each preset (condition). After auditioning both, the subjects were asked, "Which did you prefer?" The preferred condition was scored as +1 and the other as -1. Preference scores were summed and normalized: a score of 1.0 indicates complete unanimity of preference for the acoustic condition, 0.0 means an equal number of positive and negative scores, and -1.0 means complete agreement on a negative preference.
After each complete set of paired comparisons, the subjects were allowed to rest for 5 minutes before the next set of tests began. All other procedures follow section 2.5.
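The scoring scheme above reduces to a normalized mean of the ±1 votes; a minimal sketch (illustrative Python):

```python
def normalized_preference(votes):
    """Normalize +1/-1 preference votes for one condition to [-1, 1]:
    1.0 = unanimous preference, 0.0 = evenly split,
    -1.0 = unanimous negative preference."""
    return sum(votes) / len(votes)

print(normalized_preference([+1, +1, -1, +1]))  # 0.5
print(normalized_preference([-1, -1]))          # -1.0
```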
3.4 Paired Comparison Test on Stage Locations
The paired comparison tests were divided into three groups, each representing one head orientation: "look center", "look left" and "look right". Four stage locations were studied in each group, and the preference scores were obtained and analyzed.
3.4.1 Preference study of stage locations when head orientation is "look center"

3.4.1.1 Result

The preference scores for each subject were averaged over the 6 test sessions. Scores were obtained for 4 conditions (A, B, C and D) corresponding to the 4 stage locations (DSC, DSR, CSC and CSR) respectively. The results for all 13 subjects are plotted individually in Figure 30, together with the average of all subjects' scores in each condition.
The results show agreement across all subjects on the preference for Condition A (DSC, down stage center) and Condition C (CSC, center stage center), with Condition A slightly more preferred than Condition C in the overall average. Another point of agreement across all subjects is the negative preference for Condition D (CSR).
Figure 30 Preference scores for different stage locations when head orientation is "look center" (Conditions - A: DSC, B: DSR, C: CSC, D: CSR). Normalized scores of 13 individual subjects A-M (blue bars) and overall average score of all subjects (red bars)
3.4.1.2 Analysis

In interviews after the experiment, the test subjects' common feedback concerned the lateralization of the sound decay and the change in the sense of envelopment. Some subjects asked during the test sessions whether the headphone volume was unbalanced between the two channels. Subjects expressed an inclination toward spatially balanced and enveloping sound fields. The interaural cross-correlation function (IACF) is commonly used in analyzing the impact of side reflections and the subjective preference for room width. The IACF can also be used to visualize the lateralization of a running sound source.
The IACF is defined as:

\[
\mathrm{IACF}_t(\tau) = \frac{\int_{t_1}^{t_2} p_L(t)\, p_R(t+\tau)\, dt}{\left[\int_{t_1}^{t_2} p_L^2(t)\, dt \,\int_{t_1}^{t_2} p_R^2(t)\, dt\right]^{1/2}}
\]
where $p_L$ and $p_R$ refer to the sound pressures at the entrances to the left and right ear canals. The maximum possible value of the IACF is one, reached when both signals are identical. The variable $\tau$ accounts for
the time difference between the two ears and is varied over a range from -1 to +1 ms relative to the first arrival [29].
The IACF for each condition was calculated in 100-ms intervals and plotted in Figure
31.
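For reference, the IACF defined above can be evaluated at integer-sample lags as follows (a minimal pure-Python sketch; the windowing into 100-ms intervals used for Figure 31 is omitted):

```python
import math

def iacf(p_left, p_right, fs, max_lag_ms=1.0):
    """Normalized interaural cross-correlation at integer-sample lags
    within +/- max_lag_ms, following the definition in the text.
    p_left and p_right are equal-length pressure sample sequences."""
    n = len(p_left)
    denom = math.sqrt(sum(x * x for x in p_left) *
                      sum(x * x for x in p_right))
    max_lag = int(round(max_lag_ms * 1e-3 * fs))
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            num = sum(p_left[i] * p_right[i + lag] for i in range(n - lag))
        else:
            num = sum(p_left[i - lag] * p_right[i] for i in range(n + lag))
        result[lag] = num / denom
    return result

# Identical left/right signals are perfectly correlated at zero lag.
r = iacf([1.0, 0.5, -0.2], [1.0, 0.5, -0.2], fs=48000, max_lag_ms=0.0)
print(r[0])  # 1.0
```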
Figure 31 Interaural cross-correlation functions in 100-ms intervals for conditions A-D when head orientation is "look center".
In the IACF plots above, the spike at time = 0 ms indicates the highest correlation between the two ears at the onset of the impulse response. As the sound decays, the correlation rapidly drops to around zero, suggesting that the early reverberant field (0 to 400 ms) was fairly diffuse in all four conditions. The correlation rises toward the late reverberant field (after 400 ms). It should be noted that this rise in the IACF does not produce a prominent peak across the binaural sound field. The rise is apparent in both
Conditions A and C, which suggests a correlation between the subjective preference scores and the behavior of the late reverberant field. The late reverberation in the playhouse was believed to be contributed by reflections from the back wall of the auditorium. The absence of a peak in the late IACF implies diffuseness of the late reflections and a sense of envelopment.
Furthermore, the energy decay was examined. Because, in acquiring the impulse response, the sound source (artificial mouth) and the receivers (binaural ears) were in extremely close proximity, conventional reverberation time calculations cannot be applied directly to describe the subjective sensation of the decay. It is also unknown how humans perceive the reverberation time of the same acoustic space differently when listening to their own voice versus other sound sources. As a result, an alternative parameter, Voice Stage Support (VSS), was proposed to analyze the energy decay:
\[
\mathrm{VSS}_{t_i} = 10 \log_{10} \frac{\int_{t_i-90}^{t_i+10} p^2(t)\, dt}{\int_{0}^{10} p^2(t)\, dt}
\]

where $t_i$ marks successive 100-ms intervals (times in milliseconds).
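A direct implementation of the proposed VSS parameter might look like the following (a minimal pure-Python sketch under the definition above; the sample-rate handling and window-edge conventions are assumptions):

```python
import math

def band_energy(p, fs, t0_ms, t1_ms):
    """Energy of the pressure signal p between t0_ms and t1_ms."""
    i0 = max(0, int(round(t0_ms * 1e-3 * fs)))
    i1 = min(len(p), int(round(t1_ms * 1e-3 * fs)))
    return sum(x * x for x in p[i0:i1])

def vss_db(p, fs, t_i_ms):
    """Voice Stage Support at time t_i (ms): energy in the 100-ms
    window [t_i - 90, t_i + 10] ms relative to the direct-sound
    (first 10 ms) reference energy, expressed in dB."""
    ref = band_energy(p, fs, 0.0, 10.0)
    win = band_energy(p, fs, t_i_ms - 90.0, t_i_ms + 10.0)
    return 10.0 * math.log10(win / ref)

# Toy check: a constant signal has 10x the reference energy in any
# full 100-ms window, i.e. VSS = +10 dB.
fs = 1000
p = [1.0] * fs
print(round(vss_db(p, fs, 100.0), 1))  # 10.0
```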
Since the energy of the direct sound from the artificial mouth to the ears is constant, it is taken as the reference of initial energy. The energy ratio was calculated for every 100-ms interval after the initial 10 ms. The results of VSS of