
    Investigation of Voice Stage Support: Subjective Preference Test

    Using an Auralization System for Self-Voice

    by

    Cheuk Wa Yuen

    A Thesis Submitted to the Graduate

    Faculty of Rensselaer Polytechnic Institute

    in Partial Fulfillment of the

    Requirements for the degree of

    MASTER OF SCIENCE IN BUILDING SCIENCES,

    CONCENTRATION IN ARCHITECTURAL ACOUSTICS

    Approved:

_________________________________________
Professor Paul T. Calamia, Thesis Adviser

_________________________________________
Professor Ning Xiang, Ph.D.

Rensselaer Polytechnic Institute
Troy, New York

June, 2007
(For Graduation August 2007)


    Copyright 2007

    by

    Cheuk Wa Yuen

    All Rights Reserved


    CONTENTS

    LIST OF TABLES.........................................................................................................v

    LIST OF FIGURES...................................................................................................... vi

    ACKNOWLEDGMENT .............................................................................................. ix

    ABSTRACT................................................................................................................. xi

    1. Introduction..............................................................................................................1

    1.1 Aim of the Thesis.............................................................................................3

    1.2 Historical Review ............................................................................................5

    1.2.1 Stage Acoustics and Support................................................................5

1.2.2 Previous Research on Subjective Preferences in Stage Acoustics .....................7

    1.2.3 Self-Voice Perception..........................................................................7

1.2.4 Experimental Setup in Previous Self-voice Auralization and Related Sound Field Simulation......................................................................10

    1.3 Thesis Outline................................................................................................12

    2. Self-voice Auralization System: Design and Implementation..................................13

    2.1 Experimental Design Concept ........................................................................13

    2.2 Measurement Setup........................................................................................13

    2.3 Binaural Real-time Auralization System ........................................................15

    2.3.1 System Overview...............................................................................15

    2.3.2 Implementation of Direct Air Conduction Modeling ..........................19

    2.3.3 Implementation of Indirect Air Conduction Modeling........................23

    2.3.4 Implementation of Headphone Equalization.......................................25

    2.4 BRIR Acquisition System ..............................................................................26

    2.5 Experimental Procedures in Subjective Test...................................................30

    2.5.1 Test Subject Conditioning..................................................................30

    2.5.2 Use of Dramatic Text in Study of Actors............................................30

    2.5.3 Verifying the Consistency of Self-voice Stimuli by Monitoring the

    Pace of Speech...................................................................................31


    2.6 HATS Verification Tests................................................................................34

    2.6.1 Binaural Microphones........................................................................35

    2.6.2 Artificial Mouth.................................................................................36

    2.7 Subjective Test on Naturalness of the Auralization System ............................41

2.7.1 Evaluation of Naturalness of CIL Filter Delay Time..........................42

2.7.2 Evaluation of Naturalness of CIL Filter Level ...................................45

    2.8 Discussion .....................................................................................................46

    3. Subjective Preference Tests on Stage Acoustic Conditions for Actors.....................47

    3.1 Introduction ...................................................................................................47

    3.2 Impulse Response Acquisition .......................................................................48

    3.3 Subjective Test Design...................................................................................50

    3.3.1 Preference ratings from paired comparison ........................................50

    3.3.2 Test procedures..................................................................................51

    3.4 Paired Comparison Test on Stage Locations...................................................51

3.4.1 Preference study of stage locations when head orientation is "look center" ......52

3.4.2 Preference study of stage locations when head orientation is "look left" ..........59

3.4.3 Preference study of stage locations when head orientation is "look right" ........65

3.4.4 Discussion on Stage Location Preference...........................................70

    4. DISCUSSIONS ......................................................................................................71

    4.1 Reflections on subjective preferences on stage acoustic conditions.................71

    4.2 Accuracy in subjective testing........................................................................71

    4.3 Potential ways of improving voice stage support in proscenium theaters ........72


    LIST OF TABLES

    Table 1. Direct AC Filter Settings (Rane PE-15)...........................................................20

    Table 2. CIL filter settings (digital parametric equalizer on 02R)..................................23

Table 3. Headphone compensation filter settings (02R master output) ..........................26

Table 4. Delay time in evaluation test of naturalness of Direct AC insertion loss compensation filter (CIL filter). *The current system has a processing delay of 0.14ms with a setting of 0.01ms on the DN716. **Pörschmann's tested delay times were based on taps of a 48kHz Tucker-Davis DSP system, and are represented here in milliseconds for convenience of comparison...............................................43


    LIST OF FIGURES

    Figure 1. Components of perception of self-voice...........................................................9

    Figure 2 Earthwork M30 omni-directional microphone.................................................14

Figure 3 Countryman B3 miniature omni-directional microphone.................................14

Figure 4 Transfer function from Earthwork M30 to Countryman B3.............................15

    Figure 5 Binaural self-voice auralization system block diagram....................................16

    Figure 6 Test subject with microphone and pop filter....................................................17

    Figure 7 MRP-to-ERP Transfer Function & Direct AC Filter using PE-15 parametric

    equalizer (1/3 octave smoothing) ..........................................................................20

    Figure 8 Setup for measuring the headphone's insertion loss using an isolation tube. ....21

    Figure 9 Isolation tube used in insertion loss measurement ...........................................22

    Figure 10 Waves IR1 Convolution Reverb, loaded with a 3-second unit sample sequence

    .............................................................................................................................24

    Figure 11 Impulse response trimming before importing to IR-1....................................24

    Figure 12 Headphone response and compensation filter (02R master output)................25

    Figure 13 Frontal plane section of HATS, showing binaural microphones and the related

fittings. (Adapted from manufacturer's manual)....................................................27

    Figure 14 Median plane section of HATS, showing the artificial mouth .......................28

    Figure 15 Binaural room impulse response acquisition system block diagram, showing

    how the binaural ears and artificial mouth of the HATS are connected..................29

Figure 16 Example plot of effective duration of the running autocorrelation function of two recordings of the same text at different paces, showing the use of τe min as a temporal reference for monitoring subjective testing............................................................33

    Figure 17 Verification Test of HATS in anechoic chamber at General Electric

    Laboratory (NY)...................................................................................................34

Figure 18 Frequency response comparison between HATS binaural microphones ........35

Figure 19 Voice directivity of HATS (a) horizontal plane (b) vertical plane, in 4 octave bands (250Hz, 500Hz, 1kHz & 2kHz)....................................................................36

Figure 20 Comparison of voice directivity, in 3 octave bands (500Hz, 1kHz & 2kHz)..37

    Figure 21 On-axis frequency response of HATS artificial mouth. Overlays of no-

    smoothing and 1/6-octave smoothing...............................................................38


    Figure 22 Frequency Response of B&K Artificial Mouth Type 4128C [adapted from

manufacturer's datasheet] .....................................................................................39

    Figure 23 MRP-to-ERP transfer function of HATS (1/3-octave smoothing)..................40

    Figure 24 Averaged frequency response of MRP-to-ERP (Direct AC) of 18 human

subjects. The grey area marks the standard deviation. (Adapted from Pörschmann

    [2000])..................................................................................................................41

    Figure 25 Subjective evaluation of naturalness of delay time in Direct AC auralization.

    Mean score and error rate of all subjects (95% confidence)...................................44

    Figure 26 Architectural plan of the main space at the RPI Playhouse. Dimension unit in

    inches. Blue lines are dimensional guides. (CAD drawing courtesy of RPI Building

    Management)........................................................................................................48

    Figure 27 Stage locations where BRIR was measured. Dashed line labeled "CL" is the

    center line of the stage across the proscenium. ......................................................49

    Figure 28 Top view of the HATS showing 3 different head orientations in binaural room

    impulse response acquisition.................................................................................50

    Figure 29 Preference score of different stage locations when head orientation is "look

    center" (Conditions - A: DSC, B: DSR, C:CSC, D: CSR). Normalized scores of 13

    individual subjects A-M (blue bar graph) and overall average score of all subjects

    (red bar graph)......................................................................................................55

    Figure 30 Interaural cross-correlation functions in 100ms-intervals for conditions A-D

    when head orientation is "look center". .................................................................56

    Figure 31 VSS plot of binaural ears in four conditions (A: DSC, B: DSR, C: CSC, D:

    CSR) when head orientation is "look center".........................................................58

    Figure 32 Preference score of different stage locations when head orientation is "look

    left" (Conditions - A: DSC, B: DSR, C: CSC, D: CSR). Normalized scores of 13

    individual subjects A-M (blue bar graph) and overall average score of all subjects

    (red bar graph)......................................................................................................62

    Figure 33 Interaural cross-correlation functions in 100ms-intervals for conditions A-D

    when head orientation is "look left". .....................................................................63

    Figure 34 VSS plot of binaural ears in four conditions (A: DSC, B: DSR, C: CSC, D:

CSR) when head orientation is "look left"............................................................64


    ACKNOWLEDGMENT

I am grateful to Professor Paul Calamia for his willingness to share his knowledge and wisdom. I would also like to thank Dr. Ning Xiang for his insights and meticulous training in laboratory work and research, and Dr. Jonas Braasch for an enjoyable class in psychoacoustics.

This research would not have been possible without the help of Mr. David Larson, who generously lent me the Brüel & Kjær head and torso simulator. My gratitude also goes to Mr. Bob Hedeen at the General Electric Laboratory (NY) for letting me use the anechoic chamber for numerous acoustical measurements.

Thunderous applause goes to all the actors who participated in this research. Speaking in an isolated environment, without interaction with other actors or an audience, was the most difficult experience for these artists. Your patience and concentration were professional. Without your support, there would be no study of voice stage support.

My study in the United States was made possible by the support of the prestigious Sir Edward Youde Memorial Fund Fellowship in Hong Kong. I hereby send my dedication to the late Sir Edward Youde. His wife Pamela Youde's continuous encouragement means a lot to me. Respect also goes to all officers of the fellowship council, especially Ms. Carnelia Fung.

I sincerely thank all my mentors and the incredible educators from whom I learned so much throughout my years at Rensselaer Polytechnic Institute, the California Institute of the Arts and the Hong Kong Academy for Performing Arts.

Last but not least, I thank my family for their continuous support. This thesis is dedicated to my parents, my late grandmother, Fung Yau Hau, and my brother Cheuk Chi Yuen, who is recovering from a speech disorder after a stroke in summer 2006. His speech therapy sessions of repetitive reading are an ironic counterpoint to my research.


    Not everything that can be counted counts, and not everything that counts

    can be counted. [4]

    Albert Einstein


    ABSTRACT

    The human voice plays an integral role in dramatic art. The performance of singers and

    actors, who perceive their voice through their ears as well as bone conduction, is highly

related to the acoustic condition they are in. Due to the proximity of the sound source and the spectral difference in the transmission through the skull as compared to air, a

    support condition different from that for musical instrumentalists is needed. This paper

    aims at initiating a standardization of methodology in subjective preference testing for

    voice stage support in order to collect more data for statistical analysis. A proposal of an

    acquisition/auralization system for self-voice and a set of subjective test procedures are

    presented. The subjective evaluation of the system is compared to previous designs re-

    ported in the literature, and the implementation is validated. A small playhouse has been

    measured and auralized using the system described, and subjective-preference tests have

    been conducted with 13 professionally trained actors. Their preferred stage-acoustic

    conditions (in relation to locations on stage and head orientations) are reported. The re-

    sults show potential directions for further investigations and identify the necessary con-

    cerns in developing an objective parameter for voice stage support.


1. Introduction

In the course of theater history, from classical Greek drama to Shakespearean plays to Ibsen's naturalistic plays to 20th-century Broadway rock musicals, the human voice has always been an integral part of the dramatic art. The success of this art largely relies on how well the audience understands the words voiced by the actors. This rule has not changed for more than 2,300 years since the days of the Lycurgian Theater of Dionysus in Greece (the first great permanent theater in recorded history) [1].

While most contemporary architects and acousticians focus on auditorium acoustics in the design of performance spaces, the special acoustical needs of musicians, singers and actors often receive less emphasis. Although acoustic shells have been developed for concert halls and achieved a certain degree of success, opera houses and theaters do not receive the same attention. Stage performers are left to adapt to the acoustics of the space as best they can. [2] In many cases, performers find it difficult to hear themselves or each other intelligibly and thus cannot achieve their best tonality; in extreme cases, they fail to attain pitch accuracy and coherence individually or in the ensemble. This may result in a less-than-satisfactory performance. Performer-audience communication is then not achieved, which ultimately affects how the audience rates the performance and possibly the acoustics of the venue. It is strongly suggested that stage acoustics demands as much attention as auditorium acoustics. It is logical that optimal stage acoustics is fundamental to a good overall rating of the acoustics in a performance space. (Visual appeal also plays a role in audiences' rating of the acoustics, but it is outside the scope of this thesis.)

There is currently no parameter in international standards quantifying stage acoustics. Among all acoustical parameters widely used in the industry, only one is generally accepted as a means of quantifying the ease of listening and performing on stage: Support (ST1, ST2), first proposed by A.C. Gade in 1989, which is intended to measure the contribution of early reflections to the sound from the musician's own instrument. [3]


Gade's proposal is, however, limited to instrumentalists. For singers and actors (or "voice performers", the term used consistently in the rest of this thesis), whose instrument is the human voice, Support cannot be applied directly because of the influence of bone conduction in the perception of self-voice. Moreover, Support does not address the frequency spectrum and directions of early reflections, nor the directions of late reverberation, which might be determining factors as well. The practicality of using a single parameter to represent voice performers' preferred stage acoustic conditions remains uncertain.

But one thing is clear: whether the goal is to propose a new acoustical parameter or to validate the effectiveness of a stage acoustics design, subjective preference testing is the only viable means of solving the problem. Every human being is unique, and our preferences for a given acoustic condition are highly subjective and may vary enormously. The preferred stage acoustic condition depends on one's own voice quality and auditory behavior.

A new study in stage acoustics, called Voice Stage Support (VSS), is proposed to investigate auditory feedback on stage for professional voice performers. It thus excludes normal speech communication among the general public. It is not the objective of this thesis to comprehensively define VSS or devise a new parameter experimentally. It is rather an initiation of groundwork to promote the study of this uniquely different field in opera house and theater acoustics, which involves acoustical design, psychoacoustics and performance psychology.

One may argue that generalizing auditory preferences for the entire human population is tremendously difficult, if not impossible; it remains a challenge for acousticians.

    From error to error, one discovers the entire truth.

    - Sigmund Freud


1.1 Aim of the Thesis

As discussed above, in order to study Voice Stage Support (VSS), subjective preference tests are inevitable. There are some prominent difficulties in conducting such tests.

In statistical analysis, the key to success is to have a large number of samples from the population. Thus, it takes a long time for any single researcher to acquire enough data for analysis and reach a convincing conclusion. This is particularly difficult for VSS because professional voice performers constitute only a very small portion of the human population. It will take the efforts of numerous studies before any comprehensive theory of VSS can be established. The more subjective data collected, the better the development of the study.

Unlike most auditory experiments, which involve external stimuli (sound sources outside one's body), the perception of self-voice strictly requires one's own vocalization to generate the sound stimuli for the test. This demands real-time auralization of auditory scenes and precludes the use of pre-recorded and pre-processed test stimuli.

The demand for real-time auralization implies a system capable of low propagation delay (or processing delay). Previous acoustical studies of this kind often required specialized digital signal processing (DSP) equipment, which means very few facilities are equipped to repeat such tests and generate compatible results. However, there are now alternatives, since digital audio signal processing has become more widely available and affordable in the professional audio industry.

The argument here is that an easily obtainable, reproducible and repeatable setup for real-time auralization of self-voice would greatly promote this field of study by enabling more researchers who have access to professional singers and actors to conduct such tests, thus enlarging the sample base in the aggregation of compatible data for long-term statistical analysis.


The first objective of this thesis is to verify the reliability of a more accessible real-time auralization setup, compared to previous experimental systems found in the literature. It is also a step toward standardizing psychoacoustical experimental procedures involving real-time self-voice auralization and the respective data acquisition, both in terms of hardware setup and subject conditioning, for the study of voice stage support. This thesis includes a subjective evaluation of the stage in a 200-seat playhouse. Various stage locations and head orientations are compared using the proposed auralization setup and procedures.


1.2 Historical Review

This section briefly covers the issues related to this thesis. It first summarizes the field of stage acoustics and support for performers, followed by the differences and difficulties of support for voice performers as compared to musicians (sections 1.2.1 & 1.2.2). The psychophysics of self-voice perception is introduced in section 1.2.3. Previous subjective preference tests and their auralization setups are reviewed in section 1.2.4.

1.2.1 Stage Acoustics and Support

Stage acoustics can be defined as the study of the acoustic conditions where performers are located during a performance. On many occasions, performers are located in a stage house or stage volume which is spatially distinct (yet not isolated) from the physical volume of the auditorium. This is particularly obvious in proscenium theaters and opera houses. In other settings, such as theater-in-the-round or multi-functional/modular theaters, the separation between the stage and auditorium acoustic spaces is less distinct and the two may overlap. Whatever the setting, performers demand a certain condition so that they can perform comfortably.

Stage support usually refers to the amount of auditory feedback from one's own instrument, which enables performers to hear themselves with ease so that they do not need to force the instrument to develop the tone. In A.C. Gade's pioneering work [5], it is translated into an objective parameter, SUPPORT, which includes three measures of energy ratios (ST1, ST2 and STlate) in the sound field.

$$\mathrm{ST1} = 10\log_{10}\frac{E(20,100\,\mathrm{ms})}{E(0,10\,\mathrm{ms})}$$

$$\mathrm{ST2} = 10\log_{10}\frac{E(20,200\,\mathrm{ms})}{E(0,10\,\mathrm{ms})}$$

$$\mathrm{ST_{late}} = 10\log_{10}\frac{E(100,\infty\,\mathrm{ms})}{E(0,10\,\mathrm{ms})}$$


After a few applications and analyses, they were later revised by Gade [6] as:

$$\mathrm{ST_{early}} = 10\log_{10}\frac{E(20,100\,\mathrm{ms})}{E(0,10\,\mathrm{ms})}$$

$$\mathrm{ST_{total}} = 10\log_{10}\frac{E(20,1000\,\mathrm{ms})}{E(0,10\,\mathrm{ms})}$$

$$\mathrm{ST_{late}} = 10\log_{10}\frac{E(100,1000\,\mathrm{ms})}{E(0,10\,\mathrm{ms})}$$

where E(t1, t2) stands for the time integral of the squared pressure signal of an impulse response between the time limits given in the parentheses. In the above definitions, t = 0 is the arrival time of the direct sound. Units are in dB.
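As a concrete illustration, the sketch below computes the revised parameters from a measured impulse response. It is a minimal sketch, not the measurement procedure used in this thesis: the array name `ir`, the sample rate `fs` and the helper `support_parameters` are illustrative, and the impulse response is assumed to start at the arrival of the direct sound (t = 0).

```python
import numpy as np

def support_parameters(ir, fs):
    """Gade's revised Support parameters from a mono impulse response `ir` at rate `fs`."""
    def energy(t1_ms, t2_ms):
        # Time integral of the squared pressure between the two limits (in ms).
        i1 = int(round(t1_ms * 1e-3 * fs))
        i2 = int(round(t2_ms * 1e-3 * fs))
        return np.sum(ir[i1:i2] ** 2)

    e_ref = energy(0, 10)  # direct-sound reference energy, E(0, 10 ms)
    return {
        "ST_early": 10 * np.log10(energy(20, 100) / e_ref),
        "ST_total": 10 * np.log10(energy(20, 1000) / e_ref),
        "ST_late":  10 * np.log10(energy(100, 1000) / e_ref),
    }
```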

SUPPORT has been applied in various studies of acoustics for performers [7][8][9], and generally agrees with performers' subjective preferences. Nevertheless, a few points attract our attention. Firstly, Gade's setup places the measurement microphone one meter (roughly the maximum distance between a player's ears and his or her instrument) in front of the sound source. Secondly, a single microphone is used in the measurement.

    Gade reported that STearly unexpectedly succeeded in describing the ease of hearing

    other musicians rather than its intended purpose [6]. Although the reliability has yet to

    be ascertained over a longer period of time, some acoustical consultants have been using

    it as a parametric guideline.

However closely related to performers' support, it is not applicable in the case of singers and actors, because the instrument concerned - the human voice - is in close proximity to the ears, and there exists a fundamental difference between the perception of self-voice and that of any other musical instrument.


1.2.2 Previous Research on Subjective Preferences in Stage Acoustics

All previous studies indicated a preference among musicians (both instrumentalists and singers) for early reflections in support of their performance. Marshall and Meyer, in 1985, reported that singers prefer a strong presence of reverberation, while early reflections were only weakly preferred. [10]

Noson, later in 2000, reported that singers preferred longer reflection delay times than musicians do, due to the masking effect of bone-conducted sound inside the head [11]. He also discovered that a melisma singing style (non-plosive, non-fricative syllables) resulted in a shift in the preferred delay time of reflections [12]. This indicated that subjective preference depends on the content of the sound source. In Noson's work, it was also shown that singers' subjective preference for the delay time of a single reflection is proportional to the minimum effective duration (τe min) of the running autocorrelation function (ACF) of the sound source. This is in direct agreement with Ando's previous research on audiences' subjective preferences in concert halls [13]. Ando's proposal is thus believed to be applicable to musicians and singers as well.

Noson's work strongly supported the view that the unique nature of self-voice perception is the most significant factor contributing to a different preference pattern for voice performers as compared to instrumentalists.

1.2.3 Self-Voice Perception

Perception of self-voice is constituted by air conduction (sound-wave propagation from mouth to ear) and bone conduction (vibrations from the voice organs to the ear inside the human head).

The air conduction path mainly comprises the diffraction of sound coming out of the mouth opening, across the surface of the head and into the ear canal. It also includes all transmission of vibrations of the vocal tissues from the surface of the head into the air,


    and back to the ear canal. However, this latter component is believed to be of negligible

    contribution to our hearing [14].

The role of bone conduction was not well understood until Georg von Békésy [15], in 1949, identified bone conduction and air conduction as the sound paths pertinent to perceiving one's self-voice. Estimates derived from his various investigations show that the perceived loudness of bone conduction is of the same order of magnitude as that of air conduction. According to a more contemporary study of bone conduction by Stenfelt and Goode [16], the bone conduction path can be divided into four components: (1) sound radiation into the ear canal, (2) inertial motion of the middle ear ossicles, (3) inertial motion of fluid in the cochlea, and (4) compression and expansion of the bone encapsulating the cochlea.

On most occasions, the natural human voice is heard in the presence of an acoustic environment. With the inclusion of an acoustic space, the air conduction path can be further divided into two: direct air conduction (mouth-to-ear) and indirect air conduction (specular reflections from boundaries in the acoustic environment).

    Hence, the paths constituting the perception of self-voice can be summarized as:

    Direct Air Conduction (Direct AC) - from mouth to ear

    Bone Conduction (BC) - through skull

    Indirect Air Conduction (Indirect AC) - reflections of voice off room boundaries

    *Direct AC, BC and Indirect AC are used, throughout this thesis, to denote the above

    auditory paths.

    Their relationship is represented graphically in a simplified fashion in Figure 1.


    Figure 1. Components of perception of self-voice

The spectral characteristics of the above pathways can be identified with human subjects. For Direct AC, they are usually obtained by measuring the transfer function between the sound pressures at microphones placed at the mouth reference point (MRP) and the ear reference point (ERP) in an anechoic environment, in which human subjects recite a selection of words effectively covering the vocal frequency range, as demonstrated by Pörschmann [17] as well as Williams and Barnes [18]. For BC, direct measurement cannot be applied. It is determined by measuring the masked threshold of pure tones or narrow-band noise while the air-conducted sound is removed (or highly attenuated) [17]. In general, the threshold increases as frequency rises. Nevertheless, in Stenfelt's research [19], it was found that sensitivity in loudness perception is higher in bone conduction than in air conduction. This trend becomes progressively more drastic as listening level increases, suggesting that the loudness contour of bone conduction differs from that of air conduction. (The air-conduction loudness contour refers to the Fletcher-Munson curves of 1933 [20].) To determine Indirect AC, a method similar to that for Direct AC can be used with human subjects. Binaural receivers can be fitted in the subject's ear canals. By


measuring the impulse responses between a microphone at the MRP and the binaural receivers in the ears, the transfer function of the room can be determined. The Indirect AC can then be isolated by properly removing the direct sound from the impulse response.

For Direct AC and BC, an average result can be collected from a group of subjects in the laboratory. For Indirect AC, however, this would require bringing a large number of subjects to each acoustic condition being examined (i.e., different concert halls and different stage positions), which is impractical in most cases. An alternative approach is discussed in Section 2.1 of this thesis.

1.2.4 Experimental Setup in Previous Self-voice Auralization and Related Sound Field Simulation

In the subjective evaluation of a sound field for singers or actors, owing to the use of self-voice as the sound stimulus, one must create an auralization setup capable of reproducing (1) the direct mouth-to-ear air conduction (if that path is obstructed by the reproduction system) and (2) the convolution product between the live signal and the impulse response of the sound field under test, all in real time.

At the source pickup end, there is no consistency among previous experiments. A microphone is usually placed in front of the mouth; however, the microphone type and microphone-to-mouth distance vary greatly between studies. In Marshall and Meyer's setup [10], a cardioid microphone is placed 0.5m away, pointing directly at the mouth; in Noson's setup [11], a small headset microphone (no polar pattern specified) is located 10cm in front of and 5cm below the mouth; and in Pörschmann's experiment [21], a Sennheiser KE4 omni-directional miniature microphone is positioned precisely at the mouth reference point (MRP40), 40mm in front of the lips, with a holding device attached to the headphones' harness.

At the sound reproduction end, there are generally two different approaches: (1) spatially distributed loudspeakers for reproduction of delayed reflections, as found in Marshall and Noson; and (2) open-back circumaural headphones with compensation filters for binaural sound field simulation, as found in Pörschmann.


The two reproduction approaches have their pros and cons. Using loudspeakers inherently creates a possible acoustic feedback path, meaning, for instance, that a delayed reflection is picked up by the microphone and further reflections are then regenerated through the system. This leads to unintended stimuli and ultimately affects the accuracy of the subjective test. The advantage is that subjects are free of any body-attached hardware. However, the loudspeaker system requires comparatively more space and is usually not portable.

The headphone system, on the contrary, is less demanding of laboratory space and is fairly portable and easy to set up. The disadvantage of using a headphone system is the need to implement a compensation filter for the direct mouth-to-ear air conduction path because, even when open-back circumaural headphones are used, the headphone enclosures inevitably impose a sound insertion loss between the mouth and the ears. High frequencies are usually attenuated. Moreover, the compensated (filtered) signal needs to be delayed before reaching the headphones so that it is in sync with the natural air conduction, to avoid a comb-filtering effect. Pörschmann [21] has shown that such a reproduction scheme succeeds in achieving a certain degree of naturalness in a virtual auditory environment. Another issue with headphones is the occlusion effect and the return of radiation by the human head. The occlusion effect refers to the accentuation of sensitivity at bass frequencies when the ear canal is obstructed. Details of the occlusion effect can be found in the literature by Tonndorf [22] and Dean [23]. Open-back headphones can minimize this effect and have been accepted and used in experiments, provided that there is enough padding between the headphone hardware and the test subject.


    1.3 Thesis Outline

In Chapter 2, the proposed self-voice auralization system dedicated to the investigation of voice stage support is introduced. It includes the binaural impulse response acquisition system, the binaural auralization system and the subjective preference test procedures. A validation test with human subjects, conducted to obtain subjective ratings of the naturalness of the system, is also reported. The results are compared to evaluations of setups in previous studies. The proposed system was used to investigate the stage acoustic conditions of a playhouse. Chapter 3 presents the results and analysis of actors' subjective preferences for various stage locations and head orientations. Chapter 4 discusses the experimental results, followed by suggested directions for future work in the field of voice stage support.


    2. Self-voice Auralization System: Design and Implementation

2.1 Experimental Design Concept

As discussed in section 1.2.3, using human subjects to obtain averaged Indirect AC data is impractical when many acoustic spaces and conditions have to be examined. Portability and repeatability were the first criteria of the current design. To achieve this, a dummy head was proposed to substitute for the human head in the measurement and acquisition process. An artificial mouth was used to represent the human voice source. Dunn and Farnsworth [24] showed that a person's own voice can be modeled by a source at the opening of the mouth. A similar approach has been taken and examined by Bozzoli, Viktorovitch and Farina [25].

    The design consists of three basic components:

    - Binaural Room Impulse Response (BRIR) Acquisition

    - Binaural Real-time Auralization

    - Experimental Procedures for Test Subjects

    Since the experimental design was logically driven by the implementation of aural-

    izing the conduction paths, the design of real-time auralization is first described (section

    2.3) followed by the BRIR acquisition (section 2.4) and experimental procedures (sec-

    tion 2.5).

    2.2 Measurement Setup

In this thesis, the acoustical measurement system was Electronic & Acoustic System Evaluation & Response Analysis (EASERA) v1.0.60 software running on a Pentium M-based PC, with a Sound Devices USB Pre (USB-powered audio interface) for audio input/output. The sampling rate was 48 kHz and the bit depth was 16. The excitation signal


was a pink sweep sine with 1 pre-send and 3 averages, as configured in EASERA, unless otherwise stated.

Two measurement microphones were used: Earthwork M30 and Countryman B3 omni-directional microphones (Figure 2 & Figure 3). The M30 was chosen for its high sound-pressure capability, whereas the B3 was chosen for its compact size, for measurement positions that the M30 cannot reach. The transfer function between them was first measured in order to compensate for their difference in frequency characteristics when they were used simultaneously. A pink sweep sine was reproduced by a Yamaha MSP5 2-way studio monitor loudspeaker at a distance of 1m in front of the microphones. The microphones' outputs were recorded using the EASERA setup described above, and the transfer function was then obtained for use as an equalization function in subsequent calculations. Figure 4 shows the transfer function between the microphones.
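A minimal sketch of the idea behind this equalization function is shown below, assuming `m30` and `b3` are the two simultaneously recorded responses to the same sweep at the same sample rate; the function name and the simple spectral division (without the smoothing or windowing a measurement package such as EASERA would apply) are illustrative assumptions.

```python
import numpy as np

def relative_transfer_function(m30, b3, eps=1e-12):
    # Frequency-domain ratio of the two simultaneous recordings gives the
    # M30-to-B3 transfer function; eps guards against division by near-zero bins.
    M30 = np.fft.rfft(m30)
    B3 = np.fft.rfft(b3)
    return B3 / (M30 + eps)

# Applying the inverse of this function to subsequent B3 measurements makes the
# two microphones' magnitude responses directly comparable.
```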

    Figure 2 Earthwork M30 omni-directional microphone

    Figure 3 Countryman B3 miniature omni-directional microphone


The loudspeaker used as the excitation source was an artificial mouth in a dummy head unless otherwise stated. The detailed structure of the dummy head is described in section 2.4.

All measurements were conducted in a hemi-anechoic chamber unless otherwise stated.

    Figure 4 Transfer function from Earthwork M30 to Countryman B3

2.3 Binaural Real-time Auralization System

2.3.1 System Overview

In this research, only Direct AC and Indirect AC need to be implemented. Since the human subject's own voice is used as the sound stimulus in real-time auralization, the bone conduction component is produced naturally inside the subject's head. The auralization


    system used a topology of two separate paths to model the direct air conduction (Direct

    AC) and indirect air conduction (Indirect AC). All auralizations were conducted in a

    hemi-anechoic chamber.

    Figure 5 Binaural self-voice auralization system block diagram

Figure 5 shows the system block diagram. The setup used an Earthwork M30 omni-directional microphone to pick up the subject's voice. It was positioned at the mouth reference point, MRP (80mm from the lips), separated from the subject's mouth by a metal-grille pop filter mounted 40mm from the diaphragm so as to eliminate microphone diaphragm excursion caused by plosive sounds. Figure 6 shows the relationship between the subject's mouth, the pop filter and the microphone. The microphone signal was split in two and connected to input channels 1 & 2 (Ch1 & Ch2) on a Yamaha 02R digital mixing console, with identical and repeatable gain settings using the step-gain control on the pre-amplifiers. The gain setting was optimized to achieve a peak at -10 dBFS using a microphone calibrator producing a 1kHz sine tone at 105 dBA. The resulting line-level signals were routed to two paths, Path 1 & Path 2, modeling the Direct AC and Indirect AC respectively.


    Figure 6 Test subject with microphone and pop filter.

(Path 1) Through the channel insert send before the A/D stage on Ch1, the pre-amplified signal was connected to a Klark Teknik DN-716 single-channel digital delay unit (with built-in 16-bit A/D & D/A conversion) cascaded with a Rane PE-15 4-band analog parametric equalizer. The analog output was returned to the channel insert return of Ch1, going into the A/D conversion stage on the 02R. The delay unit and parametric equalizer were used to model the mouth-to-ear propagation delay and transfer function respectively. Their implementations are further described in section 2.3.2. The 02R on-board digital equalizer on Ch1 was used as the compensation filter for the insertion loss introduced by the auralization headphones. Details are described in section 2.3.4.

(Path 2) Through Ch2, the signal was A/D converted and digitally routed, via an optical connection (TOSLINK), to a Digidesign TDM MIX digital audio workstation (a Motorola


DSP-based PCI mixing engine running in a dual-processor 500MHz Macintosh G4 computer) using a Digidesign ADAT Bridge digital interface. The workstation was running ProTools audio software with Waves IR1 (a dual-channel convolution reverb plug-in), through which a BRIR can be loaded and convolved with the incoming signal. The output (convolved) signal was returned digitally to the ADAT IN (TAPE IN 1) on the 02R mixer via the ADAT Bridge. The setup described was used to model the Indirect AC, also called the room response. The convolution implementation is further described in section 2.3.3.

Both returns from Path 1 and Path 2 were internally routed to the 02R's main stereo output in the digital domain.

    The stereo output of the 02R was connected to a Samson HP-5 headphone amplifier

    driving a pair of Audio-Technica ATH-A700 open-back headphones. A compensation

    filter was implemented using the on-board equalizer on the 02R stereo output channel to

    remedy the frequency anomalies induced by the headphones. It is described in section

2.3.4.

    The A/D & D/A conversions in the 02R and DN-716 are all 16-bit, 48kHz. Each conver-

    sion stage introduces a processing delay of 0.02ms. The processing delay of Path 1

    measured 0.14ms (A/D & D/A conversion and filter network in DN-716 and conversion

    stages in 02R) when DN-716 is at its lowest setting 0.01ms whereas the processing delay

    of Path 2 measured 11.74ms (A/D & D/A conversion plus latency of IR1 [11.6ms])

    while IR1 was engaged and loaded with a 3-second long unit-sample sequence. All lev-

    els were set at unity gain during the delay measurement.


    2.3.2 Implementation of Direct Air Conduction Modeling

In Path 1, which is designed to model the Direct AC, the MRP-to-ERP transfer function (measured in section 2.6.2.3) is approximated using the four-band parametric equalizer PE-15, the digital delay DN-716 and the internal equalizer on the 02R.

2.3.2.1 Determining the PE-15 filter setting

The frequency response of the MRP-to-ERP impulse response was approximated using an analog parametric filter. The precise filter settings were determined by overlaying the transfer function of the PE-15 against the MRP-to-ERP magnitude-spectrum plot. Using the Live mode in EASERA, the MRP-to-ERP plot was pre-loaded. Pink noise was fed to the PE-15 at line level and its output was connected directly back to EASERA to obtain a live magnitude-spectrum plot while adjusting the PE-15 settings. Figure 7 shows an overlay magnitude-spectrum plot of the MRP-to-ERP impulse response and the determined filter settings in the PE-15 (see Table 1). The plot was generated in Matlab.


    Figure 7 MRP-to-ERP Transfer Function & Direct AC Filter using PE-15 parametric equalizer (1/3

    octave smoothing)

    Table 1. Direct AC Filter Settings (Rane PE-15)

Direct AC Filter    Band 1   Band 2   Band 3   Output Level
Gain (dB)           +4.0     -5.5     -8.0     -18.0
Frequency (Hz)      90       800      7k       -
Q                   1.2      0.26     0.45     -


2.3.2.2 Determining the DN-716 delay time

The initial arrival time of the MRP-to-ERP impulse response was implemented using a digital delay line. The mean value of the MRP-to-ERP propagation delay in humans is 300 μs (or 0.3ms), as reported by Pörschmann [17]. Thus, by subtracting the processing delay of 0.14ms, the delay time to be inserted is 0.16ms (corresponding to a panel display of 0.17ms on the DN-716). The MRP-to-ERP transfer function measurement is described in section 2.6.2.3. Various delay times are evaluated in section 2.7.1.

    2.3.2.3 Determining the 02R parametric equalizer setting

    The headphones used in auralization introduced an insertion loss in the Direct AC path.

    As a result, Path 1 essentially functions as Direct AC modeling and compensation of in-

    sertion loss (CIL) induced by the headphones. The CIL filter was implemented using the

    digital parametric equalizer on Channel 1 in the 02R mixer.

Two microphones, the M30 and the B3, were first calibrated for identical gain and then used to measure the insertion loss of the headphones, as shown in Figure 8.

    Figure 8 Setup for measuring the headphone's insertion loss using an isolation tube.


    Figure 9 Isolation tube used in insertion loss measurement

An isolation tube (see Figure 9) was built to measure the insertion loss of the headphone. The tube was 300mm in length and 250mm in diameter. It had a 50mm-thick soft fiberglass outer shell with a thin layer of cotton lining the inner wall. The headphone was carefully mounted to the tube opening and sealed with rubber to close any air gaps. A Yamaha MSP5 2-way loudspeaker was used to generate the measurement signal while an M30 microphone was positioned close to the headphone enclosure outside the tube and a B3 microphone was mounted 10mm away from the headphone transducer inside the tube. The transfer function was recorded using EASERA. The inverted magnitude-spectrum plot represents the compensation filter.
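The sketch below illustrates the inversion step under simplified assumptions: `outside` and `inside` stand for time-aligned recordings at the two microphone positions (names are illustrative), and the insertion loss is taken as a simple magnitude ratio rather than the smoothed transfer function a measurement package such as EASERA would produce.

```python
import numpy as np

def cil_target_db(outside, inside, eps=1e-12):
    # Insertion loss of the headphone enclosure: reference level outside the
    # tube relative to the attenuated level behind the transducer inside it.
    O = np.abs(np.fft.rfft(outside)) + eps
    I = np.abs(np.fft.rfft(inside)) + eps
    # This inverted loss (in dB per frequency bin) is the gain the CIL filter
    # should approximate; positive values indicate a boost is needed.
    return 20 * np.log10(O / I)
```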

The internal digital parametric equalizer in the 02R was used to approximate the compensation filter response, using an adjustment method similar to that described above for the PE-15 (section 2.3.2.1). To ensure unity gain through the 02R during filter adjustment, a sine tone was fed from EASERA and split in two: one branch was routed back to EASERA channel 1 and the other was connected to the 02R and returned to EASERA channel 2. The CIL filter implemented in the 02R is shown in Table 2.


    Table 2. CIL filter settings (digital parametric equalizer on 02R)

CIL filter        Band 1   Band 2   Band 3   Band 4
Gain (dB)         +2.0     +3.0     -        -
Frequency (Hz)    4k       10k      -        -
Q                 0.2      0.1      -        -

    2.3.3 Implementation of Indirect Air Conduction Modeling

In Path 2, which modeled the Indirect AC, a real-time binaural convolution was applied using the Waves IR1 Convolution Reverb plug-in (see Figure 10). In order to time-align the paths correctly, the room impulse response to be convolved was first trimmed to eliminate the direct sound. The length of the trim was determined by the propagation delay in Path 2, which measured 11.74ms (see Figure 11). A Hann window was applied to the trimmed impulse response before importing it into IR-1. A shortcoming resulting from this latency is the inability to reproduce the room response between the Direct AC and the early sound field up to 11.74ms (approximately 12 feet of travel distance, on the order of twice the height of a 6-foot-tall person), which may include diffractions from the subject's own body and the first back-scattered sound from the floor or other possible nearby boundaries. Nevertheless, the focus of the current research is stage acoustics, which seldom involves boundaries in close proximity (at least not in the case of this thesis). Also, there is no direct specular reflection path between the mouth and the floor. The back-scattered rays from the floor were assumed to have minimal influence on the perception of self-voice.
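As an illustration of this preprocessing step, the sketch below trims the first 11.74ms from a measured BRIR and smooths the cut edge. It is a minimal sketch under stated assumptions: `brir` is a two-channel floating-point array at sample rate `fs`, the names are illustrative, and the Hann window is applied here as a short fade-in at the truncation point, which is one plausible reading of the windowing described above.

```python
import numpy as np

def trim_brir(brir, fs, latency_ms=11.74, fade_ms=1.0):
    # Drop the portion the real-time system cannot reproduce (direct sound and
    # the first `latency_ms` of the response).
    start = int(round(latency_ms * 1e-3 * fs))
    trimmed = brir[start:].copy()
    # Rising half of a Hann window smooths the truncation edge so the trimmed
    # response does not start with an audible discontinuity.
    fade_len = int(round(fade_ms * 1e-3 * fs))
    fade_in = 0.5 * (1.0 - np.cos(np.pi * np.arange(fade_len) / fade_len))
    trimmed[:fade_len] *= fade_in[:, None]
    return trimmed
```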


    Figure 10 Waves IR1 Convolution Reverb, loaded with a 3-second unit sample sequence

    Figure 11 Impulse response trimming before importing to IR-1


    2.3.4 Implementation of Headphone Equalization

The frequency response of the headphones was measured in a hemi-anechoic chamber. An M30 microphone was positioned 10mm in front of the headphone's transducer. The impulse response was recorded using EASERA. The internal digital parametric equalizer on the 02R was used to approximate the headphone compensation filter using the method described in section 2.3.2.1. The headphones' response is plotted against the inverted compensation filter in Figure 12, and the filter settings are shown in Table 3.

    Figure 12 Headphone response and compensation filter (02R master output)


    Table 3. Headphone compensation filter settings (02R master output)

Headphone compensation   Band 1   Band 2   Band 3   Band 4
Gain (dB)                +8.0     +2.5     -4.0     -2.5
Frequency (Hz)           40       185      1.7k     9.1k
Q                        0.1      0.3      1.0      1.2

    2.4 BRIR Acquisition System

In order to achieve repeatability in binaural acquisition, a head simulator (sometimes called a dummy head) was used. For the particular interest of this thesis, the sound source and receivers correspond to the human mouth and ears; thus, microphones and a loudspeaker were installed inside the dummy head. The heart of the design was a Brüel & Kjær Type 5930 head and torso simulator (HATS). The head geometry theoretically represents an average of human head physical features, in compliance with ITU-T Rec. P.58, IEC 60959 and ANSI S3.36-1985. It was retrofitted with a loudspeaker unit of 50mm diameter inside the mouth cavity as an artificial mouth. The microphones mounted inside the HATS were Brüel & Kjær Type 4010 omni-directional transducers. The grilles of the capsules aligned with the openings of the ear canals as binaural receivers (see the Appendix for the microphone specifications in free field). The structure of the HATS is shown in Figure 13 and the position of the artificial mouth is illustrated in Figure 14. Detailed dimensional information on the HATS can be obtained from the Brüel & Kjær website (www.bksv.com).


Figure 13 Frontal plane section of HATS, showing binaural microphones and the related fittings. (Adapted from manufacturer's manual)


    Figure 14 Median plane section of HATS, showing the artificial mouth

    To validate the representativeness of the HATS, a series of verification tests were

    conducted to examine the binaural microphone characteristics, artificial mouth fre-

    quency responses and MRP-to-ERP transfer function. (See section 2.6)

In BRIR acquisition, the HATS was supported by a microphone stand so that its height measured 5 ft 7 in, or 67 inches (approximately 1.7 meters), which is about the mean height of men and women aged 20 to 74 as reported by the U.S. Department of Health and Human Services in 2004 [26]. The binaural microphones inside the HATS were connected to the EASERA measurement system during acquisition. Before data acquisition, their gain settings were optimized to -10dBFS using a microphone


calibrator producing a 1kHz sine tone (105dBA). Since the binaural microphones cannot easily be removed from the HATS for calibration, and any such repeated handling could cause microphone position misalignment, a compromise approach was adopted: the calibrator was positioned as close to each binaural microphone as possible while remaining on axis.

    The artificial mouth was driven by a Samson Servo 170 power amplifier which has

    a published linear frequency response between 20Hz and 20kHz. Figure 15 shows the

    BRIR acquisition system block diagram.

    Figure 15 Binaural room impulse response acquisition system block diagram, showing how the bin-

    aural ears and artificial mouth of the HATS are connected

    As described in section 2.3.3, the binaural room impulse response was trimmed such

    that the direct sound (including any contribution from internal (bone) conduction and

    direct air conduction) in the BRIR was not included in the convolution.


    2.5 Experimental Procedures in Subjective Test

2.5.1 Test Subject Conditioning

In the current research, it was inevitable to use the test subjects' voices in real-time auralization; the sound stimuli thus become highly unpredictable and may lead to erroneous results due to variance in the subjects' conditions, both physiological (e.g., vocal fatigue) and psychological (e.g., personal emotion). In order to minimize experimental errors, a set of experimental procedures for human subjects was adapted and modified from the method used by Jónsdóttir, et al. [27]. In each measurement, subjects are asked to read a piece of text at least twice before subjective scoring. For each subject, the same set of tests was repeated 6 times within a period of 21 days; on the test days, three trials took place in the morning/midday and three others in the late afternoon/early evening. Before each experimental session, subjects were asked to warm up their voices to performing condition (which takes about 10-20 minutes). Subjects also confirmed that they were under no influence of drugs or alcohol. The above procedures were expected to lessen the impact of the subjects' individual conditions over the test period.

2.5.2 Use of Dramatic Text in Study of Actors

The objective of the thesis is to investigate voice support in theaters, so the subjects were professionally trained actors. Instead of sentences from the Harvard Psychoacoustic Sentence Lists (often recommended for psychoacoustic research on speech), a short edited excerpt from Shakespeare's play Hamlet was chosen for its well-known dramatic expression and its inclusion of most vowel and consonant sounds in English.


    To be, or not to be; that is the question;

To die, to sleep, no more;

    and by a sleep to say we end the heart-ache

    and the thousand natural shocks.

The reason for using a dramatic text was that actors seldom read sentences that have no literary meaning. The Harvard sentences are far from real repertoire and were considered unrepresentative of acting in a theater. A second argument specific to this thesis is that actors always know what they are going to say (they have rehearsed before the performance), and because the current research concerns self-perception, there is no issue of intelligibility of unexpected words or vowel sounds from an unknown sound source.

Subjects were given a sample recording of the text prior to each test session to get accustomed to the rhythm and speed of the speech, and the entire test was recorded so that the pace could be analyzed afterwards.

2.5.3 Verifying the Consistency of Self-voice Stimuli by Monitoring the Pace of Speech

A method of pace analysis was developed with reference to Ando's work on the relationship between subjective preference and objective parameters. Ando found that the minimum effective duration (τe,min) of the running autocorrelation function (ACF) of the sound source is proportional to the most preferred delay of a single reflection [28]. Ando's results suggest that the faster the tempo of the stimulus, the lower the resulting τe,min, and thus the shorter the preferred delay of the first reflection. In the current thesis, this parameter was used as a reference for the pace of speech.


The ACF is defined by:

\Phi_p(\tau) = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{+T} p'(t)\, p'(t+\tau)\, dt

where p'(t) = p(t) * s(t), and s(t) corresponds to ear sensitivity, chosen as the impulse response of an A-weighting filter, as suggested by Ando.

The normalized ACF is expressed as:

\phi_p(\tau) = \frac{\Phi_p(\tau)}{\Phi_p(0)}

The effective duration of the envelope of the normalized ACF is denoted τe, defined as the delay at which the initial decay of the ACF envelope reaches -10 dB (the ten-percent value), obtained by linear regression of the initial decay.

τe is evaluated at 100 ms intervals over the running source signal, and the minimum value is denoted τe,min. Figure 16 shows the values of τe against the elapsed time of two recordings of the same speech. The τe,min values for samples 1 and 2 are 0.39 s and 0.37 s respectively, indicating that a slight variation in the pace of speech does not change the overall temporal characteristics.


Figure 16 Example plot of the effective duration of the running autocorrelation function for two recordings of the same text at different paces, illustrating the use of τe,min as a temporal reference for monitoring subjective testing.

From Figure 16, a slight shift of the peaks can be observed, which indicates the different paces of speech in the two samples (red is faster and blue is slower). Through a number of trials, τe,min was found to be a robust quantifier for verifying the speech pace in subjective testing.

By calculating τe,min for each trial, the pace was controlled by requiring a deviation of less than 5% from the τe,min value of the sample recording. This procedure ensured that all subjects gave their preference scores under the same conditions within a fixed tolerance.
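As a rough, minimal sketch of this analysis (the thesis used Matlab; the fragment below is an approximation, omitting the A-weighting pre-filter and simplifying the envelope fit and frame length):

    import numpy as np

    def effective_duration(frame, fs, decay_db=10.0):
        """tau_e of one frame: the delay at which a linear fit to the initial
        decay of the normalized ACF (in dB) reaches -decay_db."""
        acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        acf = acf / (acf[0] + 1e-12)                  # phi(0) = 1
        env_db = 10.0 * np.log10(np.maximum(np.abs(acf), 1e-12))
        lags = np.arange(len(acf)) / fs
        below = np.where(env_db <= -decay_db)[0]
        stop = max(below[0] if below.size else len(acf), 2)
        slope, intercept = np.polyfit(lags[:stop], env_db[:stop], 1)
        return (-decay_db - intercept) / slope        # seconds

    def running_tau_e(signal, fs, frame_dur=2.0, hop_dur=0.1):
        """tau_e evaluated every hop_dur seconds on overlapping frames."""
        frame, hop = int(frame_dur * fs), int(hop_dur * fs)
        return np.array([effective_duration(signal[i:i + frame], fs)
                         for i in range(0, len(signal) - frame, hop)])

    # pace check against the reference recording (5% tolerance):
    # ok = abs(running_tau_e(trial, fs).min() - tau_ref_min) < 0.05 * tau_ref_min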


A total of 15 subjects participated in this project, but 2 of them were dropped because they did not manage to attend all required tests. The final 13 subjects included 6 Caucasians, 2 Asians, 3 Black Americans and 2 Latin Americans.

    All subjective tests in this thesis followed the above scheme unless otherwise stated.

2.6 HATS Verification Tests

All evaluation measurements of the HATS were conducted in an anechoic chamber at the General Electric Laboratory, Niskayuna, NY (see Figure 17). The volume of the chamber was 5100 cu. ft., and the background noise was rated at under 20 dBA. Room temperature and humidity were measured before and after each experimental session and showed negligible variation.

    Figure 17 Verification Test of HATS in anechoic chamber at General Electric Laboratory (NY).


    2.6.1 Binaural Microphones

The microphones at the two ears of the HATS were first calibrated using a sine-tone calibrator (1 kHz at 105 dBA) to achieve identical gain. A dodecahedron loudspeaker was positioned 1 meter away from the HATS, directly in front of the mouth opening. The frequency responses were averaged over 3 measurements taken with different orientations of the dodecahedron loudspeaker to minimize any potential error caused by its directional characteristics in the near field.

Figure 18 shows an overlay of the frequency responses of the two binaural microphones.

Figure 18 Frequency response comparison between the HATS binaural microphones

The results showed acceptable differences between the two binaural microphones. The peaks and notches are due to the head-related transfer function and the pinnae of the HATS.


2.6.2 Artificial Mouth

In BRIR acquisition, the loudspeaker unit inside the HATS was used to generate excitation signals in the HATS mouth cavity; it is therefore called the artificial mouth. Since no specifications for this retrofitted artificial mouth were readily available, its directivity, on-axis response, and MRP-to-ERP transfer function were measured.

2.6.2.1 Directivity

The directivity measurements were conducted at 15-degree resolution in the horizontal plane (full 360 degrees) and in the vertical plane (from -45 to 135 degrees). The HATS was left stationary throughout the measurement session while the microphone position was manually adjusted for each measurement.

Figure 19 Voice directivity of the HATS in (a) the horizontal plane and (b) the vertical plane, in 4 octave bands (250 Hz, 500 Hz, 1 kHz & 2 kHz)

The directivity was plotted using Matlab and compared with data from previous artificial heads and with mean values reported for human voices, as shown in Figure 20.


Figure 20 Comparison of voice directivity, in 3 octave bands (500 Hz, 1 kHz & 2 kHz)

The directivity of the current HATS was found to be similar to the human mean values reported in the literature. Although this does not directly imply higher reliability of the experimental results, it does suggest that the HATS is a good representation of a human voice source.

2.6.2.2 On-axis Frequency Response

The on-axis frequency response of the HATS was measured at 1 m directly in front of the artificial mouth. Figure 21 shows the frequency response. It was rather erratic, which may have resulted from the construction of the HATS: the HATS itself has inherent resonance characteristics, and the head cavity is not damped by any material.


The prominent peaks in the high-mid frequency range were believed to be related to the resonant frequencies of the HATS.

Figure 21 On-axis frequency response of the HATS artificial mouth. Overlays of the unsmoothed and 1/6-octave-smoothed responses.

Two resonant peaks were observed at 6.6 kHz and 7.6 kHz, which roughly correspond to distances of 0.05 m and 0.045 m. They were believed to be related to the position of the loudspeaker unit with respect to the internal cavity of the HATS.
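For reference, and assuming these distances simply refer to the acoustic wavelength at a speed of sound of roughly c = 343 m/s, the correspondence follows from λ = c/f:

\lambda_{6.6\,\mathrm{kHz}} = \frac{343}{6600} \approx 0.052\ \mathrm{m}, \qquad \lambda_{7.6\,\mathrm{kHz}} = \frac{343}{7600} \approx 0.045\ \mathrm{m}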

The current artificial mouth response was compared to another commercially available model, the B&K Type 4128C artificial mouth (see Figure 22).


Figure 22 Frequency response of the B&K Type 4128C artificial mouth [adapted from the manufacturer's datasheet]

The frequency response of a loudspeaker unit coupled to the HATS is expected to display anomalies due to interaction with the physical geometry, and a flat frequency response is hardly achievable in such devices. The assumption made here is that, given the similar voice directivity analyzed in section 2.6.2.1, the artificial mouth is acceptable for the current study.

2.6.2.3 MRP-to-ERP Transfer Function

Two microphones were used in this measurement: an Earthworks M30 omni-directional microphone and a Countryman B3 miniature omni-directional microphone. The two microphones were first calibrated by measuring their transfer function (see section 2.2), which was used as an equalization function in the measurement.

The M30 and B3 were placed at the MRP and ERP respectively. The B3 capsule was suspended so that there was no physical contact with the HATS, in order to eliminate any internal vibration transmission from the artificial mouth to the microphone via the HATS surface. The transfer function was measured in EASERA and compensated with the equalization described above. The result is shown in Figure 23.

Figure 23 MRP-to-ERP transfer function of the HATS (1/3-octave smoothing)

The above plot was smoothed and zoomed in order to compare it with the averaged results of the human subjects in Pörschmann's study, shown in Figure 24.


Figure 24 Averaged MRP-to-ERP (Direct AC) frequency response of 18 human subjects. The grey area marks the standard deviation. (Adapted from Pörschmann [2000])

The HATS frequency response was considered to fall roughly within the human averaged values, except that it was slightly lower in the frequency range between 300 Hz and 2 kHz.

2.7 Subjective Test on Naturalness of the Auralization System

Considering the variation among human heads, the validity of the auralization setup can be verified by subjective tests. The aim of this evaluation was to find the optimal delay time and filter implementation for Direct AC, as explained in sections 2.3.2.1 and 2.3.2.2. In the evaluation of the system's naturalness, Indirect AC (as described in section 1.2.3) was ignored.


    2.7.1 Evaluation on Naturalness of CIL Filter Delay Time

The delay time implemented for Direct AC represents the propagation delay from the mouth opening to the ear-canal entrance. It is critical in the auralization process because the delayed voice reproduced by the headphones is acoustically combined with the direct sound traveling from the mouth through the open-back headphones before entering the subject's ear canal. An improper delay time would induce perceivable echoes or comb-filtering effects. Since head geometry and facial features differ from person to person, the delay time may vary. It was assumed that the comb filtering is inaudible if the separation in arrival time is short enough that the lowest null frequency of the comb filter lies beyond the audible spectrum. For instance, a time delay of 20 microseconds produces comb filtering whose first null is at 25 kHz.
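This follows from the standard relation between a delay τ and the first null of the resulting comb filter (a reasoning step made explicit here rather than taken from the thesis):

f_{\mathrm{null}} = \frac{1}{2\tau} = \frac{1}{2 \times 20\,\mu\mathrm{s}} = 25\ \mathrm{kHz}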

This test sought the most natural and representative delay time to be used in auralization without compromising the perceptual naturalness of self-voice. The tests were conducted with 13 subjects following the procedures stated in section 2.5, using the auralization setup of section 2.1.

During the test, subjects were exposed to a random sequence of delay times comprising 3 repetitions of the 8 delay settings. The random sequence was generated in Matlab individually for each test set. For each setting, subjects were asked to read the given text in full twice, once with headphones and once without. Subjects were allowed to repeat reading and listening until they felt able to compare the sound of their own voice with headphones to that without headphones (the reference). The subjects then rated the degree of similarity on a 7-point category scale from 1 to 7 (1 being very dissimilar and 7 being very similar) for the given setting and notified the experimenter via the microphone before changing to the next delay setting.
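The thesis generated these presentation orders in Matlab; as a minimal illustrative sketch in Python (not the original script, with the delay values taken from Table 4 below), one session's order could be produced as follows:

    import random

    delay_settings_ms = [0.14, 0.30, 0.47, 0.63, 0.97, 1.63, 2.30, 2.97]
    repetitions = 3

    sequence = delay_settings_ms * repetitions   # 24 trials per session
    random.shuffle(sequence)                     # new random order per test set
    print(sequence)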

The aims of this evaluation test were to find the optimal delay time for the CIL filter used in auralization and to validate the effectiveness of the current system by comparing the results with previous experiments. The delay times evaluated were therefore specifically chosen to match those found in the literature, allowing a direct comparison (see Table 4).

Table 4. Delay times in the evaluation test of naturalness of the Direct AC insertion-loss compensation (CIL) filter. *The current system has a processing delay of 0.14 ms with a setting of 0.01 ms on the DN716. **Pörschmann's tested delay times were based on taps of a 48 kHz Tucker-Davis DSP system; they are expressed here in milliseconds for convenience of comparison.

Direct AC delay time:        0.14 ms   0.30 ms    0.47 ms    0.63 ms     0.97 ms     1.63 ms     2.30 ms     2.97 ms
DN716 delay setting*:        0.01 ms   0.17 ms    0.34 ms    0.50 ms     0.84 ms     1.50 ms     2.17 ms     2.84 ms
Pörschmann's test [2001]**:  -         0.30 ms    0.47 ms    0.63 ms     0.97 ms     1.63 ms     2.30 ms     2.97 ms
                                       (0 taps)   (8 taps)   (16 taps)   (32 taps)   (64 taps)   (96 taps)   (128 taps)

The same test was repeated 6 times for each subject, totaling 144 judgments per subject. All results were combined and compared with Pörschmann's published results, as shown in Figure 25.


Figure 25 Subjective evaluation of the naturalness of the delay time in Direct AC auralization. Mean scores of all subjects with 95%-confidence error bars.

The above comparison shows that the current evaluation exhibits a preference trend similar to that observed in Pörschmann's results: subjects tended to prefer lower delay times. The current experiment included one additional delay setting (0.14 ms), which is less than the nominal MRP-to-ERP propagation delay of 0.3 ms. Interestingly, the most preferred delay in the current setup is 0.14 ms. This might result from the difference in the definition of the MRP (MRP80, 80 mm from the lips) in the current research as compared to Pörschmann's (MRP40, 40 mm from the lips).


2.7.2 Evaluation on Naturalness of CIL Filter Level

This test sought the sound level of the CIL-filtered signal at the headphones that achieves the best realism. The tests were conducted with 13 subjects following the procedures stated in section 2.5, using the auralization setup of section 2.1.

Six levels were under test: Inf, -6 dB, -3 dB, 0 dB, +3 dB and +6 dB, where 0 dB indicates unity gain of the CIL filter and Inf refers to the absence of the filter compensation path in the auralization. Positive values represent a gain in the filtered compensation, whereas negative values represent an attenuation.

During the test, subjects were exposed to a random sequence of sound levels comprising 3 repetitions of the 6 settings. The random sequence was generated in Matlab individually for each test set. For each setting, subjects were asked to read the given text in full twice, once with headphones and once without. Subjects were allowed to repeat reading and listening until they felt able to compare the sound of their own voice with headphones to that without headphones (the reference). The subjects then rated the degree of similarity on a 7-point category scale from 1 to 7 (1 being very dissimilar and 7 being very similar) for the given setting and notified the experimenter via the microphone before changing to the next CIL-filter level setting.

The same test was repeated 6 times for each subject, totaling 108 judgments per subject. All results were combined and compared with Pörschmann's published results, as shown in Figure 26. The results show that the current evaluation has an overall lower score for all settings but displays a trend similar to that observed with Pörschmann's setup. Subjects agreed that the nominal level setting for the CIL filter is the most natural, supporting the successful implementation.


Figure 26 Subjective evaluation of the naturalness of the CIL filter level in Direct AC auralization. Mean scores of all subjects with 95%-confidence error bars.

2.8 Discussion

Overall, the current auralization setup gave satisfactory results in the evaluation of the naturalness of Direct AC. The CIL filter level was determined to be at unity gain, and the delay setting was chosen to be 0.01 ms on the DN716 (resulting in 0.14 ms of total delay).


    3. Subjective Preference Tests on Stage Acoustic Conditions for Actors

    3.1 Introduction

The focus of the current study is voice stage support in proscenium theaters. This particular type of theater design creates a special acoustical situation: a proscenium theater is characterized by the separation of a stage house from the main auditorium by a large opening called the proscenium. The two acoustic volumes (stage and auditorium) can have very different acoustical properties depending on the design. In general, the stage house has a high ceiling for counter-weight flying systems, and thick curtains are hung above and beside the stage area in order to mask off-stage areas, lighting instruments and unused scenery pieces. The stage is often not specifically designed for acoustical purposes, whereas the auditorium is usually designed to optimize the audience's aural experience.

The interaction of these two volumes is known in architectural acoustics as the coupled-space phenomenon. Considering the actors' locations and movements on stage, they are mostly within the stage volume. As they move closer to the audience and reach the proscenium or the forestage, they move into the aperture of the coupled spaces, where the stage volume and the auditorium meet. Unlike an audience member, actors are constantly moving and turning on stage while acting, so their experience of the sound field changes drastically from moment to moment. As a starting point for research in this field, the current study investigates acoustic conditions on stage under the assumption that the actor is not moving. Actors sometimes speak from a fairly fixed location for a sustained period when delivering monologues (or prose); the most common positions are the center of the forestage and the central area of the main stage. These two positions were studied in this research, and two more positions off to the side were chosen as well. Ideally, more positions should be studied, but the limiting factor was the time required for each subject to go through all acoustic conditions without vocal fatigue.


Figure 27 Architectural plan of the main space at the RPI Playhouse. Dimensions in inches; blue lines are dimensional guides. (CAD drawing courtesy of RPI Building Management)

3.2 Impulse Response Acquisition

Binaural room impulse responses were collected while the playhouse was unoccupied. The auditorium of the 200-seat playhouse has an area of approximately 2650 sq. ft. (246.1 sq. m.) and a volume of around 39750 cu. ft. (1125.5 cu. m.). The stage is located opposite the playhouse entrance, separated from it by a proscenium measuring 31.6 ft by 14 ft (9.6 m by 4.2 m). The stage level is 5 feet higher than the auditorium, and the stage has an area of roughly 1260 sq. ft. (117 sq. m.) and a volume of 25200 cu. ft. (713.6 cu. m.). Figure 27 shows the detailed dimensions. All seats were removed and the stage was cleared during the measurement sessions. The stage was set to a standard configuration of masking flats and borders hung above the stage.


Measurements were taken at four stage locations using the setup described in section 2.4. The relative positions of the stage locations are shown in Figure 28. The acronyms used stand for down stage center (DSC), down stage right (DSR), center stage center (CSC) and center stage right (CSR). It should be noted that stage right is defined as the right-hand side from the actor's perspective when he or she is facing the audience. The four stage locations were 3 meters apart from each other.

    Figure 28 Stage locations where BRIR was measured. Dashed line labeled "CL" is the center line of

    the stage across the proscenium.

At each stage position, the HATS was manually adjusted to three different head orientations with the aid of a rotation-angle guide marked on top of the HATS. The relative angles of the head rotations are shown in Figure 29. The height of the HATS was maintained at 5 feet 7 inches (about 1.7 m) above the stage floor, as described in section 2.4.

    Figure 29 Top view of the HATS showing 3 different head orientations in binaural room impulse

    response acquisition.

Because of the background noise level at the playhouse, a pink-spectrum swept-sine excitation signal of length 21.8 s was used to achieve a better signal-to-noise ratio (SNR). The average SNR reported across all measurements was 52 dB. Three averages were taken in each measurement.

3.3 Subjective Test Design

3.3.1 Preference ratings from paired comparison

In section 2.7, a 7-category rating method was chosen for a specific reason: to generate results compatible with previously published data. Nevertheless, it carries uncertainty in the absolute judgments made by different subjects. Paired comparison helps reduce both absolute and relative judgment errors. In this chapter, different combinations of acoustic conditions were presented in pairs to the subjects in a randomized order. In each of the paired comparison tests in this chapter, there were four different acoustic conditions. The six pair combinations were A-B, A-C, A-D, B-C, B-D and C-D. Each pair was presented twice, once in forward order and once in reversed order, giving a total of 12 randomized pair-conditions for each subject in each test.


(The author first proposed 4 repetitions of each pair, resulting in 24 test conditions, but test subjects reported vocal fatigue and loss of concentration after a certain period of time (usually 30 minutes). Thus, in order to balance the reliability of the test results against the subjects' personal comfort, the number was reduced to 12 pair-conditions in the stage preference testing.)
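As a minimal illustrative sketch of how such a randomized pair sequence could be constructed (the thesis used Matlab; this Python fragment is not the original script):

    import itertools
    import random

    conditions = ["A", "B", "C", "D"]

    # six unordered pairs, each presented in forward and reversed order
    pairs = []
    for a, b in itertools.combinations(conditions, 2):
        pairs.extend([(a, b), (b, a)])    # 12 pair-conditions in total

    random.shuffle(pairs)                 # new random order per subject and test
    print(pairs)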

3.3.2 Test procedures

In each comparison, the pair of acoustic conditions (Indirect AC) was pre-loaded into preset A and preset B in IR-1 (see Figure 10 and note the long "A:" button below the red circle labeled RTAS). The subjects were able to compare the pair by asking the experimenter to A/B-swap the presets, and they were allowed to spend as much time as they wanted in each preset (condition). After auditioning both, the subjects were asked, "Which did you prefer?" The preferred condition was scored as +1 and the other as -1. Preference scores were summed and normalized: a score of 1.0 indicates complete unanimity of preference for the acoustic condition, 0.0 means an equal number of positive and negative scores, and -1.0 means complete agreement on a negative preference.
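As a small sketch of this scoring (the exact normalization, dividing the summed votes by the number of comparisons, is an assumption consistent with the score range described above):

    import numpy as np

    def normalized_preference(votes):
        """votes: +1 / -1 judgments collected for one acoustic condition
        across all paired comparisons. Returns a score in [-1, 1]."""
        votes = np.asarray(votes, dtype=float)
        return votes.sum() / len(votes)

    # e.g. a condition preferred in 5 of 6 comparisons:
    # normalized_preference([+1, +1, +1, +1, +1, -1])  ->  0.667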

After each complete set of paired comparisons, the subjects were allowed to rest for 5 minutes before the next set of tests began. All other procedures followed section 2.5.

3.4 Paired Comparison Test on Stage Locations

The paired comparison tests were organized in three groups, each representing one head orientation: "look center", "look left" and "look right". Four stage locations were studied in each group, and the preference scores were obtained and analyzed.


3.4.1 Preference study of stage locations when head orientation is "look center"

3.4.1.1 Result

The preference scores for each subject were averaged over the 6 test sessions. Scores were obtained for 4 conditions (A, B, C and D) corresponding to the 4 stage locations (DSC, DSR, CSC and CSR) respectively. The results of all 13 subjects are plotted individually in Figure 30; the scores of all subjects in each condition were then averaged and plotted as well.

The results show agreement on a positive preference for Condition A (DSC, down stage center) and Condition C (CSC, center stage center) across all subjects, with Condition A slightly more preferred than Condition C in the overall average. Another agreement across all subjects is the negative preference for Condition D (CSR).


Figure 30 Preference scores for different stage locations when the head orientation is "look center" (Conditions - A: DSC, B: DSR, C: CSC, D: CSR). Normalized scores of the 13 individual subjects A-M (blue bars) and the overall average score of all subjects (red bars).

3.4.1.2 Analysis

When the test subjects were interviewed after the experiment, their common feedback concerned the lateralization of the sound decay and the change in the sense of envelopment. Some of the subjects asked during the test sessions whether the headphone volume was unbalanced between the two channels. Subjects expressed an inclination toward spatially balanced and enveloping sound fields. The interaural cross-correlation function (IACF) is commonly used in analyzing the impact of side reflections and the subjective preference for room width; it can also be used to visualize the lateralization of a running sound source.

The IACF is defined as:

\mathrm{IACF}_t(\tau) = \frac{\int_{t_1}^{t_2} p_L(t)\, p_R(t+\tau)\, dt}{\left[ \int_{t_1}^{t_2} p_L^2(t)\, dt \int_{t_1}^{t_2} p_R^2(t)\, dt \right]^{1/2}}

where L and R refer to the entrances of the left and right ear canals. The maximum possible value of the IACF is one, reached when both signals are identical. The variable τ accounts for the time difference between the two ears and is varied over a range from -1 to +1 ms from the first arrival [29].
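As a rough, minimal sketch of this computation on a two-channel BRIR (not the thesis code; the frame handling and normalization details are assumptions):

    import numpy as np

    def iacf_frame(pL, pR, fs, max_lag_ms=1.0):
        """Normalized interaural cross-correlation of one frame,
        for lags between -max_lag_ms and +max_lag_ms."""
        max_lag = int(max_lag_ms * 1e-3 * fs)
        norm = np.sqrt(np.sum(pL**2) * np.sum(pR**2)) + 1e-12
        lags = np.arange(-max_lag, max_lag + 1)
        vals = np.empty(len(lags))
        for i, lag in enumerate(lags):
            if lag >= 0:
                vals[i] = np.sum(pL[:len(pL) - lag] * pR[lag:]) / norm
            else:
                vals[i] = np.sum(pL[-lag:] * pR[:len(pR) + lag]) / norm
        return lags / fs, vals

    def running_iacf(pL, pR, fs, frame_dur=0.1):
        """IACF evaluated on consecutive 100 ms frames of a BRIR."""
        frame = int(frame_dur * fs)
        n = min(len(pL), len(pR))
        return np.array([iacf_frame(pL[i:i + frame], pR[i:i + frame], fs)[1]
                         for i in range(0, n - frame, frame)])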

The IACF for each condition was calculated in 100 ms intervals and plotted in Figure 31.

    Figure 31 Interaural cross-correlation functions in 100ms-intervals for conditions A-D when head

    orientation is "look center".

In the IACF plots above, the spike at time = 0 ms indicates the highest correlation between the two ears at the onset of the impulse response. As the sound decays, the correlation rapidly drops to around zero, suggesting that the early reverberant field (0 to 400 ms) was fairly diffuse in all four conditions. The correlation rises toward the late reverberant field (after 400 ms), although this rise does not produce a prominent peak across the binaural sound field. The rise is apparent in both Conditions A and C, which suggests a relation between the subjective preference scores and the behavior of the late reverberant field. The late reverberation in the playhouse was believed to be contributed by reflections from the back wall of the auditorium, and the absence of a peak in the late IACF implies diffuse late reflections and a sense of envelopment.

Furthermore, the energy decay was examined. Because, in acquiring the impulse response, the sound source (artificial mouth) and the receivers (binaural ears) are in extremely close proximity, a conventional reverberation time calculation cannot be applied directly to describe the subjective sensation of the decay. It is also unknown how a person perceives the reverberation time of the same acoustic space differently when listening to his or her own voice rather than to other sound sources. As a result, an alternative parameter, Voice Stage Support (VSS), was proposed to analyze the energy decay:

\mathrm{VSS}(t_i) = 10 \log_{10} \frac{\int_{t_i - 90\,\mathrm{ms}}^{t_i + 10\,\mathrm{ms}} p^2(t)\, dt}{\int_{0}^{10\,\mathrm{ms}} p^2(t)\, dt}

where the t_i are time instants spaced at 100 ms intervals.
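A minimal sketch of this calculation on one BRIR channel, assuming the integration limits are in milliseconds and the reference is the energy of the first 10 ms (a hypothetical helper, not the thesis code):

    import numpy as np

    def voice_stage_support(p, fs, t_max_ms=1000.0):
        """VSS(t_i): energy in the window [t_i - 90 ms, t_i + 10 ms] relative
        to the direct-sound energy in the first 10 ms, in dB, for t_i at
        100 ms intervals."""
        energy = np.asarray(p, dtype=float) ** 2
        ref = energy[: int(0.010 * fs)].sum() + 1e-30   # first 10 ms
        vss, ti = [], 100.0
        while ti <= t_max_ms and int((ti + 10) * 1e-3 * fs) <= len(energy):
            lo = int((ti - 90) * 1e-3 * fs)
            hi = int((ti + 10) * 1e-3 * fs)
            vss.append(10.0 * np.log10(energy[lo:hi].sum() / ref + 1e-30))
            ti += 100.0
        return np.array(vss)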

Since the energy of the direct sound from the artificial mouth to the ears is constant, it is taken as the reference initial energy, and the energy ratio is calculated for every 100 ms interval after the initial 10 ms. The results of VSS of

