Cosc 6326/Psych6750X
Audition and Auditory Displays
Use of auditory displays
Sound in information display
• speech provides a high bandwidth communication channel
• audition is a long distance sense without field of view restrictions
• Sound is useful for information display (Cohen & Wenzel 1995)
– when origin of message is a sound (voice, music)
– when message is simple and short (e.g. event markers)
– when message will not be referred to later (e.g. time)
– when message deals with events in time
– warnings or prompts (hearing is always on, no field of view issues)
– continuously changing information (e.g. countdown)
– when other systems (e.g. vision) are overloaded
– when verbal response is required (compatibility)
– when illumination or disability prevents vision (e.g. alarm clock, limited field of view, blindness)
– when the user moves from place to place (sound as a ubiquitous I/O channel)
Sonification
• In ‘visualization’ situations, ‘sonification’ of data can assist in the exploration of complex datasets
• In these applications ‘realism’ is typically not a major issue
• Sound can help interpret complex or multidimensional data; can provide an independent display dimension
• In addition to information display, in immersive displays sound contributes to:
– realism, situational awareness and presence
– ambience and emotive context
– cueing visual attention
– natural communication
– space perception
Realism and ambience
• High quality sound improves perceived ‘quality’ of visual displays
• Sounds in the environment provide vital information that contributes to situational awareness
• Persistence of sounds of objects out of field of view may help maintain object permanence
• Sound is believed to be vital for conveying emotion and ambience in movies
• Ambient sounds can be realistic or abstract (e.g. music to set mood)
• Absence of appropriate sound degrades realism
• If background sounds are not well matched to visuals, participant may feel detached and ‘presence’ may be degraded
• Relation between presence and realism is not straightforward (later lecture)
• Hearing is an omnidirectional sense and may help user feel immersed in the VE
• Auditory collision cues may help navigating a VE (especially with HMDs)
Audition
Sound
• Sound is “mechanical vibrations and waves of an elastic medium, particularly in the frequency range of human hearing (16 Hz to 20 kHz)”
• Normally, the medium is air. Sound is an air pressure wave.
• Sound is usually used to describe the physical stimulus.
• Audition refers to perception.
• An auditory event is usually elicited by a sound event.
• A sinusoidal pressure wave is known as a pure tone.
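[Figure: sinusoid x(t) plotted against t, with period T0 = 1/f0]
• Sinusoid
– $x(t) = A \cos(2\pi f_0 t + \phi)$
– $A$ is amplitude, $f_0$ is frequency, $\phi$ is phase
– $T_0 = 1/f_0$ is period; $\phi$ is related to time shift of peak
– wavelength $\lambda = c/f$ (c is the speed of sound)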
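The sinusoid relations can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and the speed of sound (343 m/s, air at about 20 °C) is an assumed value.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed, not from the slides)

def pure_tone(t, amplitude, f0, phase=0.0):
    """Sample x(t) = A cos(2*pi*f0*t + phase) at time t in seconds."""
    return amplitude * math.cos(2.0 * math.pi * f0 * t + phase)

def period(f0):
    """Period T0 = 1/f0 in seconds."""
    return 1.0 / f0

def wavelength(f, c=SPEED_OF_SOUND):
    """Wavelength lambda = c/f in metres."""
    return c / f

def peak_time_shift(phase, f0):
    """Time of the peak nearest t = 0: the cosine peaks where 2*pi*f0*t + phase = 0."""
    return -phase / (2.0 * math.pi * f0)
```

Under the assumed speed of sound, the 16 Hz to 20 kHz hearing range corresponds to wavelengths from roughly 21 m down to about 1.7 cm.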
Dimensions of sound
• Harmonic content: pitch, melody, harmony, waveshape, timbre, vibrato
• Timing: duration, tempo, rhythm
• Loudness, envelope
• Spatial: azimuth, elevation, distance
• Ambience: resonance, reverberation, spaciousness
• Representation: literal, auditory icons, abstract
• Perceptual and physical dimensions are analogous but distinct
– pitch and frequency (directly related for pure tones)
– loudness and intensity
– timbre and complexity
Matlin and Foley, Sensation and Perception
Kandel et al, Principles of Neural Science
Physiology and psychophysics
• Cochlea performs mechanical spectral analysis of sound signal
• Pure tone induces traveling wave in basilar membrane.
– maximum mechanical displacement along membrane is function of frequency (place coding)
• Displacement of basilar membrane changes with compression and rarefaction (frequency coding)
Matlin and Foley, Sensation and Perception
Kandel et al, Principles of Neural Science
Perception of pitch
• Along the basilar membrane, hair cell response is tuned to frequency
– each neuron in the auditory nerve responds to acoustic energy near its preferred frequency
– preferred frequency is place coded along the cochlea. Frequency coding believed to have a role at lower frequencies
• Higher auditory centers maintain frequency selectivity and are ‘tonotopically mapped’
• Pitch is related to frequency for pure tones.
• For periodic or quasi-periodic sounds the pitch typically corresponds to inverse of period
• Some sounds have no perceptible pitch (e.g. clicks, noise)
• Sounds can have same pitch but different spectral content, temporal envelope … timbre
Perception of loudness
• Intensity is measured on a logarithmic scale in decibels
• Range from threshold to pain is about 120 dB-SPL
• Loudness is related to intensity but also depends on many other factors (attention, frequency, harmonics, …)
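A short sketch of the decibel scale, assuming the usual dB SPL convention of a 20 µPa reference pressure (the reference is not stated in the slides; function names are hypothetical):

```python
import math

P_REF = 20e-6  # Pa; conventional reference pressure for dB SPL (assumed convention)

def spl_db(pressure_pa):
    """Sound pressure level in dB SPL for an RMS pressure given in pascals."""
    return 20.0 * math.log10(pressure_pa / P_REF)

def pressure_ratio(level_db):
    """Pressure ratio corresponding to a level difference in dB."""
    return 10.0 ** (level_db / 20.0)

def intensity_ratio(level_db):
    """Intensity (power) ratio corresponding to a level difference in dB."""
    return 10.0 ** (level_db / 10.0)
```

On this scale the roughly 120 dB range from threshold to pain is a factor of about a million in pressure and about a trillion in intensity, which is why a logarithmic measure is used.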
Spatial hearing
• Auditory events can be perceived in all directions from observer
• Auditory events can be localized internally or externally at various distances
• Audition also supports motion perception
– change in direction
– Doppler shift
• Ability to localize depends on sound source and environment
– a tone in reverberant room is difficult to locate in time and space
– a click in an anechoic chamber, on the other hand, is precisely located and time limited
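The Doppler shift mentioned above can be illustrated with the standard textbook formula for motion along the line joining source and listener. This is a sketch only: the function name and the 343 m/s speed of sound are assumptions, not values from the slides.

```python
SPEED_OF_SOUND = 343.0  # m/s in air (assumed value)

def doppler_frequency(f_source, v_source_toward, v_listener_toward=0.0,
                      c=SPEED_OF_SOUND):
    """Observed frequency f' = f * (c + v_listener) / (c - v_source).

    Velocities are in m/s along the source-listener line, positive when
    moving toward the other party and negative when moving away.
    """
    if v_source_toward >= c:
        raise ValueError("source speed must be below the speed of sound")
    return f_source * (c + v_listener_toward) / (c - v_source_toward)
```

An approaching source is heard at a higher pitch and a receding one at a lower pitch, so the pitch drop as a source passes is a cue to its motion.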
Auditory Scene Analysis
• Process of separating out the different sources present in the environment
• Detection and segregation of distinct sources
• Grouping of sounds in spatial and temporal proximity into single streams
Cocktail party effect
• In environments with many sound sources it is easier to process auditory streams if they are separated spatially
• Spatial sound techniques can help in sound discrimination, detection and speech comprehension in busy immersive environments
Spatial Auditory Cues
• Two basic types of head-centric direction cues
– binaural cues
– spectral cues
Binaural Directional Cues
• When a source is located eccentrically it is closer to one ear than the other
– sound arrives later and weaker at one ear
– head ‘shadow’ also weakens sound arriving at opposite ear
• Binaural cues are robust but ambiguous
http://headwize.com/tech/aureal1_tech.htm
• Interaural time differences (ITD)
– ITD increases with directional deviation from the median plane. It is about 600 µs for a source located directly to one side.
– Humans are sensitive to as little as 10 µs ITD. Sensitivity decreases with ITD.
– For a given ITD, phase difference is linear function of frequency
– For pure tones, phase based ITD is ambiguous
– At low to moderate frequencies phase difference can be detected. At high frequencies can use ITD in signal envelope.
– ITD cues appear to be integrated over a window of 100-200 ms (binaural sluggishness, Kollmeier & Gilkey, 1990)
• Interaural intensity differences (IID)
– With lateral sources head shadow reduces intensity at opposite ear
– Effect of head shadow most pronounced for high frequencies.
– IID cues are most effective above about 2000 Hz
– IID of less than 1 dB are detectable. At 4000 Hz a source located at 90° gives about 30 dB IID (Matlin and Foley, 1993)
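The ITD relations can be sketched numerically. The slides do not give a model for ITD versus direction, so this example swaps in Woodworth's spherical-head approximation, with an assumed head radius of 8.75 cm and an assumed speed of sound of 343 m/s. All names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air (assumed value)
HEAD_RADIUS = 0.0875    # m; a commonly used average head radius (assumed value)

def itd_woodworth(azimuth_rad, a=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Woodworth spherical-head approximation: ITD = (a/c) * (theta + sin(theta)).

    theta is the azimuth from the median plane, in radians, within [0, pi/2].
    """
    return (a / c) * (azimuth_rad + math.sin(azimuth_rad))

def interaural_phase(itd_s, freq_hz):
    """Interaural phase difference in radians: 2*pi*f*ITD (linear in frequency)."""
    return 2.0 * math.pi * freq_hz * itd_s

def phase_is_ambiguous(itd_s, freq_hz):
    """True when the phase difference exceeds half a cycle, so phase alone
    cannot tell which ear leads for a pure tone."""
    return freq_hz * itd_s > 0.5
```

With these assumed values a source directly to one side gives an ITD of roughly 0.65 ms, the same order as the figure quoted above. For an ITD of that size the phase cue becomes ambiguous for pure tones above roughly 800 Hz.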
Goldstein, Sensation and Perception
Ambiguity and Lateralization
• These binaural cues are ambiguous. The same ITD/IID can arise from sources anywhere along a ‘cone of confusion’
• Spectral cues and changes in ITD/IID with observer/object motion can help disambiguate
• When directional cues are used in headphone systems, sounds are lateralised left versus right but seem to emanate from inside the head (not localised)
• also for near sources (less than 1 m) there is significant IID due to differences in distance to each ear even at lower frequencies (Shinn-Cunningham et al 2000)
• Intersection of these ‘near field’ IID curves with cones of confusion constrains them to toroids of confusion
Spectral Cues
• Pinnae (outer ears) and head shadow each ear and create frequency-dependent attenuation of sounds that depends on direction of source
• Because pinnae are relatively small, spectral cues are effective predominantly at higher frequencies (i.e. above 6000 Hz)
• Direction estimation requires separation of spectrum of sound source from spectral shaping by the pinnae
• Shape of the pinnae shows large individual differences, which are reflected in differences in spectral cues
Distance Cues
• anechoic
– intensity decreases with distance
– attenuation is higher at high frequency
– confound with spectrum and intensity of source
• Near field IID
http://headwize.com/tech/aureal1_tech.htm
• reverberation
– ratio of direct to reverberant energy indicates distance wrt environment
– reverberation pattern indicates ‘spaciousness’ of the environment
– reverberation is more realistic but can degrade localisation, speech recognition …
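A minimal sketch of two of the distance cues above. It assumes an idealised point source in a free (anechoic) field, where intensity follows the inverse-square law. The function names are hypothetical, and frequency-dependent air absorption is not modelled.

```python
import math

def level_change_db(distance_m, reference_m=1.0):
    """Level change in dB relative to the level at reference_m, for a point
    source in a free field (inverse-square law): -20 * log10(d / d_ref)."""
    return -20.0 * math.log10(distance_m / reference_m)

def direct_to_reverberant_db(direct_energy, reverberant_energy):
    """Direct-to-reverberant energy ratio expressed in dB."""
    return 10.0 * math.log10(direct_energy / reverberant_energy)
```

Under this assumption each doubling of distance lowers the level by about 6 dB. A listener cannot separate this drop from a quieter source without knowing the source intensity, which is the confound noted above.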
Visual-Auditory Interactions
• Auditory cues associated with visual targets can cue visual attention
• Latency for audition is less than vision
• A sound associated with visual target
– can speed visual search
– can reduce response times
– can facilitate saccadic eye movements
– can cue attention outside the field of view
• Ventriloquism and visual capture
– When a visual and auditory source are grouped, the sound is usually perceived in the direction of the visual target
Auditory/Aural Displays
• Headphone displays
– Precise independent control of inputs to each ear.
– Individual display.
– Closed ear type can exclude external sounds. Reduces interference from external sources; simplifies AR systems.
– Entail an encumbrance.
– Diotic, dichotic (stereo) and spatialised displays
– Head fixed frame of reference. Display needs to be head tracked to register with virtual world.
• Speaker systems
– Simpler, less encumbrance, multi-user
– Cannot ‘occlude’ real world sounds but can sometimes mask
– Complication with echoes and cross-coupling between channels
– Interference from/with visual displays
– World frame of reference.
– Subwoofer allows for deep bass. Could augment headphones
Spatialised audio
• simple ITD, IID cues in a display lateralize a sound. Sound is not ‘externalized’
• spatialised audio: generate most of the spatial cues in real world environment using signal processing
• with appropriate modeling of sound sources and user tracking can provide a compelling illusion of spatial sound in a VE
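The simple ITD/IID lateralization described in the first bullet can be sketched as a delay and a gain applied to one channel. This is an illustration under stated assumptions: the function name and parameters are hypothetical, the delay is a whole number of samples, and no spectral (pinna) cues are applied.

```python
def lateralize(mono, itd_samples, iid_db, toward_left=True):
    """Return (left, right) sample lists built from a mono signal.

    The far-ear channel is delayed by itd_samples and attenuated by iid_db;
    the near-ear channel is zero-padded so both have equal length.
    """
    if itd_samples < 0 or iid_db < 0:
        raise ValueError("itd_samples and iid_db must be non-negative")
    gain = 10.0 ** (-iid_db / 20.0)  # dB attenuation to linear amplitude
    near = [float(s) for s in mono] + [0.0] * itd_samples
    far = [0.0] * itd_samples + [gain * s for s in mono]
    return (near, far) if toward_left else (far, near)
```

At a 44.1 kHz sample rate an ITD of about 600 µs is roughly 26 samples. Played over headphones, a signal processed this way is heard to one side but inside the head, which is why the fuller set of spatial cues is needed for externalization.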
http://www.engr.sjsu.edu/~knapp/HCIROD3D/3D_sys1/binaural.htm
Binaural recording
• Head related transfer function (HRTF)
– describes how sound at a given location is transformed (by pinnae etc.) as it travels to the ear, as a function of frequency
– function of source direction and distance and frequency (4D)
– equivalent to the Fourier transform of the response to an impulse source at the desired position
– IID and ITD as well as spectral cues are incorporated (interaural differences in HRTF)
[Figure: panels for source distances of 0.15 m and 1.0 m; Shilling & Shinn-Cunningham 2001]
• To simulate a source at a given location
– correct HRTF for response of the speaker system
– convolve source with impulse response corresponding to corrected HRTF.
– multiple sources possible by adding up HRTF transformed signals
• To measure HRTF
– place microphones in ear canals
– measure microphone response to short clicks at various locations
– correct for response characteristics of microphones
• Lengthy, painstaking process.
• Storage requirements for dense sampling
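The simulation steps (convolve each source with the impulse response for its location, then add the results) can be sketched in plain Python. This is a toy illustration: the function names are hypothetical, the impulse responses are assumed to be already corrected and supplied by the caller, and a real system would use measured responses and FFT-based convolution for speed.

```python
def convolve(signal, impulse_response):
    """Direct-form discrete convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(source, hrir_left, hrir_right):
    """Filter a mono source with the left- and right-ear impulse responses
    for one source position; returns (left, right)."""
    return convolve(source, hrir_left), convolve(source, hrir_right)

def mix(signals):
    """Sum several signals sample by sample, treating missing samples as 0."""
    length = max(len(s) for s in signals)
    return [sum(s[n] for s in signals if n < len(s)) for n in range(length)]
```

For several sources, each is passed through `spatialize` with the responses for its own position, and the left and right outputs are then combined separately with `mix`.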
Cohen and Wenzel, 1995
• Limitations in practice:
– sampling: often one distance and limited number of directions
– interpolated for other locations
– generic versus individualized HRTFs (front/back confusion and elevation errors)
– HRTF is a characteristic of the user and does not model effects of environment.
– need to track head position. Delay can be problematic.
HRTF measurement using model head (KEMAR)
Room Modeling
• Can model the effects of reverberation, echoes etc. for a room transfer function
– Vary with listener and source position
– can have very long response
– combinatorially impractical
• There has been an effort to develop efficient methods for acoustic modeling of rooms
• Improves realism and distance estimation but difficult for real-time immersive VEs
Shilling & Shinn-Cunningham 2001
Speaker Systems
• Spatialised audio complicated by fact that both ears hear each speaker and that reverberation will occur
• Effectiveness is sensitive to speaker placement
• Stereo speakers: sound seems to be localised between the speakers
• increasing number of speakers increases ability to localise sounds (e.g. 5.1 surround sound systems)
• more complex schemes are possible using DSP but very challenging (‘ambisonics’)
– cancel interaural cross-talk based on HRTF corresponding to speaker location
– computations are complex, not robust and must be done in real time if head tracked
Auditory Rendering
• Auditory modeling/rendering of VEs
– sampling
– synthesis of complex sounds
  • spectral
  • physical models
  • granular synthesis
– Filtering: HRTFs, reverberation, room modeling
– Object occlusion, air absorption, Doppler motion