Acoustic effects of variation in vocal effort by men, women and children
Hartmut Traunmüller and Anders Eriksson
assisted by
Anita Andersson, Ingegerd Eklund and Jessika Rundlöv
with financial support from HSFR and NUTEK for the period 94-07-01 -- 96-06-31 and from SU
Acoustic properties of speech sounds vary because of
linguistic - expressive - organic - perspectival
factors
This investigation is mainly concerned with
expressive variation–vocal effort–mode of phonation (whispering vs. phonating).
and interactions with
organic variation–age–sex
(men, women, children)
Vocal effort: a subjective, physiological quantity
Voice level: an acoustic quantity (SPL of a standard utterance measured at a standard distance)
Alternative ways of controlling voice level:
Trained speaker's/singer's technique–More variation in pulmonic pressure–F0 less affected
Ordinary speaker's technique–More variation in vocal fold tension–F0 more affected
Adopted definition & quantification of "vocal effort"
“Vocal effort” = The quantity that ordinary speakers vary when they adapt their speech to the demands of an increased or decreased communicational distance.
Adjusting "loudness level" (Holmberg, Hillman and Perkell, 1988)
Shouting (Rostolland, 1982)
Speaking in noise (Rastatter and Rivers, 1983; Loren, Colcord, and Rastatter, 1986; Van Summers, Pisoni, Bernacki, Pedlow, and Stokes, 1988; Bond, Moore, and Gable, 1989).
Different effects of white and multitalker noise with same SPL
(Rivers and Rastatter, 1985)
Variation in vocal effort affects the shape of the glottal pulses (vocal fold closing velocity and relative closed interval duration)
(Holmberg et al., 1988; Holmberg, Hillman, Perkell, Guiod, and Goldman, 1995; Södersten, Hertegård and Hammarberg, 1995).
... reflected in the spectral emphasis of the partials above the first
(Gauffin and Sundberg, 1989; Childers and Lee, 1991; Granström and Nord, 1992)
Variation in vocal effort affects F1 (Frøkjær-Jensen, 1966; Rostolland and Parant, 1974; Schulman, 1985; Bond et al., 1989; Liénard and Di Benedetto, 1999).
F1 difficult to measure, but more open mouth >> higher F1
Variation in vocal effort affects segment durations
(Fónagy and Fónagy, 1966, Rostolland, 1982, and Bonnot and Chevrie-Muller, 1991)
larger effort: longer vowels but somewhat shorter consonants
SPL as a measure of vocal effort? (Liénard and Di Benedetto, 1999)
SPL plays no major part in judgments of vocal effort
(Rundlöf, 1996; Traunmüller, 1997; Eriksson and Traunmüller, 1999)
SPL varies widely as a function of perspectival factors.
Listeners distinguish variations in a speaker’s vocal effort from variations in their own distance from the speaker.
(Wilkens and Bartel, 1977, Eriksson and Traunmüller, 1999)
Our measure of vocal effort: The average rating, by a group of listeners, of the communicational distance for each stimulus.
Our aim: Acquire detailed quantitative knowledge about those acoustic effects of variations in vocal effort that are of perceptual importance.
Relevant to:
Speech synthesis with desired paralinguistic quality
Automatic recognition of linguistic information
Automatic recognition of expressive information
Automatic recognition of organic information
Conversion of paralinguistic quality
Automatic speech-to-speech translation with conserved paralinguistic quality
Theories of speech: The Modulation Theory
Subjects
6 male adults, 20–51 years
6 female adults, 20–38 years
4 boys, 7 years
4 girls, 7 years
all speaking Stockholm Swedish
Speech material
Anita: “Hur många kort tog du av varje färg?” (“How many cards did you take of each colour?”)
Jag tog ett violett, åtta svarta och sex vita [ ] (“I took one violet, eight black and six white”)
5 phonated and 2 whispered versions
Recording Place: Långängen, Lidingö
DAT-recorder
High quality microphone, wind protected, 50 mm from speaker's lips
Stepwise attenuator 0, 8, 16, 24, and 32 dB
Sampling at 16 kHz, 16 bits per sample
HP-filtering at 70 Hz, 48 dB/octave
ESPS/Waves
For formant frequency measurements resampled at 6.4 kHz for men, 8 kHz for women and 10.667 kHz for children.
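The three formant-analysis rates are close to simple rational fractions of the 16 kHz recording rate. A minimal sketch of that arithmetic follows; the function name `resampling_ratio` and the denominator limit of 10 are illustrative assumptions, not part of the original procedure.

```python
from fractions import Fraction

ORIGINAL_RATE_HZ = 16_000  # sampling rate stated in the recording setup

# Target rates for formant measurement, as listed above (in Hz).
TARGET_RATES_HZ = {"men": 6_400, "women": 8_000, "children": 10_667}


def resampling_ratio(target_hz, original_hz=ORIGINAL_RATE_HZ, max_denominator=10):
    """Return the simplest fraction that approximates target/original.

    max_denominator is an assumed limit chosen for illustration.
    """
    return Fraction(target_hz, original_hz).limit_denominator(max_denominator)


if __name__ == "__main__":
    for group, rate in TARGET_RATES_HZ.items():
        ratio = resampling_ratio(rate)
        print(f"{group}: resample by {ratio}, analysis bandwidth {rate / 2:.0f} Hz")
```

Under these assumptions the ratios come out as 2/5, 1/2 and 2/3, so the analysis bandwidth widens from men to women to children.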
Table I. Distances between speaker and addressee. The full range was used for phonated speech. Whispered speech was only used at the two shortest distances.
Version 1 2 3 4 5
Distance (m) 0.3 1.5 7.5 37.5 187.5
Acoustic measurements
Sound pressure levels:
    SPLv (voiced and potentially voiced segments)
    SPLs (three [s]-es)
    SPL0 (voiced segments LP filtered at 1.5 F0mean, 18 dB/oct.)
Spectral emphasis:
    SPLv - SPL0
Fundamental frequency:
    F0 (mean and SD, excl. creaky voiced sections)
Formant frequencies:
    F1a (average of four [a]-s)
    F3 (average of voiced and potentially voiced segments)
Segment durations:
    durV (average of 14 vowels, 3 [v] and 1 [j])
    durC (average of 8 stops, 3 [s] and 1 [l])
The measure of vocal effort
Exp. 1: 20 listeners, phonated utterances, original SPL
Exp. 2: 20 listeners, phonated utterances, SPL random +/- 6 dB
Geometric means of distances in meters
Real        0.375   1.5    7.5    37.5   187.5
Estimated   0.47    0.69   1.9    7.5    31
Exp. 2 (dep.) vs. Exp. 1 (indep.): r = 0.993, slope = 0.93. Estimated (dep.) vs. real distance (indep.): r = 0.90
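The measure described here is a geometric mean of listeners' distance ratings, and the caption of Fig. 4 expresses it as a vocal effort level VEL = 2log(d). A minimal sketch of both steps follows. The four ratings in `ratings_m` are hypothetical values made up for illustration; the list `estimated_m` repeats the estimated distances given above.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed as the exponential of the mean log value."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def vocal_effort_level(distance_m):
    """VEL = 2log(d): base-2 logarithm of the perceived distance in meters."""
    return math.log2(distance_m)


# Hypothetical ratings (in meters) by four listeners for one stimulus.
ratings_m = [1.0, 2.0, 2.0, 4.0]

# Estimated distances reported above for the five phonated versions.
estimated_m = [0.47, 0.69, 1.9, 7.5, 31]

if __name__ == "__main__":
    d = geometric_mean(ratings_m)
    print(f"geometric mean {d:.2f} m, VEL {vocal_effort_level(d):.2f}")
    for d in estimated_m:
        print(f"{d:6.2f} m, VEL {vocal_effort_level(d):+.2f}")
```

Averaging on a logarithmic scale keeps a single very large rating from dominating the mean, which fits a quantity that spans several powers of five.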
Rundlöf, J. (1996). Perceptuella ledtrådar vid auditiv bedömning av avståndet mellan talare och lyssnare. D-uppsats, lingvistik, SU.
Extrinsic factors
(1) Communicational distance: 2log(distance in meters)
(2) “Closeness”: e^(1-n) (see Fig. 1)
(3) Wind noise: wind velocity in m/s
(4) Speaker age: 2log(age in years)
(5) Boyhood (1, 0)
(6) Manhood (1, 0)
(7) Speaker-specific constants (speaker-specific average prediction error)
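The factors above can be encoded as one row of regressors per utterance. The sketch below rests on two assumptions that the slides do not state explicitly: that n in the closeness term is the version number of Table I, and that the term is the exponential e^(1-n). The function name `predictors` and its dictionary keys are hypothetical, and factor (7) is left out because it is a per-speaker residual rather than a predictor.

```python
import math

# Distances from Table I, indexed by version number n = 1..5.
DISTANCES_M = {1: 0.3, 2: 1.5, 3: 7.5, 4: 37.5, 5: 187.5}


def predictors(version, age_years, is_boy=False, is_man=False, wind_m_s=0.0):
    """Encode factors (1) to (6) for one utterance as a dict of regressors."""
    return {
        "lg2_distance": math.log2(DISTANCES_M[version]),  # factor (1)
        "closeness": math.exp(1 - version),  # factor (2), assumed to be e^(1-n)
        "wind": wind_m_s,                    # factor (3)
        "lg2_age": math.log2(age_years),     # factor (4)
        "boyhood": 1 if is_boy else 0,       # factor (5)
        "manhood": 1 if is_man else 0,       # factor (6)
    }


if __name__ == "__main__":
    # A 7-year-old boy speaking at the shortest distance, with no wind.
    print(predictors(version=1, age_years=7, is_boy=True))
```

With this encoding the closeness term equals 1 at the shortest distance and decays quickly, so it mainly separates the 0.3 m version from the rest.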
FIG. 1. The average sound pressure level (SPLv), with an arbitrary reference, of the voiced and potentially voiced segments in the phonated and whispered utterances produced by men, women, boys, and girls.
[Figure: SPLv (dB) plotted against distance (m) at 0.3, 1.5, 7.5, 37.5 and 187.5 m, for phonated and whispered utterances.]
FIG. 2. The contribution of the environmental and speaker-specific factors (1) communicational distance, (2) “closeness”, (3) wind noise, (4) speaker age, (5) boyhood, (6) manhood, and (7) speaker-specific constants, to the variation in acoustic variables measured in the phonated utterances. These variables were (from left to right) SPLv, SPL0, spectral emphasis (SPLv–SPL0), SPLs, utterance average F0, F1a, F3, and the durations of vowel-like (durV) and consonantal segments (durC).
[Figure: standardized weight of each factor (Lg2(d), Closeness, Wind noise, Lg2(age), Boyhood, Manhood, Speaker spec. const.) for each acoustic property.]
Sound pressure levels
The dependent variables were SPLv, SPL0, spectral emphasis (SPLv–SPL0), and SPLs, for all of which the effect is expressed in dB.
                                 SPLv       SPL0       Emph.     SPLs
r2                               0.94       0.92       0.79      0.79
r2, speaker specific             0.98       0.96       0.90      0.88
Reference value                  58.6 dB    53.8 dB    4.9 dB    47.7 dB
1. Distance doubled              +4.6 *     +3.3 *     +1.4 *    +2.0 *
   Distance fivefold             +10.8 *    +7.6 *     +3.2 *    +4.6 *
2. “Closeness” 0.3 vs. 1.5 m     +9.5 *     +7.0 *     +2.6 *    +2.6 *
   “Closeness” 0.3 m vs.         +15.0 *    +11.0 *    +4.1 *    +4.1 *
3. Wind velocity +1 m/s          +0.6 *     +0.5 *     +0.1 *    +0.6 *
4. Speaker age 30 vs. 7.5 yrs.   +3.7 *     +2.7 *     +1.0 *    +9.0 *
5. Boy                           +4.2 *     +4.2 *     –0.0 *    +6.4 *
6. Man                           –0.5 *     –1.1 *     +0.7 *    +2.2 *
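The dB effects in this table are linear in the regressors, so an effect for any other distance ratio follows by scaling. The sketch below reproduces two SPLv entries. It assumes that the closeness weight is e^(1-n) with n the version number, and that the second “Closeness” row gives the full coefficient; neither is stated explicitly on the slide. The function names are hypothetical.

```python
import math

SPLV_PER_DOUBLING_DB = 4.6     # "Distance doubled" entry for SPLv
SPLV_CLOSENESS_FULL_DB = 15.0  # second "Closeness" entry for SPLv


def distance_effect_db(per_doubling_db, distance_ratio):
    """Level change for a distance ratio, linear in 2log(distance)."""
    return per_doubling_db * math.log2(distance_ratio)


def closeness_effect_db(full_effect_db, version):
    """Closeness contribution at version n, assuming a weight of e^(1-n)."""
    return full_effect_db * math.exp(1 - version)


if __name__ == "__main__":
    fivefold = distance_effect_db(SPLV_PER_DOUBLING_DB, 5)
    step = (closeness_effect_db(SPLV_CLOSENESS_FULL_DB, 1)
            - closeness_effect_db(SPLV_CLOSENESS_FULL_DB, 2))
    print(f"fivefold distance: {fivefold:+.1f} dB")
    print(f"closeness, 0.3 vs. 1.5 m: {step:+.1f} dB")
```

The fivefold value comes out slightly below the tabulated +10.8 dB, presumably because the per-doubling coefficient is rounded to one decimal. The closeness step matches the tabulated +9.5 dB under the stated assumption.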
Table III. Occurrence of creaky voice, in % of the total duration of the voiced segments.
         0.3    1.5    7.5    37.5   187.5 m   Mean
Men      1.4    7.8    0.5    0.7    0.0       2.1
Women    7.1    4.4    1.7    0.5    0.0       2.7
F0 and formant frequencies
The dependent variables were F0, F1 of the [a]-segments, and F3 of the voiced segments, for all of which the effect is expressed as a factor.
                                 F0        F1a       F3
r2                               0.91      0.84      0.93
r2, speaker specific             0.97      0.93      0.97
Reference value                  175 Hz    580 Hz    2687 Hz
1. Distance doubled              1.13 *    1.08 *    1.00 *
   Distance fivefold             1.37 *    1.19 *    1.01 *
2. “Closeness” 0.3 vs. 1.5 m     1.36 *    1.09 *    1.01 *
   “Closeness” 0.3 m vs.         1.63 *    1.15 *    1.02 *
3. Wind velocity +1 m/s          1.04 *    1.03 *    1.01 *
4. Speaker age 30 vs. 7.5 yrs.   0.74 *    0.79 *    0.75 *
5. Boy                           1.00 *    1.05 *    1.00 *
6. Man                           0.61 *    0.84 *    0.88 *
Table IV. Mean values and standard deviations of F0 as a function of distance. Standard deviations also expressed in semitones.
m Men Hz st Women Hz st Boys Hz st Girls Hz st
Segment durations
The dependent variables were the durations of vowel-like (durV) and consonantal segments (durC), for which the effect is expressed as a factor.
                                 durV      durC
r2                               0.66      0.28
r2, speaker specific             0.88      0.63
Reference value                  58 ms     70 ms
1. Distance doubled              1.11 *    1.02 *
   Distance fivefold             1.27 *    1.04 *
2. “Closeness” 0.3 vs. 1.5 m     1.35 *    1.17 *
   “Closeness” 0.3 m vs.         1.61 *    1.27 *
3. Wind velocity +1 m/s          1.05 *    1.00 *
4. Speaker age 30 vs. 7.5 yrs.   0.69 *    0.85 *
5. Boy                           0.99 *    0.93 *
6. Man                           1.04 *    1.00 *
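Because these effects are expressed as factors, they combine by multiplication and scale as powers in the logarithmic regressors. The sketch below rescales a tabulated factor to a different ratio of distance or age. The function name `factor_for_ratio` is hypothetical, and the calculation assumes that each effect is exactly linear in the base-2 logarithm of the regressor.

```python
import math


def factor_for_ratio(table_factor, table_ratio, new_ratio):
    """Rescale a multiplicative effect given for table_ratio to new_ratio.

    Assumes the effect is linear in the logarithm of the regressor, so the
    factor is raised to the power log(new_ratio) / log(table_ratio).
    """
    return table_factor ** (math.log(new_ratio) / math.log(table_ratio))


if __name__ == "__main__":
    # durV: from "distance doubled" (1.11) to a fivefold increase in distance.
    print(f"durV, fivefold distance: {factor_for_ratio(1.11, 2, 5):.2f}")
    # durV: from "30 vs. 7.5 yrs." (a ratio of 4) to one doubling of age.
    print(f"durV, age doubled: {factor_for_ratio(0.69, 4, 2):.2f}")
```

For durV the rescaled fivefold factor agrees with the tabulated 1.27 to within rounding.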
Table V. The mean pausing time, in ms, in all phonated and whispered utterances after the word listed in the first column.
Position    Girls   Boys   Women   Men
Jag         0       0      0       0
tog         252     68     51      29
ett         10 *    12     13      10
violett,    167     192    237     146
åtta        3       0      0       0
svarta      64      160    148     65
och         20      16     0       0
sex         6       17     6       15
vita.       522     465    455     265
FIG. 3. The mean of the total pause duration (in ms) in phonated and whispered utterances shown as a function of the communicational distance for men, women, boys, and girls.
[Figure: pause duration (ms) plotted against distance (m), for phonated utterances at 0.3, 1.5, 7.5, 37.5 and 187.5 m and whispered utterances at 0.3 and 1.5 m.]
FIG. 4. SPLv (above), SPLs (middle), and the spectral emphasis SPLv–SPL0 (below) shown as a function of vocal effort level VEL = 2log(d), where d is the perceived communicational distance in meters. Regression lines fitted to the whole set of data for SPLv and emphasis, and to those obtained from each speaker group, men (solid lines), women (broken), boys (dashed), and girls (dotted) for SPLs.
Men     SPLv = 20.956 + 1.556 VEL (r = 0.99)
Women   SPLv = 20.413 + 1.609 VEL (r = 0.99)
Boys    SPLv = 21.901 + 1.477 VEL (r = 0.98)
Girls   SPLv = 20.631 + 1.490 VEL (r = 0.98)
Men     SPLs = 17.199 + 1.120 VEL (r = 0.94)
Women   SPLs = 16.585 + 0.696 VEL (r = 0.92)
Boys    SPLs = 16.244 + 0.623 VEL (r = 0.72)
Girls   SPLs = 14.087 + 0.391 VEL (r = 0.83)
Men     SPLv–SPL0 = 2.275 + 0.435 VEL (r = 0.88)
Women   SPLv–SPL0 = 1.618 + 0.553 VEL (r = 0.95)
Boys    SPLv–SPL0 = 1.973 + 0.373 VEL (r = 0.92)
Girls   SPLv–SPL0 = 1.901 + 0.522 VEL (r = 0.92)
[Figure: SPLv, SPLs and emphasis (dB) plotted against Vocal Effort Level.]
Fig. 5. Mean values of F0, F1a, and F3, shown as a function of VEL for men, women, boys, and girls. Regression lines fitted to each variable (solid, dotted, broken lines) and speaker group.
Equations of the regression lines:
Men     2logF0 = 6.918 + 0.217 VEL (r = 0.98)
Women   2logF0 = 7.792 + 0.162 VEL (r = 0.94)
Boys    2logF0 = 8.248 + 0.154 VEL (r = 0.93)
Girls   2logF0 = 8.331 + 0.132 VEL (r = 0.95)
Men     2logF1a = 9.126 + 0.095 VEL (r = 0.91)
Women   2logF1a = 9.368 + 0.128 VEL (r = 0.95)
Boys    2logF1a = 9.764 + 0.155 VEL (r = 0.93)
Girls   2logF1a = 9.746 + 0.172 VEL (r = 0.94)
Men     2logF3 = 11.217 + 0.003 VEL (r = 0.12)
Women   2logF3 = 11.473 + 0.017 VEL (r = 0.55)
Boys    2logF3 = 11.857 + 0.000 VEL (r = 0.02)
Girls   2logF3 = 11.871 + 0.006 VEL (r = 0.18)
[Figure: F0, F1a and F3 (Hz, logarithmic scale) plotted against Vocal Effort Level.]
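The regression lines for F0 in Fig. 5 are linear in 2log(F0), so a value in Hz follows by raising 2 to the fitted value. A minimal sketch follows, with the intercepts and slopes copied from the F0 equations of Fig. 5; the function name `predicted_f0_hz` is hypothetical.

```python
# Intercept a and slope b of 2log(F0) = a + b * VEL, per speaker group,
# copied from the F0 regression equations given with Fig. 5.
F0_REGRESSION = {
    "men": (6.918, 0.217),
    "women": (7.792, 0.162),
    "boys": (8.248, 0.154),
    "girls": (8.331, 0.132),
}


def predicted_f0_hz(group, vel):
    """Evaluate the regression line and convert from 2log(Hz) back to Hz."""
    intercept, slope = F0_REGRESSION[group]
    return 2.0 ** (intercept + slope * vel)


if __name__ == "__main__":
    for group in F0_REGRESSION:
        low = predicted_f0_hz(group, 0.0)   # VEL 0: perceived distance of 1 m
        high = predicted_f0_hz(group, 4.0)  # VEL 4: perceived distance of 16 m
        print(f"{group}: {low:.0f} Hz at VEL 0, {high:.0f} Hz at VEL 4")
```

A slope b means that F0 is multiplied by 2^b for each doubling of the perceived distance, which is about 1.16 for the men.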
Fig. 6. Mean values of F0, F1 of the [a]-segments, and F3, plotted as a function of F0. Regression lines shown for each variable and speaker group, men (solid lines), women (broken), boys (dashed), and girls (dotted).
[Figure: F0, F1a and F3 (Hz, logarithmic scale) plotted against F0 (Hz, logarithmic scale), with regression lines for F1a and F3 of each speaker group.]
For a 100% increase in F0, F1a increased by
42% for men (r = 0.90),
71% for women (r = 0.92),
95% for boys (r = 0.94),
124% for girls (r = 0.94).
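These percentages can be turned into slopes on the doubly logarithmic axes of Fig. 6 and then applied to other changes in F0. The sketch assumes a power-law relation between F1a and F0 (a straight line on log-log axes), which is how the regression lines appear to be drawn; the function names are hypothetical.

```python
import math

# Increase in F1a (percent) for a 100% increase in F0, as listed above.
F1A_INCREASE_PERCENT = {"men": 42, "women": 71, "boys": 95, "girls": 124}


def loglog_slope(percent_increase):
    """Slope of 2log(F1a) against 2log(F0) implied by the stated increase."""
    return math.log2(1 + percent_increase / 100)


def f1a_ratio(percent_increase, f0_ratio):
    """F1a ratio expected for a given F0 ratio under the power-law assumption."""
    return f0_ratio ** loglog_slope(percent_increase)


if __name__ == "__main__":
    for group, pct in F1A_INCREASE_PERCENT.items():
        slope = loglog_slope(pct)
        raised = (f1a_ratio(pct, 1.5) - 1) * 100
        print(f"{group}: slope {slope:.2f}, F1a +{raised:.0f}% for F0 +50%")
```

A slope of 1 would mean that F1a rises in proportion to F0. On this reading the men and women stay below that value, the boys come close to it, and the girls exceed it.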
There is a positive correlation between F1 and F0
(large effect)
in realizations of the same linguistic strings
by speakers who differ in age and/or sex,
and by the same speakers who alter their pitch register.
“Intrinsic pitch”: a negative correlation between F1
and F0 (small effect)
in vowels produced by a given speaker in the same linguistic and paralinguistic context.
Increases in vocal effort involve simultaneously:
> subglottal pressure ( > SPL, … )
> vocal fold tension ( > F0, … )
> vocal tract openness ( > F1, … )
Recognition of vocal effort
Correlation coefficients of acoustic variables with vocal effort level (VEL)
SPL0 0.95 (exceptional)
SPLv 0.98 (exceptional)
(SPLv–SPL0 ) 0.90
F0 and F3 0.87
F0, F3, and Emph 0.96
F0, F3, Emph, 2log(durV/durC) 0.97 (std.err of est. 0.64 units)
Whispering [no F0, no spectral emphasis]
F3, F1a, and 2log(durV /durC) 0.90
Fig. 7. Mean durations of vowel-like segments (above) and consonantal segments (below) shown as a function of VEL. Locally weighted least squares regression lines fitted to the data obtained from each speaker group, men (solid lines), women (broken), boys (dashed), and girls (dotted).
Equations for 2log(durV/durC):
Men     2log(durV/durC) = 0.117 0.122 VEL (r = 0.82)
Women   2log(durV/durC) = 0.066 0.149 VEL (r = 0.84)
Boys    2log(durV/durC) = 0.410 0.147 VEL (r = 0.90)
Girls   2log(durV/durC) = 0.382 0.108 VEL (r = 0.71)
[Figure: segment duration (ms, logarithmic scale) of vowel-like segments (above) and consonantal segments (below) plotted against Vocal Effort Level.]
Table VI. Mean values and standard deviations of differences between whispered and voiced versions of the same utterance produced by the same speakers at the same communicational distance (0.3 and 1.5 m). The significance level of the difference between the age groups is also indicated.
        Adults              Children            Sign.
n       23                  15
        Mean    SD          Mean    SD
SPLv    17.7    4.5 dB      20.8    2.0 dB      **
SPLs    0.7     2.7 dB      4.6     2.8 dB      ***
F1a     +24     12 %        +26     12 %        n.s.
F3      +5.1    4 %         +3.3    6.3 %       n.s.
durV    +16     17 %        +7      23 %        n.s.
durC    +11     14 %        14      21 %        ***
Table VII. Mean perceived and calculated distances between speaker and addressee for the phonated versions compared with distances calculated using the same equations for the whispered versions. The independent variables were F1a, F3, durV, and durC.
Perc. dist. (m), phonated     .47    .69    1.9    7.5    31
Calc. dist. (m), phonated     .52    .82    2.0    7.7    22
Calc. dist. (m), whispered    2.0    3.3
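The rows of Table VII can be compared on the VEL scale by taking the base-2 logarithm of the ratio between calculated and perceived distance. The sketch assumes that the two whispered values correspond to the two shortest distances, as stated for Table I, and compares them with the perceived distances of the phonated versions at those distances. The function name `vel_difference` is hypothetical.

```python
import math

# Values copied from Table VII (meters).
perceived_phonated_m = [0.47, 0.69, 1.9, 7.5, 31]
calculated_phonated_m = [0.52, 0.82, 2.0, 7.7, 22]
calculated_whispered_m = [2.0, 3.3]  # assumed: the two shortest distances


def vel_difference(calculated_m, perceived_m):
    """Difference in VEL units: 2log(calculated) minus 2log(perceived)."""
    return math.log2(calculated_m / perceived_m)


if __name__ == "__main__":
    for calc, perc in zip(calculated_phonated_m, perceived_phonated_m):
        print(f"phonated:  {vel_difference(calc, perc):+.2f} VEL units")
    for calc, perc in zip(calculated_whispered_m, perceived_phonated_m):
        print(f"whispered: {vel_difference(calc, perc):+.2f} VEL units")
```

On this comparison the equations place the whispered versions roughly two VEL units above the phonated versions produced at the same distances.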
Fig. 8. The gross difference in spectral energy distribution between whispered and phonated versions of the same utterance produced by men, women, boys, and girls at the same communicational distance (0.3 and 1.5 m), based on level measurements in frequency bands covering 3 critical bands with overlap.
[Figure: level difference (dB) plotted against center frequency (Hz, 100 to 10000 Hz), for women (w), men (m), girls (g) and boys (b).]