Acoustic effects of variation in vocal effort by men, women and children

Hartmut Traunmüller and Anders Eriksson

assisted by

Anita Andersson, Ingegerd Eklund and Jessika Rundlöf

with financial support from HSFR and NUTEK for the period 94-07-01 -- 96-06-31 and from SU

Acoustic properties of speech sounds vary because of linguistic, expressive, organic, and perspectival factors.

This investigation is mainly concerned with expressive variation (vocal effort; mode of phonation: whispering vs. phonating) and its interactions with organic variation (age and sex: men, women, children).

Vocal effort: a subjective, physiological quantity

Voice level: an acoustic quantity (SPL of a standard utterance measured at a standard distance)

Alternative ways of controlling voice level:

Trained speaker's/singer's technique: more variation in pulmonic pressure; F0 less affected

Ordinary speaker's technique: more variation in vocal fold tension; F0 more affected

Adopted definition & quantification of "vocal effort"

“Vocal effort” = The quantity that ordinary speakers vary when they adapt their speech to the demands of an increased or decreased communicational distance.

Adjusting “loudness level” (Holmberg, Hillman and Perkell, 1988)

Shouting (Rostolland, 1982)

Speaking in noise (Rastatter and Rivers, 1983; Loren, Colcord, and Rastatter, 1986; Van Summers, Pisoni, Bernacki, Pedlow, and Stokes, 1988; Bond, Moore, and Gable, 1989).

Different effects of white and multitalker noise with the same SPL (Rivers and Rastatter, 1985)

Variation in vocal effort affects the shape of the glottal pulses (vocal fold closing velocity and relative closed interval duration) (Holmberg et al., 1988; Holmberg, Hillman, Perkell, Guiod, and Goldman, 1995; Södersten, Hertegård and Hammarberg, 1995).

... reflected in the spectral emphasis of the partials above the first (Gauffin and Sundberg, 1989; Childers and Lee, 1991; Granström and Nord, 1992)

Variation in vocal effort affects F1 (Frøkjær-Jensen, 1966; Rostolland and Parant, 1974; Schulman, 1985; Bond et al., 1989; Liénard and Di Benedetto, 1999).

F1 is difficult to measure, but a more open mouth gives a higher F1

Variation in vocal effort affects segment durations

(Fónagy and Fónagy, 1966, Rostolland, 1982, and Bonnot and Chevrie-Muller, 1991)

larger effort: longer vowels but somewhat shorter consonants

SPL as a measure of vocal effort? (Liénard and Di Benedetto, 1999)

SPL plays no major part in judgments of vocal effort

(Rundlöf, 1996; Traunmüller, 1997; Eriksson and Traunmüller, 1999)

SPL varies widely as a function of perspectival factors.

Listeners distinguish variations in a speaker’s vocal effort from variations in their own distance from the speaker.

(Wilkens and Bartel, 1977, Eriksson and Traunmüller, 1999)

Our measure of vocal effort: The average rating, by a group of listeners, of the communicational distance for each stimulus.

Our aim: Acquire detailed quantitative knowledge about those acoustic effects of variations in vocal effort that are of perceptual importance.

Relevant to:

Speech synthesis with desired paralinguistic quality

Automatic recognition of linguistic information

Automatic recognition of expressive information

Automatic recognition of organic information

Conversion of paralinguistic quality

Automatic speech-to-speech translation with conserved paralinguistic quality

Theories of speech: The Modulation Theory

Subjects

6 male adults, 20–51 years
6 female adults, 20–38 years
4 boys, 7 years
4 girls, 7 years

all speaking Stockholm Swedish

Speech material

Anita: “Hur många kort tog du av varje färg?” (“How many cards did you take of each colour?”)

Jag tog ett violett, åtta svarta och sex vita [ ] (“I took one violet, eight black and six white”)

5 phonated and 2 whispered versions

Recording Place: Långängen, Lidingö

DAT-recorder

High quality microphone, wind protected, 50 mm from speaker's lips

Stepwise attenuator 0, 8, 16, 24, and 32 dB

Sampling at 16 kHz, 16 bits per sample

HP-filtering at 70 Hz, 48 dB/octave

ESPS/Waves

For formant frequency measurements resampled at 6.4 kHz for men, 8 kHz for women and 10.667 kHz for children.

Table I. Distances between speaker and addressee. The full range was used for phonated speech. Whispered speech was only used at the two shortest distances.

Version        1     2     3     4      5
Distance (m)   0.3   1.5   7.5   37.5   187.5

Acoustic measurements

Sound pressure levels
    SPLV (voiced segments & potentially voiced)
    SPLS (three [s]-es)
    SPL0 (voiced segments LP filtered at 1.5 F0mean, 18 dB/oct.)

Spectral emphasis
    SPLV - SPL0

Fundamental frequency
    F0 (mean and SD, excl. creaky voiced sections)

Formant frequencies
    F1a (average of four [a]-s)
    F3 (average of voiced segments & potentially voiced)

Segment durations
    durV (average of 14 vowels, 3 [v] and 1 [j])
    durC (average of 8 stops, 3 [s] and 1 [l])
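The spectral emphasis measure is a difference between two levels, which can be sketched as follows. This is only an illustration: the function names and the synthetic test signal are assumptions, and an ideal low-pass (one that leaves exactly the fundamental) stands in for the 18 dB/oct. filter at 1.5 F0mean used in the study.

```python
import math

def level_db(samples):
    """Level in dB (arbitrary reference) from the mean square of the samples."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)

def spectral_emphasis(full_band, low_passed):
    """Level of the full signal minus the level of its low-passed copy."""
    return level_db(full_band) - level_db(low_passed)

# Hypothetical test signal: a 200 Hz fundamental plus one weaker partial.
FS = 16000   # sampling rate used for the recordings
F0 = 200.0   # assumed fundamental, chosen so that whole periods fit
N = 1600     # 20 full periods of the fundamental

fundamental = [math.sin(2 * math.pi * F0 * n / FS) for n in range(N)]
partial = [0.5 * math.sin(2 * math.pi * 2 * F0 * n / FS) for n in range(N)]
voiced = [a + b for a, b in zip(fundamental, partial)]

# Assumption: an ideal low-pass leaves only the fundamental.
emphasis = spectral_emphasis(voiced, fundamental)
print(f"spectral emphasis: {emphasis:.2f} dB")
```

With a real recording, the low-passed copy would come from the actual filter, and the emphasis would grow as the partials above the first gain energy.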

The measure of vocal effort

Exp. 1: 20 listeners, phonated utterances, original SPL
Exp. 2: 20 listeners, phonated utterances, SPL random +/- 6 dB

Geometric means of distances in meters

Real        0.3    1.5    7.5    37.5   187.5
Estimated   0.47   0.69   1.9    7.5    31

Exp. 2 (dep.) vs. Exp. 1 (indep.): r = 0.993, slope = 0.93. Estimated (dep.) vs. real distance (indep.): r = 0.90

Rundlöf, J. (1996). Perceptuella ledtrådar vid auditiv bedömning av avståndet mellan talare och lyssnare [Perceptual cues in auditory judgment of the distance between speaker and listener]. D-level thesis, linguistics, SU.
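Because distance ratings span several orders of magnitude, they are summarized by geometric rather than arithmetic means. A minimal sketch of that computation follows; the individual ratings are invented for illustration and are not data from the experiments.

```python
import math

def geometric_mean(values):
    """Geometric mean: the exponential of the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical distance ratings (in meters) by four listeners for one stimulus.
ratings_m = [2.0, 5.0, 10.0, 20.0]

gm = geometric_mean(ratings_m)
am = sum(ratings_m) / len(ratings_m)
print(f"geometric mean: {gm:.2f} m, arithmetic mean: {am:.2f} m")
```

The geometric mean is less dominated by a few very large ratings, which suits a scale on which distance is treated logarithmically.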

Extrinsic factors

(1) Communicational distance: 2log(distance in meters)

(2) “Closeness”: e^(1-n) (see Fig. 1)

(3) Wind noise (wind velocity in m/s)

(4) Speaker age: 2log(age in years)

(5) Boyhood (1, 0)

(6) Manhood (1, 0)

(7) Speaker-specific constants (speaker-specific average prediction error)
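The coding of factors (1) to (6) can be sketched as a small function. Two things here are assumptions: that n in the closeness term is the version number from Table I (1 for 0.3 m, 2 for 1.5 m, and so on), and the function and key names, which are purely illustrative. The speaker-specific constant (7) is left out because it is estimated from the prediction errors rather than coded in advance.

```python
import math

def extrinsic_factors(version, distance_m, wind_ms, age_years, group):
    """Code one utterance as a set of predictor values.

    version: assumed to be n in e^(1-n), i.e. the Table I version number.
    group: one of "man", "woman", "boy", "girl" (hypothetical labels).
    """
    return {
        "log2_distance": math.log2(distance_m),  # (1)
        "closeness": math.exp(1 - version),      # (2)
        "wind": wind_ms,                         # (3)
        "log2_age": math.log2(age_years),        # (4)
        "boyhood": 1 if group == "boy" else 0,   # (5)
        "manhood": 1 if group == "man" else 0,   # (6)
    }

# Example: a 30-year-old man speaking at 7.5 m (version 3) in 2 m/s wind.
print(extrinsic_factors(3, 7.5, 2.0, 30, "man"))
```

Under this reading, closeness equals 1 at the shortest distance and decays quickly, so it mainly separates the 0.3 m condition from the others.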

FIG. 1. The average sound pressure level (SPLv), with an arbitrary reference, of the voiced and potentially voiced segments in the phonated and whispered utterances produced by men, women, boys, and girls.

Axes: distance (m), 0.3 to 187.5, phonated and whispered (horizontal); SPLv in dB (vertical).

FIG. 2. The contribution of the environmental and speaker-specific factors (1) communicational distance, (2) “closeness”, (3) wind noise, (4) speaker age, (5) boyhood, (6) manhood, and (7) speaker-specific constants, to the variation in acoustic variables measured in the phonated utterances. These variables were (from left to right) SPLv, SPL0, spectral emphasis (SPLv–SPL0), SPLs, utterance average F0, F1a, F3, and the durations of vowel-like (durV) and consonantal segments (durC).

Axes: acoustic properties (horizontal); standardized weight (vertical).

Sound pressure levels

The dependent variables were SPLv, SPL0, spectral emphasis (SPLv–SPL0), and SPLs, for all of which the effect is expressed in dB.

                                  SPLv      SPL0      Emph.     SPLs
r2                                0.94      0.92      0.79      0.79
r2, speaker specific              0.98      0.96      0.90      0.88
Reference value                   58.6 dB   53.8 dB   4.9 dB    47.7 dB
1. Distance doubled               +4.6 *    +3.3 *    +1.4 *    +2.0 *
   Distance fivefold              +10.8 *   +7.6 *    +3.2 *    +4.6 *
2. “Closeness” 0.3 vs. 1.5 m      +9.5 *    +7.0 *    +2.6 *    +2.6 *
   “Closeness” 0.3 m vs.          +15.0 *   +11.0 *   +4.1 *    +4.1 *
3. Wind velocity +1 m/s           +0.6 *    +0.5 *    +0.1 *    +0.6 *
4. Speaker age 30 vs. 7.5 yrs.    +3.7 *    +2.7 *    +1.0 *    +9.0 *
5. Boy                            +4.2 *    +4.2 *    –0.0 *    +6.4 *
6. Man                            –0.5 *    –1.1 *    +0.7 *    +2.2 *
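Since distance enters the model as 2log(distance), an effect stated per doubling scales to any other distance ratio. The sketch below works this through for SPLv. The function names are illustrative, and reading the second “closeness” row as the full weight of the e^(1-n) term (closeness 1 versus 0) is an assumption, not something the table states.

```python
import math

SPLV_DB_PER_DOUBLING = 4.6    # table row 1, SPLv column
SPLV_CLOSENESS_WEIGHT = 15.0  # assumed: second "closeness" row, SPLv column

def distance_effect_db(distance_ratio, db_per_doubling=SPLV_DB_PER_DOUBLING):
    """Level change in dB when the distance is multiplied by distance_ratio."""
    return db_per_doubling * math.log2(distance_ratio)

def closeness_effect_db(n_a, n_b, weight=SPLV_CLOSENESS_WEIGHT):
    """Level difference due to closeness e^(1-n) between versions n_a and n_b."""
    return weight * (math.exp(1 - n_a) - math.exp(1 - n_b))

print(f"fivefold distance: {distance_effect_db(5.0):+.1f} dB")
print(f"closeness, 0.3 vs. 1.5 m: {closeness_effect_db(1, 2):+.1f} dB")
```

The fivefold figure lands slightly below the tabulated +10.8 dB, presumably because the per-doubling value is itself rounded; the closeness figure agrees with the tabulated +9.5 dB.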

Table III. Occurrence of creaky voice, in % of the total duration of the voiced segments.

         0.3    1.5    7.5    37.5   187.5 m
Men      1.4    7.8    0.5    0.7    0.0       2.1
Women    7.1    4.4    1.7    0.5    0.0       2.7

F0 and formant frequencies

The dependent variables were F0, F1 of the [a]-segments, and F3 of the voiced segments, for all of which the effect is expressed as a factor.

                                  F0        F1a       F3
r2                                0.91      0.84      0.93
r2, speaker specific              0.97      0.93      0.97
Reference value                   175 Hz    580 Hz    2687 Hz
1. Distance doubled               1.13 *    1.08 *    1.00 *
   Distance fivefold              1.37 *    1.19 *    1.01 *
2. “Closeness” 0.3 vs. 1.5 m      1.36 *    1.09 *    1.01 *
   “Closeness” 0.3 m vs.          1.63 *    1.15 *    1.02 *
3. Wind velocity +1 m/s           1.04 *    1.03 *    1.01 *
4. Speaker age 30 vs. 7.5 yrs.    0.74 *    0.79 *    0.75 *
5. Boy                            1.00 *    1.05 *    1.00 *
6. Man                            0.61 *    0.84 *    0.88 *
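Effects expressed as factors can be restated in semitones, the unit also used for the standard deviations in Table IV, or as percentages. The helper names below are illustrative; the conversion itself is the standard 12 semitones per octave.

```python
import math

def factor_to_semitones(factor):
    """Frequency ratio expressed in semitones (12 per octave)."""
    return 12.0 * math.log2(factor)

def factor_to_percent(factor):
    """Frequency ratio expressed as a percentage change."""
    return (factor - 1.0) * 100.0

# F0 column of the table above.
for label, factor in [("distance doubled", 1.13), ("man", 0.61)]:
    print(f"{label}: {factor_to_semitones(factor):+.2f} st, "
          f"{factor_to_percent(factor):+.0f} %")
```

On this scale, doubling the distance raises F0 by roughly two semitones.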

Table IV. Mean values and standard deviations of F0 as a function of distance. Standard deviations also expressed in semitones.

m Men Hz st Women Hz st Boys Hz st Girls Hz st

Segment durations

The dependent variables were the durations of vowel-like (durV) and consonantal segments (durC), for which the effect is expressed as a factor.

                                  durV      durC
r2                                0.66      0.28
r2, speaker specific              0.88      0.63
Reference value                   58 ms     70 ms
1. Distance doubled               1.11 *    1.02 *
   Distance fivefold              1.27 *    1.04 *
2. “Closeness” 0.3 vs. 1.5 m      1.35 *    1.17 *
   “Closeness” 0.3 m vs.          1.61 *    1.27 *
3. Wind velocity +1 m/s           1.05 *    1.00 *
4. Speaker age 30 vs. 7.5 yrs.    0.69 *    0.85 *
5. Boy                            0.99 *    0.93 *
6. Man                            1.04 *    1.00 *

Table V. The mean pausing time, in ms, in all phonated and whispered utterances after the word listed in the first column.

Position    Girls   Boys    Women   Men
Jag         0       0       0       0
tog         252     68      51      29
ett         10 *    12      13      10
violett,    167     192     237     146
åtta        3       0       0       0
svarta      64      160     148     65
och         20      16      0       0
sex         6       17      6       15
vita.       522     465     455     265

FIG. 3. The mean of the total pause duration (in ms) in phonated and whispered utterances shown as a function of the communicational distance for men, women, boys, and girls.

Axes: distance (m), 0.3 to 187.5 for phonated and 0.3 to 1.5 for whispered utterances (horizontal); pause duration in ms (vertical).

FIG. 4. SPLv (above), SPLs (middle), and the spectral emphasis SPLv–SPL0 (below) shown as a function of vocal effort level VEL = 2log(d), where d is the perceived communicational distance in meters. Regression lines are fitted to the whole set of data for SPLv and emphasis and, for SPLs, to the data from each speaker group: men (solid lines), women (broken), boys (dashed), and girls (dotted).

Men    SPLv = 20.956 + 1.556 VEL (r = 0.99)
Women  SPLv = 20.413 + 1.609 VEL (r = 0.99)
Boys   SPLv = 21.901 + 1.477 VEL (r = 0.98)
Girls  SPLv = 20.631 + 1.490 VEL (r = 0.98)

Men    SPLs = 17.199 + 1.120 VEL (r = 0.94)
Women  SPLs = 16.585 + 0.696 VEL (r = 0.92)
Boys   SPLs = 16.244 + 0.623 VEL (r = 0.72)
Girls  SPLs = 14.087 + 0.391 VEL (r = 0.83)

Men    SPLv–SPL0 = 2.275 + 0.435 VEL (r = 0.88)
Women  SPLv–SPL0 = 1.618 + 0.553 VEL (r = 0.95)
Boys   SPLv–SPL0 = 1.973 + 0.373 VEL (r = 0.92)
Girls  SPLv–SPL0 = 1.901 + 0.522 VEL (r = 0.92)
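The vocal effort level is simply the base-2 logarithm of the perceived distance, so each unit of VEL corresponds to a doubling. A short sketch of the conversion in both directions follows, applied to the mean perceived distances reported for the five phonated versions; the function names are illustrative.

```python
import math

def vocal_effort_level(perceived_distance_m):
    """VEL = 2log(d), with d the perceived distance in meters."""
    return math.log2(perceived_distance_m)

def distance_from_vel(vel):
    """Inverse mapping: perceived distance in meters for a given VEL."""
    return 2.0 ** vel

# Mean perceived distances (m) for the five phonated versions.
perceived = [0.47, 0.69, 1.9, 7.5, 31]
for d in perceived:
    print(f"d = {d:5.2f} m  ->  VEL = {vocal_effort_level(d):+.2f}")
```

A perceived distance of 1 m gives VEL = 0, and distances below 1 m give negative values.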

Axes: vocal effort level (horizontal); SPLv, SPLs, and emphasis in dB (vertical).

Fig. 5. Mean values of F0, F1a, and F3, shown as a function of VEL for men, women, boys, and girls. Regression lines fitted to each variable (solid, dotted, broken lines) and speaker group.

Equations of the regression lines:

Men    2logF0 = 6.918 + 0.217 VEL (r = 0.98)
Women  2logF0 = 7.792 + 0.162 VEL (r = 0.94)
Boys   2logF0 = 8.248 + 0.154 VEL (r = 0.93)
Girls  2logF0 = 8.331 + 0.132 VEL (r = 0.95)

Men    2logF1a = 9.126 + 0.095 VEL (r = 0.91)
Women  2logF1a = 9.368 + 0.128 VEL (r = 0.95)
Boys   2logF1a = 9.764 + 0.155 VEL (r = 0.93)
Girls  2logF1a = 9.746 + 0.172 VEL (r = 0.94)

Men    2logF3 = 11.217 + 0.003 VEL (r = 0.12)
Women  2logF3 = 11.473 + 0.017 VEL (r = 0.55)
Boys   2logF3 = 11.857 + 0.000 VEL (r = 0.02)
Girls  2logF3 = 11.871 + 0.006 VEL (r = 0.18)
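The F0 regression lines can be turned back into frequencies by raising 2 to the predicted value. The sketch below assumes that 2log denotes the base-2 logarithm of the frequency in Hz; the dictionary and function names are illustrative.

```python
# Intercept and slope of 2logF0 against VEL, per speaker group (from the equations above).
F0_REGRESSION = {
    "men": (6.918, 0.217),
    "women": (7.792, 0.162),
    "boys": (8.248, 0.154),
    "girls": (8.331, 0.132),
}

def predicted_f0_hz(group, vel):
    """F0 in Hz predicted by the regression line, assuming 2log is log base 2 of Hz."""
    intercept, slope = F0_REGRESSION[group]
    return 2.0 ** (intercept + slope * vel)

# Predicted F0 at VEL = 0 (perceived distance 1 m) and VEL = 4 (16 m).
for group in F0_REGRESSION:
    print(f"{group:6s} {predicted_f0_hz(group, 0):6.0f} Hz "
          f"{predicted_f0_hz(group, 4):6.0f} Hz")
```

Because the slope is largest for men, the predicted F0 values of the groups converge as vocal effort increases.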

Axes: vocal effort level (horizontal); F0, F1a, and F3 in Hz, logarithmic scale (vertical).

Fig. 6. Mean values of F0, F1 of the [a]-segments, and F3, plotted as a function of F0. Regression lines shown for each variable and speaker group: men (solid lines), women (broken), boys (dashed), and girls (dotted).

Axes: F0 in Hz (horizontal); F0, F1a, and F3 in Hz (vertical); both logarithmic.

For a 100% increase in F0, F1a increased by

42% for men (r = 0.90),
71% for women (r = 0.92),
95% for boys (r = 0.94),
124% for girls (r = 0.94).
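These percentages can be read as exponents of a power law relating F1a to F0, which is what a straight regression line on doubly logarithmic axes amounts to. The sketch below makes that assumption explicit; the names are illustrative, and the extrapolation to ratios other than 2 holds only as far as the power-law assumption does.

```python
import math

# Increase in F1a (%) for a doubling of F0, per speaker group.
F1A_INCREASE_PER_F0_DOUBLING = {"men": 42, "women": 71, "boys": 95, "girls": 124}

def exponent(group):
    """Power-law exponent k in F1a ~ F0**k (slope on log-log axes)."""
    return math.log2(1.0 + F1A_INCREASE_PER_F0_DOUBLING[group] / 100.0)

def f1a_ratio(group, f0_ratio):
    """Predicted factor by which F1a changes when F0 changes by f0_ratio."""
    return f0_ratio ** exponent(group)

for group in F1A_INCREASE_PER_F0_DOUBLING:
    print(f"{group:6s} k = {exponent(group):.2f}, "
          f"F1a x {f1a_ratio(group, 1.5):.2f} for F0 x 1.5")
```

An exponent above 1, as obtained for girls, means that F1a rises proportionally faster than F0.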

There is a positive correlation between F1 and F0 (large effect) in realizations of the same linguistic strings by speakers who differ in age and/or sex, and by the same speakers who alter their pitch register.

“Intrinsic pitch”: a negative correlation between F1 and F0 (small effect) in vowels produced by a given speaker in the same linguistic and paralinguistic context.

Increases in vocal effort involve simultaneously:

> subglottal pressure ( > SPL, … )
> vocal fold tension ( > F0, … )
> vocal tract openness ( > F1, … )

Recognition of vocal effort

Correlation coefficients of acoustic variables with vocal effort level (VEL)

SPL0                             0.95 (exceptional)
SPLv                             0.98 (exceptional)
(SPLv–SPL0)                      0.90
F0 and F3                        0.87
F0, F3, and Emph                 0.96
F0, F3, Emph, 2log(durV/durC)    0.97 (std. err. of est. 0.64 units)

Whispering [no F0, no spectral emphasis]

F3, F1a, and 2log(durV/durC)     0.90

Fig. 7. Mean durations of vowel-like segments (above) and consonantal segments (below) shown as a function of VEL. Locally weighted least squares regression lines fitted to the data obtained from each speaker group: men (solid lines), women (broken), boys (dashed), and girls (dotted).

Equations for 2log(durV/durC):

Men    2log(durV/durC) = 0.117 0.122 VEL (r = 0.82)
Women  2log(durV/durC) = 0.066 0.149 VEL (r = 0.84)
Boys   2log(durV/durC) = 0.410 0.147 VEL (r = 0.90)
Girls  2log(durV/durC) = 0.382 0.108 VEL (r = 0.71)

Axes: vocal effort level (horizontal); segment duration in ms, logarithmic scale (vertical).

Table VI. Mean values and standard deviations of differences between whispered and voiced versions of the same utterance produced by the same speakers at the same communicational distance (0.3 and 1.5 m). The significance level of the difference between the age groups is also indicated.

        Adults           Children         Sign.
n       23               15
SPLv    17.7 ± 4.5 dB    20.8 ± 2.0 dB    **
SPLs    0.7 ± 2.7 dB     4.6 ± 2.8 dB     ***
F1a     +24 ± 12 %       +26 ± 12 %       n.s.
F3      +5.1 ± 4 %       +3.3 ± 6.3 %     n.s.
durV    +16 ± 17 %       +7 ± 23 %        n.s.
durC    +11 ± 14 %       14 ± 21 %        ***

Table VII. Mean perceived and calculated distances between speaker and addressee for the phonated versions compared with distances calculated using the same equations for the whispered versions. The independent variables were F1a, F3, durV, and durC.

Perc. dist. (m), phonated     .47    .69    1.9    7.5    31
Calc. dist. (m), phonated     .52    .82    2.0    7.7    22
Calc. dist. (m), whispered    2.0    3.3
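The gap between the whispered and phonated rows can be expressed in VEL units by taking the base-2 logarithm of the distance ratio. The sketch below does this for the two distances at which whispering was recorded. It assumes the two whispered values belong to the 0.3 m and 1.5 m versions, as Table I implies; the variable names are illustrative.

```python
import math

# Calculated distances (m) from Table VII, keyed by the real distance (m).
calc_phonated = {0.3: 0.52, 1.5: 0.82}
calc_whispered = {0.3: 2.0, 1.5: 3.3}

# Shift in vocal effort level: 2log of the whispered/phonated distance ratio.
vel_shift = {
    d: math.log2(calc_whispered[d] / calc_phonated[d]) for d in calc_phonated
}

for d, shift in vel_shift.items():
    print(f"real distance {d} m: whispering shifts calculated VEL by {shift:+.2f}")
```

Both shifts come out close to two VEL units, so the same equations place a whispered utterance at roughly four times the distance calculated for its phonated counterpart.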

Fig. 8. The gross difference in spectral energy distribution between whispered and phonated versions of the same utterance produced by men, women, boys, and girls at the same communicational distance (0.3 and 1.5 m), based on level measurements in frequency bands covering 3 critical bands with overlap.

Axes: center frequency in Hz, 100 to 10000, logarithmic scale (horizontal); level difference in dB (vertical).