Discovering the acoustic correlates of phonological contrasts


www.elsevier.com/locate/phonetics

Journal of Phonetics 31 (2003) 351–372

John Coleman

Phonetics Laboratory, University of Oxford, 41 Wellington Square Oxford OX1 2JF, UK

Received 17 September 2002; received in revised form 16 July 2003; accepted 31 July 2003

Abstract

Recently, some researchers have argued that words are stored in the brain as numerous, detailed, variant exemplars. The promoters of this view must demonstrate how phonological contrasts could be inferred from multiple, detailed phonetic exemplars. I describe a signal-processing method that does just that. It is also evident that many subtle aspects of phonological contrasts remain to be properly examined, and more may await discovery. A database of English words, each spoken in five repetitions by a single speaker, was therefore “mined” for further data on the correlates of every phonemic contrast in English, in an attempt to discover local and long-distance effects. Most of the correlates found were local contrasts, consistent with prior studies. The most interesting results concerned longer-distance coarticulatory correlates of two features, [voice] and [anterior]. In many word pairs contrasting in “final consonant voicing”, extensive anticipatory coarticulation was found as early as the word-initial consonants, and often in the vowel at the end of the previous word. In many instances of the [anterior] contrast, acoustic differences were found in the onset of the preceding syllable. It is surprising to find such phenomena in English, and shows some of the limitations of our knowledge of speech. Implications are discussed for segmental phonological theories and exemplar-based models of lexical representation.

© 2003 Elsevier Ltd. All rights reserved.

0095-4470/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.

doi:10.1016/j.wocn.2003.10.001

1. Introduction

Recently, a number of speech researchers have argued that words are stored in the mental lexicon in the form of numerous, detailed, variant exemplars (see e.g., Hooper, 1981; Pisoni, 1997a; Goldinger, 1997; Johnson, 1997; Bybee, 2000). This view is offered as an alternative to the more conventional view that variant tokens of the pronunciation of a word undergo some sort of normalization to a single, clean, prototypical, usually symbolic/phonemic representation prior to lexical access. In support of the exemplar-based alternative, I am struck by the force of supportive evidence from lexical access studies such as Gaskell and Marslen-Wilson (1996) and Démonet, Thierry, and Nespoulous (2002), which seem to demonstrate that phonemic classification follows, rather than precedes, lexical access.

Phonemic classification is, nevertheless, an ability of many speakers, especially alphabetic literates. There is little doubt that unimpaired listeners who know English can determine that dog and dock contrast in both sound and meaning. Consequently, those of us who argue for an exemplar-based view of the lexicon must also demonstrate how phonological contrasts could be derived or inferred on the basis of multiple tokens of words, stored as detailed phonetic representations. In this paper I shall describe a signal-processing method that more-or-less does just that. Although I do not offer it as a psycholinguistic model of phonological categorization, it does provide an “in principle” demonstration of how phonological categories may be determined.

The concept of “meaningful contrast” lies at the heart of every phonological theory. Indeed, the distinction between phonology and phonetics rests on the observation that some phonetic distinctions are meaningful and others are not. For example, consider two tokens of the word lent, one spoken with little or no overlap of nasal opening and oral closure (1a), the other with an interval of simultaneous oral closure and nasal opening (1b):

(1) (a) [lẽʔtʰ]
    (b) [lẽnʔtʰ]
    (c) [lẽːnːt]

Despite the fact that (1a) and (b) are phonetically different, if they are instances of the same word, lent, they will be given the same phonological representation. In a phonemic phonology for example, they might both be represented as /lent/. (1c), representing an instance of the word lend, is also phonetically different from (1a) and (b), in several respects. Phonological theories differ as to which aspects of such a difference are important. In this case, phonemic phonology focuses on the end of the word, and represents (c) as /lend/, making the distinction between final /t/ and /d/ the principal difference. The phonemes /l/, /e/ and /n/ are common to both representations, even though /e/ and /n/ are also phonetically different: these are regarded as predictable contextual variations, contingent on the /t/–/d/ distinction, and thus unimportant. In a (Firthian) Prosodic Analysis of these examples, we might again regard the many differences between lent and lend as facets of a single phonological contrast: let us call it (following Firth & Rogers, 1937; Henderson, 1948) the aspiration difference, represented as the prosodic element h in word-final position. (The difference between (1a) and (b) shows that the precise synchronization of nasality with oral closure seems not to be significant.) Coupled with the fact that nasality is also observed in the vowel, a Firthian analysis, such as (2), might have nasality as a prosody of the word-ending, too:

(2)                              Final nasality absent    Final nasality present
    Final aspiration             letʰ  let                lẽtʰ  lent
    Absence of final aspiration  let   led                lẽt   lend

The exponency (i.e., phonetic realization) of final h is many-faceted, and includes shortness of the vowel, possible pre-glottalization of the stop, possible aspiration of the stop, the nonoccurrence of voicing of the stop (even in carefully enunciated speech), and shortness of the nasal closure, to the point of possible nonoccurrence.


A feature-based analysis, as found in standard generative phonology, has some of the characteristics of both the phonemic and the prosodic analysis. Like the prosodic analysis, the difference between lent and lend is represented as a difference of a single feature in word-final position: [±voice], [±tense], or [±spread glottis], for example, depending on the set of features preferred. Like the phonemic analysis, one phonetic dimension, phonation, is emphasized at the expense of all the other differences.

Although phonemic phonology and feature theory focus on single differences, half a century of research into the articulatory, acoustic and perceptual characteristics of phonological distinctions has been enormously fruitful. Comprehensive surveys such as Olive, Greenwood, and Coleman (1993) and Stevens (1998) may give the newcomer to the field the impression that little further work remains to be done, and that further efforts would be better directed elsewhere: to the study of prosody, perhaps. That conclusion would be premature, however. It has become evident in recent years that many further, subtler aspects of phonological contrasts remain to be properly examined, and more may remain to be discovered. These subtle aspects are small but systematic differences in the phonetic realization of contrasts located some distance away from the vowel or consonant thought (in segmental phonology) to be the principal signifier of the contrast. For example, van Santen, Coleman and Randolph (1992) and van Santen (1997) found that the “final” voicing contrast, as in (1), is sometimes reflected in slight differences of the word-initial consonant. In particular, the /l/ of lend may be slightly longer and darker (slightly more velarized) than that of lent. That is, instead of (1c), we may have [ɫːẽːnːt]. The extension of the contrast to the word-initial consonant raises two particular problems for segmental phonological theories:

(3) (a) Nonlocality: it becomes problematic to talk of a “final” contrast if it extends all the way through a word, from beginning to end.

(b) Arbitrariness and plurality of exponency: some of the aspects of the distinction (e.g. darkness of the initial /l/) have no discernible connection to other aspects (e.g. absence of aspiration and pre-glottalization; length).

Yet the physical reality of these differences and their perceptual availability to listeners have been carefully demonstrated many times (Chen, 1970; Hooper, 1977; Lisker, 1986; van Santen et al., 1992; Hawkins & Nguyen, in press a, 2004).

A second such example (also involving /l/, as it happens) was first noted by Kelly and Local (1986). They observed impressionistically, through listening and through examination of spectrograms, that the distinction between /l/ and /ɹ/ in minimal pairs such as Telly vs. Terry exhibits subtle phonetic differences in the vowels and consonants of those entire words and neighboring unstressed syllables. These impressions were subsequently experimentally confirmed by Hawkins and Slater (1994), West (1999, 2000a,b), and Tunley (1999). Hawkins and Slater (1994) and Tunley (1999) showed that modeling such subtle details of the /l/–/ɹ/ distinction in synthetic speech improves its intelligibility, and presumably,¹ its perceived naturalness. West (2000a) established the perceptual availability of the nonlocal differences. (I use the term “availability”, rather than “salience”, because these differences are not very salient at all: the local correlates of the /l/–/ɹ/ distinction are much more salient. The availability of the nonlocal cues to the difference can be demonstrated by replacing the main difference (the /l/ or /ɹ/) with noise; or, as in Hawkins and Nguyen (in press a), by cross-splicing them.) West (1999, 2000b) established the articulatory and acoustic characteristics of the local and nonlocal differences between /l/ and /ɹ/. Heid and Hawkins (2000) showed that a nonlocal difference can be found in one dimension (the fourth formant frequency) within and beyond neighboring stressed syllables, up to five syllables away, 0.5–1 s before the lingual constriction of the conditioning /ɹ/ or /l/. For this contrast, at least, the coarticulatory influence of /l/ or /ɹ/ is not limited to the segmental context, but extends far beyond. (This is a clear counterexample to the hypothesis advanced by Fowler and Saltzman (1993) that the rise time of an articulatory gesture is typically 200–250 ms.)

¹ Although intelligibility and naturalness are sometimes regarded as independent dimensions in speech technology research, I concur with Klatt (1987) that “synthesis that is a better match to observed natural data has always sounded better and has been measurably more intelligible”. In considering the relationship between naturalness and intelligibility, Pisoni (1997b) notes that familiarity with the personal, “indexical” features of a speaker’s voice affects its intelligibility and may contribute to more efficient encoding of the message in memory. Hawkins and Slater (1994) report a statistic from Pratt (1986) that “natural speech is about 15% less intelligible at 0 dB s/n than in quiet, whereas synthetic speech drops by 35%–50% … We conjecture that the fragility of synthetic speech in noise is related to its unnatural quality.”

The studies just mentioned consider only a few features: [voice] and [lateral] (or possibly [back] or [round], according to West, 2000b). Why have these features been singled out for such rigorous scrutiny? There is no good reason: in both cases, the discovery of the nonlocal phonetic differences was just serendipitous. To those of us interested in the phonetics–phonology interface, this is worrying. What other surprises lie in store, when we reexamine other contrasts with equal thoroughness? Furthermore, why were the nonlocal differences not noticed much earlier? This question prompts at least two reasonable answers: (i) because we did not look for them, or (ii) because they are rare: maybe no other features have nonlocal exponents.

It is clearly time to set these concerns to rest, by undertaking a comprehensive reexamination of phonological contrasts, paying attention to the problems of nonlocality, arbitrariness and plurality of exponency described in (3). In particular, in view of the arbitrariness and plurality of exponency, we must not restrict our attention to any particular phonetic parameters on an a priori basis. (This is often a source of problems in the interpretation of results in experimental phonetics, anyway. For example, Lahiri and Hankamer (1988) claimed to find that the critical cue to gemination in stops is the segmental duration of the closure, in Turkish and Bengali, even though they only examined two other possible cues to the contrast (VOT and vowel duration), both of which are durational measures. They did not examine source or spectral features at all.) This means that we cannot single out specific phonetic parameters in our study. We must attempt to work with complete phonetic representations, and employ analytical methods to identify which properties or dimensions are employed in the realization of a contrast. The question of what constitutes a complete phonetic representation will be addressed in Section 2. Second, in order to address the problem of nonlocality, we must not restrict our attention to single segments and their immediate context. In reexamining the differences between, say, ship and sip, we must entertain the possibility that differences may be found anywhere in the word and perhaps even in neighboring words. (For the sake of tractability, in this study we shall examine one unstressed syllable either side of the word of interest.)

2. Materials

The speech data examined in this study were drawn from a phonologically rich database of 1066 monosyllabic word types spoken by one male Southern British English speaker in a controlled sentence frame (see Slater and Coleman (1996) for details). Consonant-initial monosyllables were preceded by “Can you utter”, and vowel-initial monosyllables were preceded by “Have you uttered”. Consonant-final monosyllables were followed by “again please?” and vowel-final monosyllables were followed by “today please?” Recordings were made using Digital Audio Tape, and were subsequently transferred to a computer at a sampling rate of 16 kHz, at a resolution of 16 bits. Extracts of each sentence were taken from the beginning of the /t/-closure in “utter(ed)” to the burst of the medial stop in “today” or “again”. The phonotactic structure of all extracts is thus /t ə C V C ə C/. This database contains a little over 600 minimal pairs, of which 108 have been analyzed so far. Since there is only one speaker, there is no between-speaker variability, and the relevance of the results of this study for other speakers remains to be determined. However, the original purpose of the database was to obtain a relatively small but rich and highly controlled repository of recordings of monosyllables. It is thus eminently suitable for a “data mining” study such as the present work.

The selected extracts were analyzed into 20 acoustic parameters, using a high-quality linear prediction method, as follows. The filter parameters were determined using the ESPS refcof routine (Entropics Corporation, Washington, DC) to derive 15 reflection coefficients² over a 10 ms rectangular window, with a window overlap of 5 ms. The ESPS transpec and get_resid routines were used to translate them into 15 autoregressive filter coefficients (a1–a15) and obtain a residual signal. The speech signal was also analyzed using the ESPS get_f0 routine and additional software developed in-house to derive five source parameters encoding the voicing and noise excitation components. The voice source parameters, f0 and AV (amplitude of voicing), are control parameters for the voicing model used in the original version of the Klatt formant synthesizer (Klatt, 1980). The noise component of the linear prediction residual was filtered into a high and a low frequency band. The ratio of the RMS amplitudes of the noise in each of these two bands, which I call tilt, when suitably scaled, is a good indicator of the amplitude of friction that is robust for both voiced and voiceless friction. The signal analysis method is illustrated schematically in Fig. 1, and the 20 parameters derived in this way are listed in Table 1.

² The use of a 15th-order filter was determined experimentally. A referee observed that for data sampled at 16 kHz it might be expected to use an 18 pole filter, according to the rule-of-thumb of 2 poles per kHz (up to the Nyquist frequency) + 2 for the spectral tilt. Although this rule-of-thumb works well below about 12 kHz, practical experience shows that it is not necessary to add so many extra poles with higher sampling rates, as there is little linguistically significant information in the signal above 6 kHz. Impressionistically, analysis–resynthesis using 18 poles is not noticeably better than with 15 poles.

Table 1
Acoustic analysis parameters employed in this study

No.  Parameter  Description
1    f0         Fundamental frequency estimate (Hz)
2    p(voice)   ESPS “probability” of voicing estimate (0 or 1)
3    RMS        RMS amplitude (based on a 30 ms Hanning window)
4    AC peak    Peak normalized cross-correlation value found to determine f0
5    tilt       Ratio of high- to low-energy (arbitrary units)
6    a1         First autoregressive filter coefficient: energy at 8 kHz
7    a2         Energy at multiples of 5.3 kHz
8    a3         Energy at multiples of 4 kHz
9    a4         Energy at multiples of 3.2 kHz
10   a5         Energy at multiples of 2.7 kHz
11   a6         Energy at multiples of 2.3 kHz
12   a7         Energy at multiples of 2 kHz
13   a8         Energy at multiples of 1.8 kHz
14   a9         Energy at multiples of 1.6 kHz
15   a10        Energy at multiples of 1.5 kHz
16   a11        Energy at multiples of 1.3 kHz
17   a12        Energy at multiples of 1.2 kHz
18   a13        Energy at multiples of 1.1 kHz
19   a14        Energy at multiples of 1.06 kHz
20   a15        Energy at multiples of 1 kHz

Fig. 1. Method by which spoken utterances were encoded into acoustic parameters. [Flow diagram; box labels: Speech in; 15th-order linear prediction analysis (10 ms window); 15 filter coefficients, updated every 5 ms; Linear prediction residual (error signal); Voicing and f0 analysis; F0 and AV parameters for Klatt synthesizer; Ratio of high- to low-frequency energy; tilt, an estimate of the amplitude of noise.]

These 20 parameters can be used to reconstruct a synthetic copy of very high fidelity to the original speech using a combination of several pieces of software. First, the AV and f0 parameters alone may be submitted to the voice source part of the Klatt formant synthesizer to generate a good synthetic estimate of the original voice source. Second, the tilt (amplitude of noise) parameter may be used to scale Gaussian noise, in order to generate the noise component of the source. When the voice and noise sources are suitably scaled and mixed (i.e., added together), they may be used in conjunction with the filter coefficients as an excitation signal for linear prediction synthesis (ESPS lp_syn). Although no formal evaluation has been made of the quality of the resulting synthetic speech in this experiment, my experience of a variety of synthesis methods leads me to the opinion that this is a reliable and workable method of making quite natural-sounding synthetic encodings of natural speech.

There are five tokens of each word in the monosyllable database. One hundred and eight word pairs were selected for analysis in accordance with the goal of considering several instances of each phonological contrast, i.e., every distinctive feature, in three syllable positions (initial, medial and final). “Initial” syllable position is synonymous with “onset” and “medial” with “nucleus”. Note, however, that word-final consonants in the monosyllable database are not generally syllable codas, as they are followed by the vowel /ə/. According to the most widely cited theory of syllabification, therefore, they should also be considered as syllable onsets. (There are only a few exceptions to this: the word-final /ŋ/ in e.g. rang is generally considered to be a syllable coda, not an onset.) The set of words is not balanced: words containing sonorants, and words illustrating the [voice] contrast, are deliberately over-represented, in view of the prior work on nonlocal correlates of [voice] and the liquids cited in Section 1. The other distinctive features are exemplified by a few word-pairs each, in an attempt to survey the full range of lexical contrasts in English.

A complete listing of the word-pairs examined so far, and the phonological contrasts that they exemplify, is given in Table 2.

In addition, three control contrasts were examined:

(i) pat vs. put, a categorial difference between two vowels that are not minimally distinct, being low vs. high, i.e., [–high, +low] vs. [+high, –low].

(ii) lap vs. Lapp. These words are supposed to be homophones, so we hope to find no significant differences between their pronunciations.

(iii) lap vs. wrap, using the same recordings as those examined by West (2000b), experiment three.
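The filter analysis described in this section (15 reflection coefficients per 10 ms rectangular window, updated every 5 ms, at 16 kHz) can be illustrated with a minimal sketch. It uses the textbook autocorrelation method with the Levinson-Durbin recursion; whether ESPS refcof was run with this particular method is not stated in the text, so the method, the sign convention, and the function names `frame_signal`, `levinson_durbin` and `lpc_frames` are assumptions made purely for illustration.

```python
import numpy as np

def frame_signal(x, fs=16000, win_ms=10.0, hop_ms=5.0):
    """Cut x into rectangular frames: 10 ms long, advanced by 5 ms."""
    win = int(round(fs * win_ms / 1000.0))   # 160 samples at 16 kHz
    hop = int(round(fs * hop_ms / 1000.0))   # 80 samples at 16 kHz
    if len(x) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop: i * hop + win] for i in range(n_frames)])

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns (a, k, err) where a[0] == 1 and A(z) = sum_j a[j] z^-j is the
    inverse filter, k holds the reflection coefficients (sign conventions
    differ between toolkits), and err is the final prediction error.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = float(r[0])
    for i in range(1, order + 1):
        if err <= 0.0:                        # degenerate frame: stop early
            break
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        ki = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + ki * prev[i - j]
        a[i] = ki
        k[i - 1] = ki
        err *= 1.0 - ki * ki
    return a, k, err

def lpc_frames(x, fs=16000, order=15):
    """Return an (n_frames, order) array of autoregressive coefficients."""
    frames = frame_signal(np.asarray(x, dtype=float), fs)
    coeffs = np.zeros((len(frames), order))
    for t, frame in enumerate(frames):
        n = len(frame)
        r = np.array([np.dot(frame[:n - lag], frame[lag:])
                      for lag in range(order + 1)])
        if r[0] > 0.0:                        # skip all-zero (silent) frames
            coeffs[t] = levinson_durbin(r, order)[0][1:]
    return coeffs
```

For a one-second signal at 16 kHz, `lpc_frames` yields 199 frames of 15 coefficients each, which corresponds to rows 6–20 of Table 1; the five source parameters would need separate pitch and noise analysis that is not sketched here.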

Table 2
Word pairs and phonological contrasts examined in this study

Vowel contrasts

Feature         Contrast  Word pairs
[back]          a/ʌ       brash/brush, thrash/thrush, grab/grub, clang/clung, drank/drunk, rang/rung, shrank/shrunk, spank/spunk, sprang/sprung, stank/stunk
                əː/eə     blur/blare, fur/fair, myrrh/mare, purr/pair
[back, round]   ɒ/e       choc/check, shod/shed, grog/Greg, vox/vex
                iː/uː     G/Jew, flee/flew, ghee/goo
                iə/ʊə     dear/dour, tear/tour
[high]          əʊ/əː     O/err
                e/i       crept/crypt, head/hid
[long]          ɒ/ɔː      shod/shored
[low]           a/e       at/ate, thrash/thresh
                əː/ɑː     err/ah
[low, back]     ɑi/ei     Guy/gay
[low, round]    ɒ/ʌ       golf/gulf
                ɑː/ɔː     ah/awe
[round]         əː/ɔː     err/awe, spur/spore

Consonant contrasts

Feature         Contrast  Word pairs
[anterior]      s/ʃ       said/shed, sear/shear, sift/shift, sigh/shy, sire/shire, suit/shoot
                d/dʒ      D/G, day/J, dear/jeer, dough/Jo, aid/age
                t/tʃ      tare/chair, tear/cheer, tor/chore, eight/H
[back]          h/ʃ       hair/share
[continuant]    ʃ/tʃ      bush/butch
[coronal]       ð/v       thee/V
                l/w       Glen/Gwen
[labial]        b/g       B/ghee
[lateral]       ð/l       they/lay
                j/l       yaw/law
                l/ɹ       blush/brush, cloud/crowd, clue/crew, flay/fray, flee/free, flesh/fresh, flogs/frogs, flow/fro, fly/fry, glade/grade
[nasal]         b/m       B/me
                l/n       belled/bend
[nasal, round]  m/w       smell/swell
[voice]         ð/θ       loathe/loath, mouth(V)/mouth(N), sheathe/sheath, soothe/sooth, thy/thigh
                b/p       pub/pup, robe/rope, tab/tap, tribe/tripe
                bd/pt     cribbed/crypt, mobbed/mopped, webbed/wept
                d/t       D/tea, add/at, aid/eight, bleed/bleat, brood/brute, cloud/clout, dwelled/dwelt, glowed/gloat, grade/great, hide/height, learned/learnt, lent/lend, lied/light, made/mate, mend/meant, nude/newt, played/plate, plead/pleat, plod/plot, slide/sleight, smelled/smelt
                dʒ/tʃ     liege/leech, splodge/splotch
                dz/ts     leads/lets
                g/k       league/leak
                s/z       less/Les

3. Discovering the acoustic correlates of phonological contrasts

Note that the title of this paper is “discovering the acoustic correlates of phonological contrasts”: only the acoustic correlates, not the contrasts, are discovered by the method described in this section. The essence of the algorithm is this: given five tokens of one word (e.g., pit) and five tokens of another (e.g., bit), determine for which points in time and for which acoustic parameters the two groups of five are significantly different. With five tokens of each word in hand, it is easy to calculate the mean and standard error of each parameter at each time frame. We then determine the 95% confidence intervals of the distributions of the five data points taken from each word for each parameter: where the confidence intervals from the two words are disjoint, we are almost certain that there is a significant acoustic difference between the words.

However, for this method to work at all, it is first necessary to precisely align in time all 10 tokens of a given word pair. To understand the reason for this, consider the contrast between bet and bed. As is well established, the vowel of bed is usually about 1½ times as long as that in bet. Therefore, the algorithm just described would easily discover that the latter quarter of the vowel of bed is significantly different from the stop closure portion of bet. But this is not very informative: it would be preferable to find out how the end of the vowel of bet may differ from the vowel of bed, and how their final consonants differ. As the old saying has it, it is necessary to compare apples with apples.

Therefore, the parameters of the 10 tokens in a given word pair were each warped against all the others, using dynamic time warping software developed in-house (Slater & Coleman, 1996). From this, the overall warp distance between each pair of tokens is obtained. The sum of the warp distances between each token and the other nine provides a measure of how far that token is from the remainder. The token with the lowest sum of warp distances from the other nine tokens was taken to be a centroid token for the word pair, and the remaining nine tokens were then warped to that centroid.

Following that, the 95% confidence intervals for the five repetitions of every parameter at each 5 ms frame of both words in a pair were calculated from the mean, the standard error, and the t-distribution (cf. e.g., Butler, 1985, p. 61):

(4) confidence interval = mean ± t × standard error
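As a worked illustration of (4), the sketch below computes the interval for five tokens with the critical value t = 2.776 (two-tailed, df = 4) and then applies the disjointness criterion adopted in this section. The function names and the array layout (tokens on the first axis, optionally followed by frames and parameters) are assumptions made for the example, not the author's software.

```python
import numpy as np

T_CRIT = 2.776  # two-tailed critical t for alpha = 0.05 with df = 4

def confidence_interval(tokens, t_crit=T_CRIT):
    """Return (lower, upper) limits of mean +/- t * standard error.

    tokens has shape (n_tokens, ...); the interval is computed over the
    first axis, so any trailing frame/parameter axes are preserved.
    """
    x = np.asarray(tokens, dtype=float)
    mean = x.mean(axis=0)
    sem = x.std(axis=0, ddof=1) / np.sqrt(x.shape[0])  # standard error
    return mean - t_crit * sem, mean + t_crit * sem

def disjoint(tokens_a, tokens_b):
    """True wherever the two words' confidence intervals do not overlap."""
    lo_a, hi_a = confidence_interval(tokens_a)
    lo_b, hi_b = confidence_interval(tokens_b)
    return (lo_a > hi_b) | (lo_b > hi_a)

# Hypothetical values of one parameter at one frame, five tokens per word.
word_a = [1.0, 1.1, 0.9, 1.0, 1.0]
word_b = [2.0, 2.1, 1.9, 2.0, 2.0]
print(confidence_interval(word_a))   # roughly (0.912, 1.088)
print(disjoint(word_a, word_b))      # True: the intervals are disjoint
```

Passing arrays of shape (5, n_frames, 20) returns a boolean mask of shape (n_frames, 20), one entry per frame and parameter, which is the form in which differences are counted in the results.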

A critical value of t = 2.776 was employed, for a confidence threshold of α = 0.05 (two-tailed, df = 4). Note that the t distribution is intrinsically conservative for small samples. When multiple frame-by-frame comparisons are made in this way, it is statistically respectable to apply a Bonferroni correction, to take account of the greater likelihood of false positives arising from repeated comparisons. However, West (2000b) found that to do so reduces the resolution of the technique beyond use: that is, only the obvious local differences are found, and nonlocal differences are lost. Therefore, rather than use a Bonferroni correction, I adopted an alternative strategy: only considering frames where the confidence intervals of the two groups of word tokens were disjoint, i.e., where the lower confidence limit of one word’s tokens is greater than the upper confidence limit of the other word’s tokens. Note that this is a much stricter constraint than the usual requirement that the mean of one sample lies outside the confidence limits of the other. There will still, inevitably, be some false positives. The second control condition (lap vs. Lapp) is intended to show the magnitude of that problem. More importantly, however, it remains necessary to go back to the original recordings and check the discovered contrasts on a case-by-case basis. This “quality control” process led to various apparent contrasts being called into question: they are therefore not reported below. All the contrasts that are reported here have been verified by acoustic analysis of the original audio files, case by case.
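The alignment step described earlier in this section (warp every token against every other, take the token with the lowest summed warp distance as the centroid, then warp the rest to it) can be sketched as follows. The in-house warping software is not described in the text, so this uses textbook dynamic time warping with a Euclidean local cost; the step pattern, the averaging used in `warp_to`, and all function names are assumptions.

```python
import numpy as np

def dtw(a, b):
    """Align frame sequences a (n, p) and b (m, p).

    Returns (distance, path), where path lists (i, j) frame pairings.
    """
    n, m = len(a), len(b)
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = local[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return cost[n, m], path

def centroid_index(tokens):
    """Index of the token with the lowest summed warp distance to the rest."""
    k = len(tokens)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw(tokens[i], tokens[j])[0]
    return int(np.argmin(dist.sum(axis=1)))

def warp_to(reference, token):
    """Map token onto the reference time axis, averaging frames that
    are paired with the same reference frame."""
    _, path = dtw(reference, token)
    out = np.zeros(reference.shape, dtype=float)
    counts = np.zeros(len(reference))
    for i, j in path:
        out[i] += token[j]
        counts[i] += 1
    return out / counts[:, None]
```

After `warp_to` has been applied to the nine non-centroid tokens, all 10 tokens share the centroid's frame count, so the two groups of five can be stacked and compared frame by frame.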

4. Results

4.1. Control comparisons

1. pat vs. put (a categorial difference between two vowels that are not minimally distinct). Differences were found in all 15 spectral parameters. The differences were located in the burst and aspiration portion of the /p/, in the vowel, including the transitions out of /p/ and into /t/, and in the following schwa. Fig. 2 illustrates the significant differences in parameter a3 (energy at multiples of 4 kHz).

2. lap vs. Lapp. These words are expected to be homophones, so we would hope to find no significant differences between their pronunciations. In fact, 34 idiosyncratic differences were found, in 11 out of the 20 parameters, in 23 different frames. This provides some indication of the number of statistically significant differences that may be phonologically insignificant: the mean is 1.48 differences per frame, in frames where a difference was found, and the maximum (in this case) is 4 differences per frame. Thus, differences in one to four parameters at a single frame should be evaluated cautiously.


3. lap vs. wrap, using the same recordings as those examined by West (2000b), experiment three. For these recordings, as West has described, the subject was wearing an electropalatography palate, which slightly interfered with his natural speech. Electromagnetic mid-sagittal articulography data of the same sentences were obtained from the same speaker, again while wearing an EPG palate. The words lap and wrap were spoken in the carrier frame “Have you uttered a ___ at home?” Acoustic analysis of those sound recordings is consistent with the EPG and EMA data in showing a number of significant differences in the pronunciation of the /l/ and /ɹ/, the preceding and following /ə/ (i.e., the preceding unstressed a and the unstressed vowel in at), and in the /t/, /ə/ and /d/ of the second syllable of uttered. In short, West showed that the phonological contrast between /l/ and /ɹ/ was distributed over a long portion of the sentence, “(u)ttered a lap/wrap a(t)”. (The u and t are in parentheses, as West found no significant differences there.)

My analysis also found many differences in the acoustic parameters of lap and wrap, though they were not as extensive as the differences found by West. This is likely to be due to the fact that West sought differences between two sets of words differing in initial /l/ vs. /ɹ/, using a multivariate general linear model, whereas the analysis described here is conducted on just five tokens of each of the single words lap and wrap.

In my analysis, significant differences were found in the liquids, the preceding and following vowels, and in the burst and aspiration of the final /p/. No differences were found in the words uttered and at. This shows that the technique described in this paper is not as powerful as careful manual analysis of a set of contrasting words. Though this is slightly disappointing, it can also be viewed more positively as a rather conservative analysis method. Any differences that it does reveal are likely to be quite easy to perceive, compared to the more subtle perceptual differences found by West or Heid and Hawkins.

Fig. 2. Above: 95% confidence intervals of variation in a3 (multiples of 4 kHz). Red solid lines: pat; blue dashed lines: put; black dotted line: portions in which the confidence intervals are disjoint. Below: waveform of the centroid, “yter pat agy”. The time scale of both panels is that of the centroid. (Axes: time in seconds, 0 to 0.5; arbitrary units.)
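The per-frame summary reported for the lap/Lapp control comparison (34 differences spread over 23 frames, a mean of about 1.48 per affected frame) is simple to tabulate. The following is a minimal sketch: the function name `summarise_hits` and the toy input are hypothetical, and the input is assumed to be a list of (parameter, frame) pairs at which the confidence intervals were disjoint.

```python
from collections import Counter

def summarise_hits(hits):
    """Summarise significant differences per frame.

    hits: iterable of (parameter_name, frame_index) pairs, one per
    parameter/frame combination where a difference was found.
    """
    hits = list(hits)  # allow generators, since the data are scanned twice
    per_frame = Counter(frame for _, frame in hits)
    n_frames = len(per_frame)
    total = sum(per_frame.values())
    return {
        "total": total,
        "frames": n_frames,
        "parameters": len({param for param, _ in hits}),
        "mean_per_frame": total / n_frames if n_frames else 0.0,
        "max_per_frame": max(per_frame.values(), default=0),
    }

# Toy input (made up): three parameters, differences at frames 10, 11 and 40.
toy_hits = [("a1", 10), ("a3", 10), ("a1", 11), ("rms", 40)]
summary = summarise_hits(toy_hits)

# The arithmetic reported for lap/Lapp: 34 differences in 23 frames.
lap_lapp_mean = round(34 / 23, 2)
```

A summary of this kind, computed on homophone pairs, gives a rough baseline against which the number of differences found at a single frame in a genuine minimal pair can be judged.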

4.2. Local contrasts

The algorithm proved to be extremely effective at locating differences between minimal pairs. For the most part, numerous acoustic differences were found for all of the minimal pairs examined: a summary of the local correlates of each contrast is given in Appendix A. It can be seen from these data that (i) the algorithm works as expected, and (ii) the differences are consistent with the prior literature on acoustic phonetics. Appendix A includes “intrinsic” segmental differences and local coarticulation, i.e., coarticulation between adjacent segments. The main part of this paper focuses on nonlocal contrasts.

4.3. Nonlocal contrasts: consonants

Two features in particular were found to be distinguished in some word pairs by rather distributed, nonlocal contrasts: [voice] (especially in word-final position) and [anterior].

4.3.1. [voice]

Final ð/θ in mouthV/mouthN differed from loath/loathe in manifesting differences in /aʊ/ and the preceding /ə/, but not the /m/. There were extensive spectral differences in the preceding /ə/, around 1.3–1.5 and 3.2–8 kHz, as well as in the tilt parameter, a measure of the balance of high-frequency to low-frequency energy. The spectrum of /ə/ has more energy before mouthV than before mouthN, especially above 4 kHz, indicating that /ə/ is breathier before mouthN than mouthV, perhaps in anticipation of the voicelessness of the final /θ/.

Similarly, though less distinctly, spectral differences were found in the /ʃi/ portion of the sheathe/sheath contrast.

Final b/p: In pub/pup, the aspiration of the final /p/ (manifest in the f0, p(voice) and RMS parameters) contrasts with its absence for final /b/. Additionally, an unexpected difference was found in the spectrum of the initial /p/ aspiration: peak F1 is lower in pup (mean of 983 Hz, vs. mean of 1071 Hz in pub). As a consequence, a15 (1 kHz) amplitude is lower for pub than pup.

From a strictly segmental perspective, this is a nonlocal difference, because of the intervening vowel, though as the aspiration portion is co-produced with the vowel, the spectral differences observed in the aspiration might perhaps be attributed to the vowel. Impressionistically, the vowel sounds a little breathier in pup than pub, perhaps because it is shorter, and voicing is not so strongly established.

In robe/rope, local b/p differences are manifest in the source parameters. In addition, broadband differences in the amplitude of the aspiration of the /t/ of utter, the word preceding robe/rope, were discovered. The explanation for this distant difference is, however, obscure. No such difference was found in tab/tap (see Appendix A).

Final d/t: In addition to many very local distinctions around the release of the final stop and beginning of the following vowel in learned/learnt, anticipatory differences were found in the preceding /n/ (energy at multiples of 1.06 kHz) and in the word-initial /l/ (energy at multiples of 1.3 kHz). Similarly, in newt/nude, differences were found in the preceding /n/ and /ju/. In slide/sleight, various small differences were found dotted about the preceding /slɑi/ portion (e.g. spectral differences in the /s/ at 5.3 kHz and multiples, and in /l/ at 1.2 kHz and multiples). In smelled/smelt, the early portion /təs–e/ (but not the /m/ or /l/) showed numerous small differences (e.g. the /s/'s differed in the 2.3–2.7 kHz range). All of these longer-distance differences were subtle and localized to one or a few frames, though. Similarly, in final dʒ/tʃ in liege/leech, differences were found in the preceding /liː/ and following /ə/ (Fig. 3).

Taken together, these results are consistent with, and supportive of, earlier findings that the phonetic properties of the word-final voicing contrast may be manifest in word-initial sonorants and even in the schwa of the preceding word.

4.3.2. [anterior]

Fig. 3. A long-distance difference in liege vs. leech, parameter a8 (multiples of 1.8 kHz). The earliest difference, shown by a peak in the black dotted line (marked by an arrow), lies in the center of the word-initial [l].

Probably the most noteworthy result to come out of this study is the finding of properties of the [anterior] contrast (s/ʃ, t/tʃ and d/dʒ) in word-initial position distributed across the preceding vowel and manifest on the earlier consonant /t/. Four word-pairs with initial s/ʃ show this effect, to different degrees: said/shed, sear/shear, sift/shift and sigh/shy.

Many spectral differences were found between said and shed. Differences between /s/ and /ʃ/ and between the preceding /ə/ qualities of the two words were found in many of the spectral parameters. There were also significant differences on three parameters at the preceding /t/–/ə/ boundary. As shown in Fig. 4, the distinction between /s/ and /ʃ/ is also evident to a reduced extent during the preceding /ə/ (though the confidence limits are not disjoint), and more evident at the /t/–/ə/ boundary. The direction of the difference is consistent with the /s/–/ʃ/ difference, suggesting that the friction of the /t/ release is more anterior (/s/-like) before an upcoming /s/ and more posterior (/ʃ/-like) before an upcoming /ʃ/.

In sear/shear, the pattern was similar: /s/ and /ʃ/ were clearly distinct in many parameters, the contrast extending to the preceding /ə/ (especially towards the end), the transition from /ə/ to /s/ or /ʃ/, the transition into /i/, through to the middle of /iə/. The earliest manifestation of the contrast is in the preceding /t/ aspiration, in energy at multiples of 5.3 kHz.

A similar difference in the preceding /t/ aspiration and burst was found in sift/shift (as well as differences in the /s/ or /ʃ/, the preceding /ə/ and the start of the following /i/). The difference in the /t/ aspiration is evident in the spectra around F4: before /s/, F4 ≈ 3850 Hz, whereas before /ʃ/, F4 ≈ 3674 Hz. (These estimates of F4 frequency are simple averages, calculated by hand from LPC spectra of the /t/ aspiration portion.)

The sigh/shy contrast was also evident in the preceding /t/ aspiration, the preceding /ə/, the /s/ or /ʃ/, and the start of the following /ɑi/. The preceding /t/ aspiration has more energy in the 2–3 kHz range (the spectral dip between F2 and F3) before /ʃ/ than before /s/.

The same kind of effect was found in dough/Jo and dear/jeer, though in the latter case a perseverative difference was found on the following /t/. In dough/Jo, differences were found throughout /tʰədʒəʊ/. The initial /t/ aspiration was stronger before Jo than dough (i.e., tilt was higher, and there was more energy around 2.3 kHz). It is unclear to me why there would be a link between stronger aspiration and upcoming [ʒ].

In dear/jeer, differences were found relating to the absence vs. presence of [ʒ], in the preceding /ə/, and at the start of the following /i/. In addition, there were spectral differences in the following /t/ burst, which had more energy below 5 kHz after jeer than dear.

Fig. 4. Local and long-distance differences in said vs. shed, parameter a1 (8 kHz). The long-distance difference is marked by an arrow.

5. Discussion

The method described in Section 3 is extremely successful at revealing the acoustic correlates of phonological contrasts, with two caveats. First, the incidence of false positives arising in the analysis of homophones such as lap/Lapp deserves more systematic attention. Unfortunately, we failed to foresee a need for inclusion of a variety of homophones when the database was originally collected, with the result that this weakness remains to be addressed in future work. At present, though, it is apparent that some of the statistically significant differences that are found might not be linguistically significant. They might even be imperceptible, in some cases. This fact made it necessary to check all of the interesting contrasts in the usual way, i.e., by comparison of spectra at specific points in the signals, coupled with impressionistic listening to check whether the differences were audible and worth reporting. Even with this proviso, the fact remains that we examined the speech of only a single speaker, so the generality or otherwise of our results to the English-speaking population also needs to be considered in future work. The fact that the results presented here are consistent with earlier investigations into nonlocal correlates of various phonological contrasts provides grounds for optimism that these results are not untypical of English.

The second caveat is that the LPC filter coefficients are somewhat unintuitive. To an extent this is a question of familiarity (even standard wide-band spectrograms seem exotic when first encountered!), but it may be worthwhile examining the use of other, more intuitive parameters. In particular, it may be profitable to explore the use of parameter sets that take account of properties of the perceptual system, such as PLP (perceptual linear prediction) parameters (Hermansky, 1990). However, the additional processing steps employed in deriving such parameters make them difficult to invert, i.e., to reconstruct re-synthesized versions of the time-warped recordings. In the experiments described in this paper, the easy invertibility of conventional linear prediction analysis makes it easy to check that the time warping does not distort the recordings in an unnatural way. Furthermore, invertibility of the parameters enables us to claim (as in Sections 1 and 2) that the parameterization employed constitutes a more-or-less complete encoding of the original speech.

The present data agree with the literature on local phonetic correlates of phonological contrasts, and confirm recent findings of the long-domain correlates of the /l/–/ɹ/ contrast and the [voice] contrast in coda position. The effects of these long-domain phenomena were of shorter extent in my data than in those of West (2000b) and Heid and Hawkins (2000), probably because the method used here is less sensitive than manual analysis of particular formant frequencies. The present data and the technique employed to obtain them are nevertheless valuable in identifying differences over shorter and longer domains, and in analyzing local and long-domain effects in the same study. The algorithm found local differences in all cases, i.e., intrinsic segmental differences and coarticulatory effects on the preceding and following segments. For identifying nonlocal differences, the experiment was less successful. Some aspects of the realisation of [voice] on earlier consonants were discerned, and a new instance of consonant–consonant coarticulation involving the [anterior] feature was established.³ “Phonetic influence”, rather than phonological (e.g., autosegmental) feature spreading, is the appropriate metaphor here: note that the anticipation of an upcoming [–anterior] consonant such as /ʃ/, evident in the aspiration of the preceding /t/, does not turn that /t/ into a /tʃ/, as phonological feature spreading would imply. The effect of a vowel or consonant on preceding and following vowels and consonants is manifest as within-phoneme variation. In this respect, the phenomenon is very similar to descriptions of [anterior] consonant harmony (sometimes called “sibilant harmony”) in the indigenous American languages Chumash (e.g., Beeler, 1970; Poser, 1982; Shaw, 1991), Tahltan (Shaw, 1991), Navajo (Kari, 1976), Chiricahua (cited by Mithun, 1999, p. 361) and the African language Zayse (Clements, 2000, p. 128). There has been some debate as to whether Chumash anterior harmony should be analyzed as the spreading of the phonological feature [anterior], as it may be a more gradient, noncategorical phenomenon (Russell, 1993, pp. 146–150), rather like the English examples presented here. When such data are taken together with more familiar examples of long-domain phonetic correlates of phonological contrasts (such as vowel harmony and ordinary assimilation), it becomes apparent that phonological contrasts are not in general associated with segment-sized stretches of speech. On the contrary, even ordinary “phonemic” contrasts are phonetically realized by a combination of short-time and more extended phonetic correlates. Some of the short-time correlates might well be termed “sub-segmental”, in fact, for example, the voice onset time difference of aspirated vs. unaspirated stops, or the frication portion of affricates, considered in opposition to stops. The traditional view of speech timing sees a roughly one-to-one mapping between phonemes and relatively short-time segments of speech; this study (and those from which it grew) suggests an alternative, in which phonological features associated with specific places in structure (such as syllable and word positions) are realized in a distributed fashion as a complex of phonetic properties, some of which may be rather short, some may be roughly segment-sized, and some may be quite extended, over several syllables.

This nonsegmental view opens up some new avenues for speech production research (e.g., West, 1999). In particular, it is natural to ask whether there is a correlation between the longer-domain features and more sluggish articulators. For instance, West found that the long-domain acoustic correlates of the /l/–/ɹ/ contrast are due to extended differences of tongue dorsum and lip position. This is consistent with the findings of, e.g., Kelly and Local (1986) and Heid and Hawkins (2000). On the other hand, the long-domain correlates of the [voice] and [anterior] contrasts cannot easily be attributed to the inherent sluggishness of the articulators. The principal acoustic correlate of [voice], voice onset time, depends on rather fine-grained control of the timing of quite small and fast-moving articulatory structures in the larynx. The [anterior] contrast employs fine differences in tongue-tip position, and hence involves an articulator that can move very quickly. Ladefoged (2001, p. 169) gives an example of very rapid voluntary tongue-tip movements in speech at a rate approaching 7 closures per second. In view of such considerations, it seems unlikely to me that long-domain aspects of the [anterior] contrast can be explained solely or primarily in terms of kinetic constraints on the tongue tip. Another possibility, however, is that the extent of anticipation of [–anterior] might be attributed to the inertia of the relatively massive tongue body, since [–anterior] consonants are produced with a relatively high tongue body position. Furthermore, the English [–anterior] consonants (/ʃ/, /tʃ/, /ʒ/, /dʒ/ and /ɹ/) are produced by many speakers with notable lip rounding (Brown, 1981): consequently, the extent of anticipation might be a compensation for the inertia of the lips.

The implications for exemplar-based models of speech perception are mixed. On the one hand, the experiment shows that the acoustic correlates of phonological contrasts (including some very subtle and distributed aspects of their realizations) can be discovered: this is, of course, good news. But, even though the method makes no presuppositions about where in the signal and in which parameters the contrast resides, it does require the information that five tokens are versions of one word, and five are of another word. How the human linguistic system does this is far from clear, though it should be remembered that the words in question have different meanings. An outline of such an exemplar-based model was given in Coleman (2002); Fig. 5 presents an illustration of how a few similar words may be represented and discriminated from one another.

Fig. 5. Associations between semantic representations and paths in a phonetic space are necessary in order to encode differences between words in an exemplar-based model.

In this figure, the phonetic representations of the different words car, carp and cart are paths (i.e., trajectories) in a psychophysical space. The nodes labeled [k] and [ɑ] have excitatory connections to all three semantic representations. (Excitatory connections between phonetics and semantics are shown by dashed lines, and inhibitory connections by solid lines.) The later nodes of these word-form trajectories, where the forms diverge, are associated with distinct semantic representations. The separation of form–meaning connections means that the difference in form is a phonological contrast. The formal locus of the contrast is toward the end of the two words. According to this model, learning phonology also requires learning meanings: phonological contrasts cannot simply be inferred by phonetic clustering. How a child learns that different auditory percepts are associated with different meanings is a chicken-and-egg question that lies outside the scope of the present study.

³ Hawkins and Smith (2001) present an instance of a similar phenomenon: differences in the /hu/ of who sharpened vs. who's sharpened.

Appendix A. Local contrasts

A.1. Differences in vowels

[back]

a/ʌ In brash/brush and thrash/thrush, spectral differences were found at the end of the vowels, around the transition to the /ʃ/, and during /ʃ/. In rang/rung and spank/spunk, spectral differences were only found in the vowel, but in clang/clung, drank/drunk, shrank/shrunk, slang/slung, sprang/sprung and stank/stunk, there were also differences in the following /ŋ/. In grab/grub and shrank/shrunk, the liquids were a little different. The overall pattern of this group of words was that spectral differences are most evident in the latter part of the vowel, in the following nasal, and sometimes in the preceding liquid. Obstruents preceding the liquid also exhibit slight differences, a fact that I regard as an extended case of consonant-vowel coarticulation, though it is worth pointing out that the initial obstruents in grab/grub, shrank/shrunk, spank/spunk and stank/stunk are not adjacent to the vowels, as a linear, phonemic representation of these words would have it. (However, they are within the 200–250 ms “window of coarticulation” advanced by Fowler and Saltzman (1993).)

əː/eə Very few differences were found between blur and blare, only differences in the vowel spectra for a few frames. In fur/fair, differences in the vowels were found for many spectral parameters across the whole vowel. Purr/pair exhibited only slight differences in the vowel, as well as in the aspiration of the preceding /p/. No differences were found in the vowels of myrrh/mare, but there were many differences in the following /t/.

[back, round]

ɒ/e In choc/check, there were extensive spectral differences between the vowels, the [ʃ] part of the preceding affricate, and the closure and release of the following /k/. In shod/shed, there were differences in /ʃ/, vowel and following /ə/. In vox/vex, the only differences of note were in the vowel spectra, but in grog/Greg, the final /g/'s were different too.

iː/uː In G/Jew, extensive spectral differences were found, as expected, in the vowels. In flee/flew, unusually, hardly any differences were discerned in the vowels, but there were many spectral differences in the middle of the preceding /l/, especially broadband spectral differences below 2.3 kHz. The following /ə/ also exhibited some small spectral differences. In ghee/goo, there were various spectral differences in the vowel, the /g/ release, and the preceding /ə/.

iə/ʊə In dear/dour, F2 of the preceding /ə/ is lower (mean of 1612 Hz) before dour than before dear (mean F2(ə) = 1735 Hz). This is a fairly standard example of V–V coarticulation. However, in tear/tour, no parallel distinction was found, though the aspiration of the following /t/ was different.

[high]

əʊ/əː In O/err, numerous spectral differences were found in the off-glides of the diphthong, as would be expected.


e/i In crept/crypt and head/hid, the algorithm found very few differences at all, fewer than in the lap/Lapp control condition.

[long]

ɒ/ɔː In shod/shored, any durational cues to the [long] distinction would be obliterated by the time warping employed in the alignment of tokens of the two words. Nevertheless, numerous differences in the vowel spectra were found.
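To see why durational cues cannot survive alignment, consider a bare-bones dynamic time warping routine. This is only an illustrative sketch: the alignment procedure actually used is the one described in Section 3, and the function name `dtw_path`, the absolute-difference cost and the toy contours below are assumptions made for the example. Two contours with the same shape but different durations are mapped onto one shared sequence of frame pairs, so the difference in length disappears from the aligned data.

```python
def dtw_path(x, y):
    """Return the minimum-cost alignment path between sequences x and y.

    Classic dynamic time warping with an absolute-difference local cost.
    The path is a list of (i, j) index pairs from (0, 0) to (len(x)-1, len(y)-1).
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Trace back from the end to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return path

# Toy contours (made up): same shape, but the second holds its peak for longer.
short_token = [0.0, 1.0, 2.0, 1.0, 0.0]
long_token = [0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
path = dtw_path(short_token, long_token)
aligned_short = [short_token[i] for i, _ in path]
aligned_long = [long_token[j] for _, j in path]
```

After alignment the two toy tokens have the same number of frames and identical values, even though one was originally longer; only differences in the parameter values themselves remain available for comparison.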

[low]

a/e Very few differences were found between at/ate and thrash/thresh.

əː/ɑː In err/ah, many spectral differences were found in the vowels. For 3 of the 15 spectral parameters, differences in the aspiration of the following /t/ were also found.

ɑi/ei Although Guy and gay differ phonemically in the first part of their diphthongs (differences that were discovered in spectral coefficients a11 to a15, c. 1–1.3 kHz), source differences were found in the off-glide /i/ and the following schwa.

[low, round]

ɒ/ʌ In golf/gulf, no differences were found in the vowels and only sporadic differences were found in the consonants.

ɑː/ɔː In ah/awe, there were clear differences in the vowel spectra above 1.2 kHz, but no other differences of note.

[round]

əː/ɔː In err/awe, very many differences were found in the vowels, and in the aspiration of the following /t/. In spur/spore, apart from some differences in the vowel qualities, a small (c. 40 Hz) difference in F2 was found in the initial /s/. In spore, the mean F2 of /s/ was 1624 Hz, whereas in spur, the mean F2 of /s/ was 1664 Hz.

A.2. Local contrasts in consonants

[anterior]

s/ʃ Differences were found between the /s/ and /ʃ/ of sire/shire and suit/shoot. In sire/shire, differences were also found at the beginning of the vowel, and in the preceding /ə/.

d/dʒ Spectral differences were found relating to the absence or presence of [ʒ] in D/G, day/J, and aid/age. In D/G, differences were also found at the start and end of the /iː/, and in aid/age in the transition into the following /ə/.

t/tʃ Differences in the [ʃ] and following vowels were found in tare/chair, tear/cheer, tor/chore and eight/H. In tor/chore and eight/H, the preceding vowels also showed the difference (even though, segmentally speaking, they are not adjacent to the [ʃ]). Examination of spectrograms of tor and chore suggests that the differences in /ɔ/ above 5.3 kHz are due to high-frequency noise from [ʃ] spilling over into the vowel.

[back] in an initial consonant contrast, h/ʃ. Unsurprisingly, in hair/share, differences were found in many parameters during and adjacent to the initial consonant.


[continuant]

ʃ/tʃ bush/butch differ in that bush has friction from the end of the vowel to the end of the word, whereas butch has a stop closure interval after the vowel, before the friction commences. The algorithm had no trouble discovering this difference in the RMS amplitude parameter and in the spectral parameters, as well as other spectral differences in the preceding and following vowels and the final friction.

[coronal]

ð/v In thee/V, it was not surprising to find spectral differences between /ð/ and /v/, as well as differences at the end of the preceding /ə/ (below 1.5 kHz) and at the beginning of the following /iː/ (across the spectrum, from 1.06 to 8 kHz).

l/w In Glen/Gwen, there was less energy in the 1.06–1.5 kHz (F2) range in the preceding /ə/ before Gwen than Glen. This may be a reflex of the greater degree of lip rounding in anticipation of the /w/ in Gwen than in Glen. Differences were also found in the initial cluster, as expected.

[labial]

b/g In B/ghee, differences were found in almost every parameter. As well as spectral differences intrinsic to the closure and release of the stops, the previous /ə/ and following /iː/ were different on many spectral parameters.

[lateral]

ð/l In they/lay, differences in many spectral parameters were found in the initial consonant, the preceding /ə/ and the early part of the diphthong. In addition, there was a difference in the RMS amplitude of /ð/ and /l/.

j/l Likewise, in yaw/law there were spectral differences in the initial sonorant, the preceding /ə/ and the early part of the following /ɔː/.

l/ɹ In blush/brush, most differences were found in the /b/ release, as well as some spectral differences in the preceding /ə/ (which, note, is not actually adjacent to the liquid), the liquid and the following vowel. In cloud/crowd, spectral differences above 2.7 kHz were found in the liquids. There were spectral differences at the /k/ release in the 1.3–2.3 kHz region, and in the /k/ aspiration from 3.2 to 8 kHz. There were also spectral differences in the preceding /ə/, even though it is not adjacent to the liquid. In clue/crew, the main differences were in the liquid and the preceding aspiration. In flay/fray, flee/free, flesh/fresh, flow/fro, fly/fry, and glade/grade, there were differences in the liquid, the preceding consonant and the following vowel. In flogs/frogs, however, the main differences were in the /f/.

[nasal]

b/m As well as differing in nasality, B and me contrast in continuance. Furthermore, although the initial stop of B is usually categorized as [+voice], it is rarely spoken with vocal cord vibration in English. Consequently, the algorithm found differences between initial /b/ and /m/ in several source parameters, as well as spectrally. The spectral differences were found in the consonant and the beginning of the vowel, right across the frequency range (i.e., up to 8 kHz). The RMS amplitude during /m/ is greater than /b/, and it has a higher f0, as voicing is sustained throughout the nasal.


l/n In belled/bend, there were few differences of note apart from a few spectral differences in the subsequent /ə/.

[nasal, round]

m/w Smell and swell differ in lip rounding as well as nasality, but perhaps more importantly, /m/ has oral closure. Several of the tokens of smell in the database were spoken with a short interval of complete closure, with no voicing, between the /s/ and /m/, with the consequence that all of the source parameters showed differences relating to the closure interval between /m/ and /w/.

[voice]

In ð/θ, little evidence of coarticulation was found. In loathe/loath, soothe/sooth and thy/thigh, source and spectral differences were mostly found in the /ð/ or /θ/.

Final b/p in tab/tap were differentiated by source parameters only, reflecting the voicing distinction. Similarly, in tribe/tripe, hardly any differences at all were found, except in f0 and p(voice) (the ESPS binary-valued “probability of voicing” estimate, with values 0 or 1) during the labial closure.

bd/pt In cribbed/crypt, mobbed/mopped and webbed/wept, source and spectrum differences were largely restricted to the closure interval of the entire cluster, with only a few differences in the preceding vowels. Few differences were found in the release of the final /t/, but the f0, p(voice) and RMS parameters during the closure were found to be different, in accordance with the fact that /bd/ is voiced and /pt/ voiceless.

In initial d/t in D/tea, numerous differences were found in the aspiration of the plosives, as well as broadband spectral differences at the end of the preceding /ə/. The aspiration of the following /t/ was different at around 1.2 kHz. Although this difference was small, it is consistent with other evidence of coronal consonant “harmony”, detailed in Section 4.3.

The final d/t contrast in add/at, aid/eight, bleed/bleat, cloud/clout, dwelled/dwelt, glowed/gloat, grade/great, hide/height, lent/lend, lied/light, made/mate, mend/meant, nude/newt, played/plate, plead/pleat, plod/plot, slide/sleight and smelled/smelt was manifest in differences in the preceding vowels (or preceding laterals in dwelled/dwelt), the final consonant release and aspiration, and sometimes in the following /ə/. In lied/light, the earliest differences found were in the middle of the /ɑi/. In brood/brute, there were only differences in the final /t/ or /d/.

In mend/meant, although differences in the same few frames of the preceding /e/ were found in six of the 15 spectral parameters (1–1.1 and 2.7–4 kHz), curiously, no differences were found in the intervening /n/!

In agreement with the finding of van Santen et al. (1992) and Hawkins and Nguyen (in press a, 2004) that the secondary articulation of initial /l/ may differ according to the voicing of final obstruents, small differences were found in the onset /l/ of bleed/bleat, leads/lets, learned/learnt, lend/lent, slide/sleight and splodge/splotch. However, these are amplitude differences at specific frequencies, and are much subtler than the audible clear/dark differences in /l/ that have been previously noted. Whether these findings should therefore be disregarded (as an unperceivable contrast, or as spurious) or celebrated (as showing the resolving power of the analysis method) is unclear to me. No differences were found in the initial /l/ of league/leak or less/Les, though differences in the vowel and the following /ə/ were found in both cases, as expected.

References

Beeler, M. S. (1970). Sibilant harmony in Chumash. International Journal of American Linguistics, 36, 14–17.

Brown, G. (1981). Consonant rounding in British English: The status of phonetic descriptions as historical data. In R. E. Asher & E. J. A. Henderson (Eds.), Towards a history of phonetics (pp. 67–76). Edinburgh: Edinburgh University Press.

Butler, C. (1985). Statistics in linguistics. Oxford: Blackwell.

Bybee, J. (2000). Lexicalization of sound change and alternating environments. In M. B. Broe & J. B. Pierrehumbert (Eds.), Papers in laboratory phonology V: Acquisition and the lexicon (pp. 250–268). Cambridge: Cambridge University Press.

Chen, M. (1970). Vowel length variation as a function of the voicing of the consonant environment. Phonetica, 22, 129–159.

Clements, G. N. (2000). Phonology. In B. Heine & D. Nurse (Eds.), African languages: An introduction (pp. 123–160). Cambridge: Cambridge University Press. (Chapter 6)

Coleman, J. (2002). Phonetic representations in the mental lexicon. In J. Durand & B. Laks (Eds.), Phonetics, phonology and cognition (pp. 96–130). Oxford: Oxford University Press.

Démonet, J.-F., Thierry, G., & Nespoulous, J.-L. (2002). Towards imaging the neural correlates of language functions. In J. Durand & B. Laks (Eds.), Phonetics, phonology, and cognition (pp. 244–253). Oxford: Oxford University Press.

Firth, J. R., & Rogers, B. B. (1937). The structure of the Chinese monosyllable in a Hunanese dialect (Changsha). Reprinted in J. R. Firth, Papers in linguistics 1934–1951 (pp. 76–91). Oxford: Oxford University Press.

Fowler, C. A., & Saltzman, E. (1993). Coordination and coarticulation in speech production. Language and Speech, 36, 171–195.

Gaskell, M. G., & Marslen-Wilson, W. D. (1996). Phonological variation and inference in lexical access. Journal of Experimental Psychology: Human Perception and Performance, 22, 144–158.

Goldinger, S. D. (1997). Words and voices: perception and production in an episodic lexicon. In K. Johnson & J. W. Mullenix (Eds.), Talker variability in speech processing (pp. 33–66). San Diego: Academic Press.

Hawkins, S., & Nguyen, N. (in press a). Effects on word recognition of syllable-onset cues to syllable-coda voicing. In J. K. Local, R. A. Ogden, & R. A. M. Temple (Eds.), Papers in laboratory phonology VI. Cambridge: Cambridge University Press.

Hawkins, S., & Nguyen, N. (2004). Influence of syllable-coda voicing on the acoustic properties of syllable-onset /l/ in English. Journal of Phonetics, 32, doi:10.1016/S0095-4470(03)00031-7.

Hawkins, S., & Slater, A. (1994). Spread of CV and V-to-V coarticulation in English. Proceedings of the third international conference on spoken language processing, Vol. 1 (pp. 57–60).

Hawkins, S., & Smith, R. (2001). Polysp: A polysystemic, phonetically-rich approach to speech understanding. Italian Journal of Linguistics / Rivista di Linguistica, 13, 99–188.

Heid, S., & Hawkins, S. (2000). An acoustical study of long domain /r/ and /l/ coarticulation. Proceedings of the fifth seminar on speech production: models and data, and CREST workshop on models of speech production: motor planning and articulatory modelling. Munich: Institut für Phonetik und Sprachliche Kommunikation, Ludwig-Maximilians-Universität (pp. 77–80).

Henderson, E. J. A. (1948). Prosodies in Siamese: A study in synthesis. Reprinted in F. R. Palmer (Ed.), (1970). Prosodic analysis. Oxford: Oxford University Press.

Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87, 1738–1752.

Hooper, J. B. (1977). Substantive evidence for linearity: Vowel length and nasality in English. In W. A. Beach, S. E. Fox, & S. Philosoph (Eds.), Papers from the 13th Regional Meeting, Chicago Linguistic Society (pp. 152–164).

Hooper, J. B. (1981). The empirical determination of phonological representations. In T. Myers, J. Laver, & J. Anderson (Eds.), The cognitive representation of speech (pp. 347–357). Amsterdam: North-Holland.


Johnson, K. (1997). Speech perception without speaker normalization. In K. Johnson & J. W. Mullenix (Eds.), Talker variability in speech processing (pp. 145–165). San Diego: Academic Press.

Kari, J. H. (1976). Navaho verb prefix phonology. New York: Garland.

Kelly, J., & Local, J. K. (1986). Long domain resonance patterns in English. In International conference on speech input/output; techniques and applications (pp. 304–9). Conference Publication No. 258. London: Institute of Electrical Engineers.

Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67, 971–995.

Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82, 737–793.

Ladefoged, P. (2001). Vowels and consonants: An introduction to the sounds of language. Oxford: Blackwell.

Lahiri, A., & Hankamer, J. (1988). The timing of geminate consonants. Journal of Phonetics, 16, 327–338.

Lisker, L. (1986). “Voicing” in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and Speech, 29, 3–11.

Mithun, M. (1999). The languages of native North America. Cambridge: Cambridge University Press.

Olive, J. P., Greenwood, A., & Coleman, J. S. (1993). Acoustics of American English speech: A dynamic approach. New York: Springer.

Pisoni, D. B. (1997a). Some thoughts on “normalization” in speech perception. In K. Johnson & J. W. Mullenix (Eds.), Talker variability in speech processing (pp. 9–32). San Diego: Academic Press.

Pisoni, D. B. (1997b). Perception of synthetic speech. In J. P. H. van Santen, R. W. Sproat, J. P. Olive, & J. Hirschberg (Eds.), Progress in speech synthesis (pp. 541–560). New York: Springer.

Poser, W. J. (1982). Phonological representations and “action-at-a-distance”. In H. van der Hulst & N. Smith (Eds.), The structure of phonological representations, Part II. Dordrecht: Foris.

Pratt, R. L. (1986). On the intelligibility of synthetic speech. Proceedings of the Institute of Acoustics, 8(7), 183–192.

Russell, K. (1993). A constraint-based approach to phonology and morphology. Ph.D. thesis, University of Southern California.

Shaw, P. A. (1991). Consonant harmony systems: The special status of coronal harmony. In C. Paradis & J.-F. Prunet (Eds.), Phonetics and phonology, Vol. 2: The special status of coronals: internal and external evidence. New York: Academic Press.

Slater, A., & Coleman, J. (1996). Non-segmental analysis and synthesis based on a speech database. In H. T. Bunnell & W. Idsardi (Eds.), Proceedings of ICSLP 96, fourth international conference on spoken language processing, Vol. 4 (pp. 2379–2382).

Stevens, K. (1998). Acoustic phonetics. Cambridge, MA: MIT Press.

Tunley, A. (1999). Coarticulatory influences of liquids on vowels in English. Unpublished Ph.D. dissertation, University of Cambridge.

van Santen, J. P. H. (1997). Segmental duration and speech timing. In Y. Sagisaka, N. Campbell, & N. Higuchi (Eds.), Computing prosody: computational models for processing spontaneous speech (pp. 225–249). New York: Springer.

van Santen, J. P. H., Coleman, J. S., & Randolph, M. A. (1992). Effects of postvocalic voicing on the time course of vowels and diphthongs. Journal of the Acoustical Society of America, 92(4) Part 2, 2444.

West, P. (1999). The extent of coarticulation of English liquids: An acoustic and articulatory study. Proceedings of the international conference of phonetic sciences, San Francisco (pp. 1901–1904).

West, P. (2000a). Perception of distributed coarticulatory properties of English /l/ and /ɹ/. Journal of Phonetics, 27, 405–425.

West, P. (2000b). Long-distance coarticulatory effects of English /l/ and /ɹ/. Unpublished D.Phil. thesis, University of Oxford.
