
Haskins Laboratories Status Report on Speech Research 1989, SR-99/100, 102-117

The Perception of Phonetic Gestures*

Carol A. Fowler† and Lawrence D. Rosenblum††

We have titled our presentation "The perception of phonetic gestures" as if phonetic gestures are perceived. By phonetic gestures we refer to organized movements of one or more vocal-tract structures that realize phonetic dimensions of an utterance (cf. Browman & Goldstein, 1986; in press a). An example of a gesture is bilabial closure for a stop, which includes contributions by the jaw and the upper and lower lips. Gestures are organized into larger segmental and suprasegmental groupings, and we do not intend to imply that these larger organizations are not perceived as well. We focus on gestures to emphasize a claim that, in speech, perceptual objects are fundamentally articulatory as well as linguistic.

That is, in speech perception, articulatory events have a status quite different from that of their acoustic products. The former are perceived, whereas the latter are the means (or one of the means) by which they are perceived.

A claim that phonetic gestures are perceived is not uncontroversial, of course, and there are other points of view (e.g., Massaro, 1987; Stevens & Blumstein, 1981). We do not intend to consider these other views here, however, but instead to focus on agreements and disagreements between two theoretical perspectives from which the claim is made. Accordingly, we begin by summarizing some of the evidence that, in our view, justifies it.

Phonetic gestures are perceived: Three sources of evidence

1. Correspondence failures between acoustic signal and percept: Correspondences between gestures and percept

Perhaps the most compelling evidence that gestures, and not their acoustic products, are perceptual objects is the failure of dimensions of speech percepts to correspond to obvious dimensions of the acoustic signal and their correspondence, instead, to phonetically-organized articulatory behaviors that produce the signal. We offer three examples, all of them implicating articulatory gestures as perceptual objects and the third showing most clearly that the perceived gestures are not surface articulatory movements but, rather, linguistically-organized gestures.

Preparation of this manuscript was supported by a fellowship from the John Simon Guggenheim Memorial Foundation to the first author and by grants NIH-NICHD HD-01994 and NINCDS NS-13617 to Haskins Laboratories.

a. Synthetic /di/ and /du/

One example from the early work at Haskins Laboratories (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967) is of synthetic /di/ and /du/. Monosyllables, such as those in Figure 1, can be synthesized that consist only of two formants. The information specifying /d/ (rather than /b/ or /g/) in both syllables is the second formant transition. These transitions are very different in the two syllables, and, extracted from their syllables, they sound very different too. Each sounds more-or-less like the frequency glide it resembles in the visible display. Neither sounds like /d/. In the context of their respective syllables, however, they sound alike and they sound like /d/.
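(A two-formant pattern of this kind is easy to approximate digitally. The sketch below, in Python, is our own illustration rather than the original Haskins synthesis: it drives two cascaded second-order resonators with a crude impulse-train source, and the formant values are merely illustrative.)

```python
import numpy as np

FS = 10_000  # sample rate in Hz; an arbitrary choice for this sketch

def resonator_coeffs(freq, bw, fs=FS):
    """Two-pole resonator coefficients for one sample (Klatt-style)."""
    r = np.exp(-np.pi * bw / fs)
    b1 = 2 * r * np.cos(2 * np.pi * freq / fs)
    b2 = -r * r
    a0 = 1 - b1 - b2  # rough gain normalization
    return a0, b1, b2

def synthesize(f1_track, f2_track, f0=120, fs=FS):
    """Filter an impulse-train voicing source through two cascaded
    resonators whose center frequencies follow the formant tracks."""
    n = len(f1_track)
    out = np.zeros(n)
    out[::fs // f0] = 1.0  # crude glottal pulse train
    for track_vals, bw in ((f1_track, 60.0), (f2_track, 90.0)):
        y = np.zeros(n)
        for i in range(n):
            a0, b1, b2 = resonator_coeffs(track_vals[i], bw)
            y[i] = a0 * out[i] + b1 * y[i - 1] + b2 * y[i - 2]
        out = y
    return out / np.max(np.abs(out))

def track(start, steady, dur=0.30, trans=0.05, fs=FS):
    """A linear formant transition followed by a steady state."""
    n_trans, n_total = int(trans * fs), int(dur * fs)
    return np.concatenate([np.linspace(start, steady, n_trans),
                           np.full(n_total - n_trans, steady)])

# Illustrative formant values only (not the original stimulus values):
di = synthesize(track(200, 270), track(2200, 2600))  # rising F2 into /i/
du = synthesize(track(200, 300), track(1200, 700))   # falling F2 into /u/
```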

The consonantal segments in /di/ and /du/ are produced alike too, by a constriction and release gesture of the tongue tip against the alveolar ridge of the palate. When listeners perceive the synthetic /di/ and /du/ syllables of Figure 1, their percepts correspond to the implied constriction and release gestures, not, it seems, to the context-sensitive acoustic signal.

Figure 1. Synthetic syllables /di/ and /du/. The second formant transitions identify the initial consonant as /d/ rather than as /b/ or /g/.

b. Functional equivalence of acoustic "cues"

We expect listeners to be very good at distinguishing an interval of silence from nonsilence (from a set of frequency glides, for example). Too, we expect them to distinguish acoustic signals that differ in two ways more readily than signals that differ in just one of the two ways. Both of these expectations are violated (another example of noncorrespondence) if the silence and glides are joint acoustic products of a common constriction and release gesture for a stop consonant.

Fitch, Halwes, Erickson, and Liberman (1980) created synthetic syllables identified as "slit" or "split" by varying the duration of a silent interval following the fricative and manipulating the presence or absence of transitions for a bilabial stop following the silent interval. A relatively long silent interval and the presence of transitions both signal a bilabial stop, the silent interval cuing the closure and the transitions the release. Fitch et al. found that pairs of syllables differing on both cue dimensions, duration of silence and presence/absence of transitions, were either more discriminable than pairs differing in one of these ways or less discriminable, depending on how the cues were combined. A syllable with a long silent interval and transitions was highly discriminable from a syllable with a shorter silent interval and no transitions; the one was identified as "split" and the other as "slit." A syllable with a short silent interval and transitions was nearly indiscriminable from one with a longer interval and no transitions; both were identified as "split." Syllables differing in two ways are indiscriminable just when the acoustic cues that distinguish them are "functionally equivalent," that is, when they cue the same articulatory gesture. A long silent interval does not normally sound like a set of frequency glides, but it does in a context in which each specifies a consonantal constriction.
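(To make the manipulation concrete, here is a minimal sketch, our reconstruction with hypothetical placeholder arrays rather than the Fitch et al. materials, of how the two cue dimensions combine in a stimulus:)

```python
import numpy as np

def make_token(fricative, vocalic, transitions, silence_ms, fs,
               with_transitions):
    """Concatenate fricative noise + silence (+ optional bilabial-stop
    transitions) + vocalic portion, mimicking the two cue dimensions:
    silent-interval duration and presence/absence of transitions."""
    silence = np.zeros(int(fs * silence_ms / 1000))
    parts = [fricative, silence]
    if with_transitions:
        parts.append(transitions)  # cues the bilabial release ("split")
    parts.append(vocalic)
    return np.concatenate(parts)
```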

c. Perception of intonation

The findings just summarized, among others, reveal that listeners perceive gestures. Apparently, listeners do not perceive the acoustic signal per se.

Nor, however, do they perceive "raw" articulatory motions as such. Rather, they perceive linguistically-organized (phonetic) gestures. Research on the various ways in which fundamental frequency (henceforth, f0) is perceived shows this most clearly.

Perceived intonational peak height will not, in general, correspond to the absolute rate at which the vocal folds open and close during production of the peak. Instead, perception of the peak corresponds to just those influences on the rate of opening and closing that are caused by gestures intended by the talker to affect intonational peak height. (Largely, intonational melody is implemented by contraction and relaxation of muscles of the larynx that tense the vocal folds; see, e.g., Ohala, 1978.) There are other influences on the rate of vocal fold opening and closing that may either decrease or increase f0. Some of these influences, due to lung deflation during an expiration ("declination," Gelfer, Harris, & Baer, 1987; Gelfer, Harris, Collier, & Baer, 1985) or to segmental perturbations reflecting vowel height (e.g., Lehiste & Peterson, 1961) and obstruent voicing (e.g., Ohde, 1984), are largely or entirely automatic consequences of other things that talkers are doing (producing an utterance on an expiratory airflow, producing a close or open vowel [Honda, 1981], producing a voiced or voiceless obstruent [Ohde, 1984]; Löfqvist, Baer, McGarr, & Seider Story, in press). They do not sound like changes in pitch; rather, they sound like what they are: information for early-to-late serial position in an utterance in the case of declination (Pierrehumbert, 1979; see also Lehiste, 1982), and information for vowel height (Reinholt Peterson, 1986; Silverman, 1987) or consonant voicing (e.g., Silverman, 1986) in the case of segmental perturbations.

As we will suggest below (under "How acoustic structure may serve as information for gestures"), listeners apparently use configurations of changes in different acoustic variables to recover the distinct, organized articulatory systems that implement the various linguistic dimensions of talkers' utterances. By using acoustic information in this way, listeners can recover what Liberman (1982) has called the talker's "phonetic intents."

2. Audio-visual integration of gestural information

A video display of a face mouthing /ga/ synchronized with an acoustic signal of the speaker saying /ba/ is heard most typically as "da" (MacDonald & McGurk, 1978). Subjects' identifications of syllables presented in this type of experiment reflect an integration of information from the optical and acoustic sources. Too, as Liberman (1982) points out, the integration affects what listeners experience hearing to an extent that they cannot tell what contribution to their perceptual experience is made by the acoustic signal and what by the video display.1

Why does integration occur? One answer is that both sources of information, the optical and the acoustic, provide information apparently about the same event of talking, and they do so by providing information about the talkers' phonetic gestures.

3. Shadowing

Listeners' latency to repeat a syllable they hear is very short: in Porter's research (Porter, 1978; Porter & Lubker, 1980), around 180 ms on average. Even though these latencies are obtained in a choice reaction time procedure (in which the vocal response required differs for different stimuli), latencies approach simple reaction times (in which the same response occurs to any stimulus), and they are much shorter than choice reaction times using a button press.

Why should these particular choice reaction times be so fast? Presumably, the compatibility between stimulus and response explains the fast response times. Indeed, it effectively eliminates the element of choice. If listeners perceive the talker's phonetic gestures, then a response that reproduces those gestures requires essentially no choice at all.

The motor theory

Throughout most of its history, the motor theory (e.g., Liberman, Cooper, Harris, & MacNeilage, 1963; Liberman et al., 1967; Liberman & Mattingly, 1985; see also Cooper, Delattre, Liberman, Borst, & Gerstman, 1952) has been the only theory of speech perception to identify the phonetic gesture as an object of perception. Here we describe the motor theory by discussing what, more precisely, the motor theorists have considered to be the object of perception, how they characterize the process of speech perception, and why, recently, they have introduced the idea that speech perception is accomplished by a specialized module.

What is perceived for the motor theorist?

Coarticulation is the reason why the acoustic signal appears to correspond so badly to the sequences of phonemes that talkers intend to produce. Due to coarticulation, phonemes are produced in overlapping time frames so that the acoustic signal is everywhere (or nearly everywhere; see, e.g., Stevens & Blumstein, 1981) context-sensitive. This makes the signal a complex "code" on the phonemes of the language, not a cipher, like an alphabet.2 In "Perception of the speech code" (1967), Liberman and his colleagues speculated that coarticulatory "encoding" is, in part, a necessary consequence of properties of the speech articulators (their sluggishness, for example). However, in their view, coarticulation is also promoted both by the nature of phonemes themselves (that they are realized by sets of subphonemic features3) and by the listener's short-term memory, which would be overtaxed by the slow transmission rate of an acoustic cipher.

In producing speech, talkers exploit the fact that the different articulators (the lips, velum, jaw, etc.) can be independently controlled. Subphonemic features, such as lip rounding, velum lowering, and alveolar closure, each use subsets of the articulators, often just one; therefore, more than one feature can be produced at a time. Speech can be produced at rapid rates by allowing "parallel transmission" of the subphonemic features of different phonemes. This increases the transmission rates for listeners, but it also creates much of the encoding that is considered responsible for the apparent lack of invariance between acoustic and phonetic segments.

The listener's percept corresponds, it seems, neither to the encoded cues in the acoustic signal nor even to the also-encoded succession of vocal tract shapes during speech production, but instead to a sequence of discrete, unencoded phonemes, each composed of its own component subphonemic features. To explain why "perception mirrors articulation more closely than sound" (p. 453) and (yet) achieves recovery of discrete unencoded phonemes, the motor theorists proposed as a first hypothesis that perceivers somehow access their speech-motor systems in perception and that the percept they achieve corresponds to a stage in production before encoding of the speech segments takes place. In "Perception of the speech code," the stage was one in which "motor commands" to the muscles were selected to implement subphonemic features. In "The motor theory revised" (Liberman & Mattingly, 1985), a revision to the theory reflects developments in our understanding of motor control. Evidence suggests that activities of the vocal tract are products of functional couplings among articulators (e.g., Folkins & Abbs, 1975, 1976; Kelso, Tuller, Vatikiotis-Bateson, & Fowler, 1984), which produce gestures as defined earlier, not independent movements of the articulators identified with subphonemic features in "Perception of the speech code." In "The motor theory revised," control structures for gestures have replaced motor commands for subphonemic features as invariants of production and as objects of perception for listeners.4 Like subphonemic features, control structures are abstract, prevented by coarticulation from making public appearances in the vocal tract. Liberman and Mattingly write of the perceptual objects of the revised theory:

We would argue, then, that the gestures do have characteristic invariant properties, as the motor theory requires, though these must be seen, not as peripheral movements, but as the more remote structures that control the movements. These structures correspond to the speaker's intentions. (p. 23)

In recovering abstract gestures, processes of speech perception yield quite different kinds of perceptual objects than general auditory perception. In auditory perception, more generally, according to Liberman and Mattingly, listeners hear the signal as "ordinary sound" (p. 6); that is, they hear the acoustic signal as such. In other publications, Mattingly and Liberman (1988) refer to this apparently more straightforward perceptual object as "homomorphic," in contrast to objects of speech perception, which are "heteromorphic." An example they offer of homomorphic auditory perception is perception of isolated formant transitions, which sound like the frequency glides they resemble in a spectrographic display.

How perception takes place in the motor theory

In the motor theory, listeners use "analysis by synthesis" to recover phonetic gestures from the encoded, informationally-impoverished acoustic signal. This aspect of the theory has never been worked out in detail. However, in general, analysis by synthesis consists in analyzing a signal by guessing how the signal might have been produced (e.g., Stevens, 1960; Stevens & Halle, 1964). Liberman and Mattingly refer to an "internal, innately specified vocal-tract synthesizer that incorporates complete information about the anatomical and physiological characteristics of the vocal tract and also about the articulatory and acoustic consequences of linguistically significant gestures" (p. 26). The synthesizer computes candidate gestures and then determines which of those gestures, in combination with others identified as ongoing in the vocal tract, could account for the acoustic signal.
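(Although the theory leaves the mechanism unspecified, the general control flow of analysis by synthesis can be sketched abstractly. The Python skeleton below is our illustration only; the candidate set, the internal synthesizer, and the distance function are all placeholders, not components of the motor theory itself.)

```python
def analysis_by_synthesis(signal, candidates, synthesize, distance):
    """Return the candidate gesture set whose synthesized acoustic
    consequences best match the incoming signal."""
    best, best_score = None, float("inf")
    for gestures in candidates:           # hypothesized vocal-tract acts
        predicted = synthesize(gestures)  # internal vocal-tract model
        score = distance(predicted, signal)
        if score < best_score:
            best, best_score = gestures, score
    return best                           # the perceived gestures
```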

Speech perception as modular

If speech perception does involve accessing the speech-motor system, then it must indeed be special and quite distinct from general auditory perception. It is special in its objects of perception, in the kinds of processes applied to the acoustic signal, and presumably in the neural systems dedicated to those processes as well. Liberman and Mattingly propose that speech perception is achieved by a specialized module.

A module (Fodor, 1983) is a cognitive system that tends to be narrowly specialized ("domain specific"), using computations that are special ("eccentric") to its domain; it is computationally autonomous (so that different systems do not compete for resources) and prototypically is associated with a distinct neural substrate. In addition, modules tend to be "informationally encapsulated," bringing to bear on the processing they do only some of the relevant information the perceiver may have; in particular, processing of "input" (perceptual) systems, prime examples of modules, is protected early on from bias by "top-down" information.

The speech perceptual system of the motor theory has all of these characteristics. It is narrowly specialized and its perception-production link is eccentric; moreover, it is associated with a specialized neural substrate (e.g., Kimura, 1961). In addition, as the remarkable phenomenon of duplex perception (e.g., Liberman, Isenberg, & Rakerd, 1981; Rand, 1974) suggests, the speech perceiving system is autonomous and informationally encapsulated.

In duplex perception as it is typically investigated (e.g., Liberman et al., 1981; Mann & Liberman, 1983; Repp, Milburn, & Ashkenas, 1983), most of an acoustic CV syllable (the "base" at the left of Figure 2) is presented to one ear while the remainder, generally a formant transition (either of the "chirps" on the right side of Figure 2), is presented to the other ear. Heard in isolation, the base is ambiguous between "da" and "ga," but listeners generally report hearing "da." (It was identified as "da" 87% of the time in the study by Repp et al., 1983.) In isolation, the chirps sound like the frequency glides they resemble; they do not sound speech-like. Presented dichotically, listeners integrate the chirp and the base, hearing the integrated "da" or "ga" in the ear receiving the base. Remarkably, in addition, they hear the chirp in the other ear. Researchers who have investigated duplex perception describe it as perception of the same part of an acoustic signal in two ways simultaneously. If that characterization is correct, it implies strongly that the percepts are outputs of two distinct and autonomous perceptual systems, one specialized for speech and the other perhaps general to other acoustic signals.
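(The dichotic arrangement itself is simple to set up: the base goes in one channel of a stereo signal and the transition in the other. The sketch below is a minimal illustration with stand-in signals, noise for the base and a sinusoidal glide for the transition, not the actual syllable stimuli.)

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import chirp

FS = 10_000  # sample rate in Hz; arbitrary for this sketch

# Stand-ins for the real stimuli: noise where the shared syllable
# portion ("base") would go, and a brief falling F3-like glide.
rng = np.random.default_rng(0)
base = 0.3 * rng.standard_normal(int(0.25 * FS))
t = np.arange(int(0.05 * FS)) / FS
glide = 0.5 * chirp(t, f0=2700, t1=t[-1], f1=2200)

# Pad the glide to the base's length so the channels align in time.
glide = np.pad(glide, (0, base.size - glide.size))

# Dichotic presentation: base in the left ear, transition in the right.
stereo = np.stack([base, glide], axis=1).astype(np.float32)
wavfile.write("dichotic_duplex_demo.wav", FS, stereo)
```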

A striking characteristic of speech perceptual systems that integrate syllable fragments presented to different ears is their imperviousness to the information, carried by the spatial separation of the fragments, that the fragments cannot possibly be parts of the same spoken syllable: an instance, perhaps, of informational encapsulation.

In recent work, Mattingly and Liberman (in press) have revised, or expanded on, Fodor's view of modules by proposing a distinction between "closed" and "open" modules. Closed modules, including the speech module and a sound-localization module, for example, are narrowly specialized as Fodor has characterized modules more generally. In addition (among other special properties), they yield heteromorphic percepts, that is, percepts whose dimensions are not those of the proximal stimulation. Although Mattingly and Liberman characterize the heteromorphic percept in this way (in terms of what it does not conform to), it appears that the heteromorphic percept can be characterized in a more positive way as well. The dimensions of heteromorphic percepts are those of distal events, not of proximal stimulation. The speech module renders phonetic gestures; the sound-localization module renders location in space. By contrast, open modules are sort of "everything-else" perceptual systems.

Figure 2. Stimuli that yield duplex perception (left panel: normal syllables; right panel: duplex-producing syllables, a base plus "d" or "g" transitions). The base is presented to one ear and the third formants to another. In the ear to which the base is presented, listeners hear the syllable specified jointly by the base and the transitions; in the other ear, they hear the transitions as frequency glides. (Figure adapted from Whalen & Liberman, 1987.)

An open auditory-perceptual module is responsible for perception of most sounds in the environment. According to the theory, outputs of open modules are homomorphic.

In the context of this account of auditory perception, the conditions under which duplex perception is studied are seen as somehow tricking the open module into providing a percept of the isolated formant transition even though the transition is also being perceived by the speech module. Accordingly, two percepts are provided for one acoustic fragment; one percept is homomorphic and the other is heteromorphic.

Prospectus

Our brief overview of the motor theory obviously cannot do justice to it. In our view, it is, to date, superior to other theories of speech perception in at least two major respects: in its ability to handle the full range of behavioral findings on speech perception (in particular, of course, the evidence that listeners recover phonetic gestures) and in having developed its account of speech in the context of a more general theory of biological specializations for perception.

Our purpose here, however, is not just to praise the theory, but to challenge it as well, with the further aim of provoking the motor theorists either to buttress their theory where it appears to us vulnerable, or else to revise it further.

We will raise three general questions from the perspective of our own, direct-realist theory (Fowler, 1986a,b; Rosenblum, 1987). First, we question the inference, from evidence that listeners recover phonetic gestures, that the listener's own speech-motor system plays a role in perception. The nature of the challenge we mount to this inference leads to a second one. We question the idea that, whereas a specialized speech module (and other closed modules) render heteromorphic percepts, other percepts are homomorphic. Finally, we challenge the idea in any case that duplex perception reveals that speech perception is achieved by a closed module.

Standing behind all of these specific questions we raise about claims of the motor theory is a general issue that needs to be confronted by all of us who study speech perception and, for that matter, perception more generally. The issue is one of determining when behavioral data warrant inferences being drawn about perceptual processes taking place inside perceivers and when the data deserve accounting instead in terms of the nature of events taking place publicly when something is perceived.

Does perceptual recovery of phonetic gestures implicate the listener's speech-motor system?

In our view, the evidence that perceivers recover phonetic gestures in speech perception is incontrovertible5 and any theory of speech perception is inadequate unless it can provide a unified account of those findings. However, the motor theorists have drawn an inference from these findings that, we argue, is not warranted by the general observation that listeners recover gestures. The inference is that recovery of gestures implies access by the perceiver to his own speech-motor system. It is notable, perhaps, that in neither "Perception of the speech code" nor "The motor theory revised" do Liberman and his colleagues offer any evidence in support of this claim except evidence that listeners recover gestures (and that human left-cerebral hemispheres are specialized for speech and especially for phonetic perception [e.g., Kimura, 1961; Liberman, 1974; Studdert-Kennedy & Shankweiler, 1970]).

There is another way to explain why listeners recover phonetic gestures. It is that phonetic gestures are among the "distal events" that occur when speech is perceived and that perception universally involves recovery of distal events from information in proximal stimulation.

Distal events universally are perceptual objects: Proximal stimuli universally are not.

Consider first visual perception observed from outside the perceiver.6 Visual perceivers recover properties of objects and events in their environment ("distal events"). They can do so, in part, because the environment supplies information about the objects and events in a form that their perceptual systems can use. Light reflects from objects and events, which structure it lawfully; given a distal event and light from some source, the reflected light must have the structure that it has. To the extent that the structure in the light is also specific to the properties of a distal event that caused it, it can serve as information to a perceiver about its distal source. The reflected light ("proximal stimulation") has another property that permits it its central role in perception. It can stimulate the visual system of a perceiver and thereby impart its structure to it. From there, the perceiver can use the structure as information for distal-event perception.

The reflected light does not provide information to the visual system by picturing the world. Information in reflected light for "looming" (that is, for an object on a collision course with the perceiver's head), for example, is a certain manner of expansion of the contours of the object's reflection in the light that progressively covers the contours of optical reflections of immobile parts of the perceiver's environment. When an object looms, it does not grow; it approaches. However, its optical reflection grows, and, confronted with such an optic array, perceivers (from fiddler crabs to kittens to rhesus monkeys to humans [Schiff, 1965; Schiff, Caviness, & Gibson, 1962]) behave as if they perceive an object on a collision course; that is, they try to avoid it.

Two related conclusions from this characterization of visual perception are, first, that observers see distal events based on information about them in proximal stimulation and, second, that, in Mattingly and Liberman's terms, visual perception therefore is quite generally heteromorphic. It is not merely heteromorphic in respect to those aspects of stimulation handled by closed modules (for example, one that recovers depth information from binocular disparity); it is generally the case that the dimensions of the percept correspond with dimensions of distal objects and events and not necessarily with those of a distal-event-free description of the proximal stimulation.7

Auditory perception is analogous to visual perception in its general character, viewed, once again, from outside the perceiver. Consider any sounding object, a ringing bell, for example. The ringing bell is a "distal event" that structures an acoustic signal. The structuring of the air by the bell is lawful and, to the extent that it also tends to be specific to its distal source, the structure can provide information about the source to a sensitive perceiver. Like reflected light, the acoustic signal (the proximal stimulation) in fact has two critical properties that allow it to play a central role in perception. It is lawfully structured by some distal event and it can stimulate the auditory system of perceivers, thereby imparting its structure to it. The perceiver then can use the structure as information for its source.

As for structure in reflected light, structure in an acoustic signal does not resemble the sound-producing source in any way. Accordingly, if auditory perception works similarly to visual perception (that is, if perceivers use structure in acoustic signals to recover their distal sources), then auditory percepts, like visual percepts, will be heteromorphic.

Liberman and Mattingly (1985; Mattingly & Liberman, 1988) suggest, however, that in general, auditory perceptions are homomorphic. We agree that our intuitions are less clear here than they are in the case of visual perception. However, it is an empirical question whether dimensions of listeners' percepts are better explained in terms of dimensions of distal events or of a distal-event-free description of proximal stimulation. To date the question is untested, however; for whatever reason, researchers who study auditory perception rarely study perception of natural sound-producing events (see, however, Repp, 1987; VanDerVeer, 1979; Warren & Verbrugge, 1984).

Now consider speech perception. In speech, the distal event (at least the event in the environment that structures the acoustic speech signal) is the moving vocal tract. If, as we propose, the vocal tract produces phonetic gestures, then the distal event is, at the same time, the set of phonetic gestures that compose the talker's spoken message. The proximal stimulus is the acoustic signal, lawfully structured by movement in the vocal tract. To the extent that the structure in the signal also tends to be specific to the events that caused it, it can serve as information about those events to sensitive perceivers. The information that proximal stimulation provides will be about the phonetic gestures of the vocal tract. Accordingly, if speech perception works like visual perception, then recovery of phonetic gestures is not eccentric and does not require eccentric processing by a speech module. It is, instead, yet another instance of recovery of distal events by means of lawfully-generated structure in proximal stimulation.

The general point we hope to make is that, arguably, all perception is heteromorphic, with dimensions of percepts always corresponding to those of distal events, not to distal-event-free descriptions of proximal stimuli. Speech is not special in that regard. A more specific point is that even if evidence were to show that speech perceivers do access their speech-motor systems, that perceptual process would not be needed to provide the reason why listeners' percepts are heteromorphic. The reason percepts are heteromorphic is that perceivers universally use proximal stimuli as information about events taking place in the world; they do not use them as perceptual objects per se.

Are phonetic gestures public or private?

Although in "Perception of the speech code" and"The motor theory revised," evidence thatlisteners recover gestures is the only evidencecited in favor of the view that perceivers accesstheir speech motor systems, that evidence is notthe only reason why the motor theorists and othertheorists invoke a construct inside the perceiverrather than the proximal stimulation outside toexplain why the percept has the character it does.A very important reason why, for the motortheorists, the proximal stimulation is not by itselfsufficient to specify phonetic gestures is that, intheir view, phonetic gestures are abstract controlstructures corresponding to the speakersintentions, but not to the movements actuallytaking place in the vocal tract. If phoneticgestures aren't "out there" in the vocal tract, thenthey cannot be analogous to other distal events,because they cannot, themselves, lawfullystructure the acoustic signal.

In our view, this characterization of phonetic gestures is mistaken, however. We can identify two considerations that appear to support it, but we find neither convincing. One is that any gesture of the vocal tract is merely a token action. Yet perceivers do not just recognize the token; they recognize it as a member of a larger linguistically-significant category. That seems to localize the thing perceived in the mind of the perceiver, not in the mouth of the talker. More than that, the same collections of token gestures may be identified as tokens of different categories by speakers of different languages. (So, for example, speakers of English may identify a voiceless unaspirated stop in stressed syllable-initial position as a /b/, whereas speakers of languages in which voiceless unaspirated stops can appear stressed-syllable initially may identify it as an instance of a /p/.) Here, it seems, the information for category membership cannot possibly be in the gestures themselves or in the proximal stimulation; it must be in the head of the perceiver. The second consideration is that coarticulation, by most accounts, prevents nondestructive realization of phonetic gestures in the vocal tract. We briefly address both considerations.

Yet another analogy: There are chairs in the world that do not look very much like prototypical chairs. Recognizing them as chairs may require learning how people typically use them (learning their "proper function," in Millikan's terms [1984]). By most accounts, learning involves some enduring change inside the perceiver. Notice, however, that even if it does, what makes the token chair a chair remains its properties and its use in the world prototypically as a chair. Too, whatever perceivers may learn about that chair and about chairs in general is only what they learn; the chair itself and the means by which its type-hood can be identified remain unquestionably out there in the world. Phonetic gestures and phonetic segments are like chairs (in this respect). Token instances of bilabial closure are members of a type because the tokens all are products of a common coupling among jaw and lips realized in the vocal tract of talkers who achieve bilabial closure. Instances of bilabial closure in stressed-syllable-initial position that have a particular timing relation to a glottal opening gesture are tokens of a phonological category, /b/, in some languages and of a different category, /p/, in others because of the different ways that they are deployed by members of the different language communities. That differential deployment is what allowed descriptive linguists to identify members of phonemic categories as such, and presumably it is also what allows language learners to acquire the phonological categories of their native language. By most accounts, when language learners discover the categories of their language, the learning involves enduring changes inside the learner. However, even if it does, it is no more the case that the phonetic gestures or the phonetic segments move inside the mind than it is that chairs move inside when we learn how to recognize them as such. What we have learned is what we know about chairs and phonetic segments; it is not the chairs or the phonetic segments themselves. They remain outside.

Turning to coarticulation, it is described in the motor theory as "encoding," by Ohala (e.g., 1981) as "distortion," by Daniloff and Hammarberg (1973) as "assimilation," and by Hockett (1955) as "smashing" and "rubbing together" of phonetic segments (in the way that raw eggs would be smashed and rubbed together were they sent through a wringer). None of these characterizations is warranted, however. Coarticulation may instead be characterized as gestural layering: a temporally staggered realization of gestures that sometimes do and sometimes do not share one or more articulators.

In fact, this kind of gestural layering occurs commonly in motor behavior. When someone walks, the movement of his or her arm is seen as pendular. However, the surface movement is a complex (layered) vector including not only the swing of the arm, but also movement of the whole body in the direction of locomotion. This layering is not described as "encoding," "distortion," or even as assimilation of the arm movement to the movement of the body as a whole. And for good reason; that is not what it is. The movement reflects a convergence of forces of movement on a body segment. The forces are separate for the walker, information in proximal stimulation allows their parsing (Johansson, 1973), and perceivers detect their separation.

There is evidence already suggesting that at least some of coarticulation is gestural layering (Carney & Moll, 1971; Öhman, 1966; also see Browman & Goldstein, in press a), not encoding or distortion or assimilation. There is also convincing evidence that perceivers recover separate gestures more-or-less in the way that Johansson suggests they recover separate sources of movement of body segments in perception of locomotion. Listeners use information for a coarticulating segment that is present in the domain of another segment as information for the coarticulating segment itself (e.g., Fowler, 1984; Mann, 1980; Whalen, 1984); they do not hear the coarticulated segment as assimilated or, apparently, as distorted or encoded (Fowler, 1981; 1984; Fowler & Smith, 1986).

Our colleagues Catherine Browman and Louis Goldstein (1985, 1986, in press a,b) have proposed that the phonetic primitives of languages are gestural, not abstractly featural. Our colleague Elliot Saltzman (1986; Saltzman & Kelso, 1987; see also Kelso, Saltzman, & Tuller, 1986) is developing a model that implements phonetic gestures as functional couplings among the articulators and that realizes the gestural layering characteristic of coarticulation. To the extent that these approaches both succeed, they will show that phonetic gestures, speakers' intentions, can be realized in the vocal tract nondestructively, and hence can structure acoustic signals directly.

Do listeners need an innate vocal tract synthesizer to recognize acoustic reflections of phonetic gestures? Although it might seem to help, it cannot be necessary, because there is no analogous way to explain how observers recognize most distal events from their optical reflections. Somehow the acoustic and optical reflections of a source must identify the source on their own. In some instances, we begin to understand the means by which acoustic patternings can specify their gestural sources. We consider one such instance next.

How acoustic structure may serve as information for gestures

We return to the example previously described of listeners' perception of those linguistic dimensions of an utterance that are cued in some way by variation in f0. A variety of linguistic and paralinguistic properties of an utterance have converging effects on f0. Yet listeners pull apart those effects in perception.

What guides the listeners' factoring of converging effects on f0? Presumably, it is the configuration of acoustic products of the several gestures that have effects, among others, on f0. Intonational peaks are local changes in an f0 contour that are effected by means that, to a first approximation, only affect f0; they are produced, largely, by contraction and relaxation of muscles that stretch or shorten the vocal folds (e.g., Ohala, 1978). In contrast, declination is a global change in f0 that, excepting the initial peak in a sentence, tracks the decline in subglottal pressure (Gelfer et al., 1985; Gelfer et al., 1987). Subglottal pressure affects not only f0, but amplitude as well, and several researchers have noticed that amplitude declines in parallel with f0 and resets when f0 resets at major syntactic boundaries (e.g., Breckenridge, 1977; Maeda, 1976). The parallel decline in amplitude and f0 constitutes information that pinpoints the mechanism behind the f0 decline: gradual lung deflation, incompletely offset by expiratory-muscle activity. That mechanism is distinct from the mechanism by which intonational peaks are produced. Evidence that listeners pull apart the two effects on f0 (Pierrehumbert, 1979; Silverman, 1987) suggests that they are sensitive to the distinct gestural sources of these effects on f0.
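(The parallel decline of f0 and amplitude that this account turns on can be checked on recorded speech with quite simple measurements. The sketch below is our own rough illustration, framewise RMS plus a crude autocorrelation pitch estimate, not an analysis used in the studies cited.)

```python
import numpy as np

def contours(x, fs, frame_ms=40, hop_ms=10, fmin=60, fmax=400):
    """Framewise RMS amplitude and a crude autocorrelation f0 estimate."""
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    f0s, rms = [], []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame] * np.hanning(frame)
        rms.append(np.sqrt(np.mean(w ** 2)))
        ac = np.correlate(w, w, mode="full")[frame - 1:]  # lags 0..frame-1
        lo, hi = int(fs / fmax), int(fs / fmin)
        lag = lo + int(np.argmax(ac[lo:hi]))  # strongest periodicity
        f0s.append(fs / lag)
    return np.array(f0s), np.array(rms)
```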

By the same token, f0 perturbations due to height differences among vowels are not confused by listeners with information for intonational peak height, even though f0 differences due to vowel height are local, like intonational peaks, and are similar in magnitude to differences among intonational peaks in a sentence (Silverman, 1987). The mechanisms for the two effects on f0 are different, and, apparently, listeners are sensitive to that. Honda (1981) shows a strong correlation between activity of the genioglossus muscle, active in pulling the root of the tongue forward for high vowels, and intrinsic f0 of vowels. Posterior fibers of the genioglossus muscle insert into the hyoid bone of the larynx. Therefore, contraction of the genioglossus may pull the hyoid forward, rotating the thyroid cartilage to which the vocal folds attach, and thereby may stretch the vocal folds. Other acoustic consequences of genioglossus contraction, of course, are changes in the resonances of the vocal tract, which reflect movement of the tongue. These changes, along with those in f0 (and perhaps others as well), pinpoint a phonetic gesture that achieves a vowel-specific change in vocal-tract shape. If listeners can use that configuration of acoustic reflections of tongue movement (or, more likely, of coordinated tongue and jaw movement) to recover the vocalic gesture, then they can pull effects on f0 of the vocalic gesture from those for the intonation contour that cooccur with them.

Listeners do just that. In sentence pairs such as "They only feast before fasting" and "They only fast before feasting," with intonational peaks on the "fVst" syllables, listeners require a higher peak on "feast" in the second sentence than on "fast" in the first sentence in order to hear the first peak of each sentence as higher than the second (Silverman, 1987). Compatibly, among steady-state vowels on the same f0, more open vowels sound higher in pitch than more closed vowels (Stoll, 1984). Intrinsic f0 of vowels does not contribute to perception of an intonation contour or to perception of pitch. But it is not thrown away by perceivers either. Rather, along with spectral information for vowel height, it serves as information for vowel height (Reinholt Peterson, 1986).

We will not review the literature on listeners' use of f0 perturbations due to obstruent voicing except to say that it reveals the same picture of the perceiver as the literature on listeners' use of information for vowel height (for a description of the f0 perturbations: Ohde, 1984; for studies of listeners' use of the perturbations: Abramson & Lisker, 1985; Fujimura, 1971; Haggard, Ambler, & Callow, 1970; Silverman, 1986; for evidence that listeners can detect the perturbations when they are superimposed on intonation contours: Silverman, 1986). As the motor theory and the theory of direct perception both claim, listeners' percepts do not correspond to superficial aspects of the acoustic signal. They correspond to gestures, signaled, we propose, by configurations of acoustic reflections of those gestures.

Does duplex perception reveal a closed speech module?

We return to the phenomenon of duplex perception and consider whether it does convincingly reveal distinct closed and open modules for speech perception and general auditory perception, respectively. As noted earlier, duplex perception is obtained, typically, when most of the acoustic structure of a synthetic syllable is presented to one ear, and the remainder (usually a formant transition) is presented to the other ear (refer to Figure 2). In such instances, listeners hear two things. In the ear that gets most of the signal, they hear a coherent syllable, the identity of which is determined by the transition presented to the other ear. At the same time, they hear a distinct, non-speech 'chirp' in the ear receiving the transition. The percept is duplex: the transition is heard as a critical part of a speech syllable, hypothetically as a result of its being processed by the speech module, and it is heard simultaneously as a non-speech chirp, hypothetically as a result of its being processed also by an open auditory module (Liberman & Mattingly, 1985). Here we offer a different interpretation of the findings.

Whalen and Liberman (1987) have recently shown that duplex perception can occur with monaural or diotic presentation of the base and transition of a syllable. In this case, duplexity is attained by increasing the intensity of the third formant transition relative to the base until listeners hear both an integrated syllable (/da/ or /ga/ depending on the transition) and a non-speech 'whistle' (sinusoids were used for transitions). In the experiment, subjects first were asked to label the isolated sinusoidal transitions as "da" or "ga." Although they were consistent in their labeling, reliably identifying one whistle as "da" and the other as "ga," their overall accuracy was not greater than chance. About half the subjects were consistently right and the remainder were consistently wrong. The whistles are distinct, but they do not sound like "da" or "ga." Next, Whalen and Liberman determined 'duplexity thresholds' for listeners. They presented the base and one of the sinusoids simultaneously and gave listeners control over the intensity of the sinusoid. Listeners adjusted its intensity to the point where they just heard a whistle. At threshold, subjects were able to match these duplex sinusoids to sinusoids presented in isolation. Finally, subjects were asked to identify the integrated speech syllables as "da" or "ga" at sinusoid intensities both 6 dB above and 4 dB below the duplexity threshold. Subjects were consistently good at these tasks, yielding accuracy scores well above 90%.

In the absence of any transition, listeners hear only the base and identify it as "da" most of the time. When a sinusoidal transition is present but at intensities below the duplexity threshold, subjects hear only the unambiguous syllable ("da" or "ga" depending on the transition). Finally, when the intensity of the transition reaches and exceeds the duplexity threshold, subjects hear the "da" or "ga" and they hear a whistle at the same time: i.e., the transition is duplexed.

This experiment reveals two new aspects of the duplex phenomenon. One is that getting a duplex percept requires a sufficiently high intensity of the transition. A second is that the transition integrates with the syllable at intensities below the duplexity threshold. Based on this latter finding, Whalen and Liberman conclude that processing of the sinusoid as speech has priority. It is as if a (neurally-encoded) acoustic signal must first pass through the speech module, at which point portions of the signal that specify speech events are peeled off. After the speech module takes its part, any residual is passed on to the auditory module, where it is perceived homomorphically. Mattingly and Liberman (1988) refer to this priority of speech processing as "preemptiveness," and Whalen and Liberman (1987) suggest that it reflects the "profound biological significance of speech."

There is another way to look at these findings, however. They suggest that duplex perception does not, in fact, involve the same acoustic fragment being perceived in two ways simultaneously. Rather, part of the transition integrates with the syllable and the remainder is heard as a whistle or chirp.8 As Whalen and Liberman themselves describe it:

". . . the phonetic mode takes precedence inprocessing the transitions, using them for its speciallinguistic purposes until, having appropriated itsshare, it passes the remainder to be perccived bythe nonspeech system as auditory whistles."(Whalen & Liberman, 1987, p. 171; our italics).

This is important, because in earlier reports of duplex perception, it was the apparent perception of the transition in two different ways at once that was considered strong evidence favoring two distinct perceptual systems, one for speech and one for general auditory perception. In addition, research to date has only looked for preemptiveness using speech syllables. Accordingly, it is premature to conclude that speech especially is preemptive. Possibly acoustic fragments integrate preferentially whenever the integrated signal specifies some coherent sound-producing event.

We have recently looked for duplex perception in perception of nonspeech sounds (Fowler and Rosenblum, in press). We predicted that it would be possible to observe duplex perception and preemptiveness whenever two conditions are met: 1) a pair of acoustic fragments is presented that, integrated, specify a natural distal event; and 2) one of the fragments is unnaturally intense. Under these conditions, the integrated event should be preemptive and the intense fragment should be duplexed regardless of the type of natural sound-producing event that is involved, whether it is speech or non-speech, and whether it is profoundly biologically significant or biologically trivial.

There have been other attempts to get duplex perception for nonspeech sounds. All the ones of which we are aware have used musical stimuli, however (e.g., Collins, 1985; Pastore, Schmuckler, Rosenblum, & Szczesuil, 1983). We chose not to use musical stimuli because it might be argued that there is a music module. (Music is universal among human cultures, and there is evidence for an anatomical specialization of the brain for music perception [e.g., Shapiro, Grossman, & Gardner, 1981].) These considerations led us to choose a non-speech event that evolution could not have anticipated. We chose an event involving a recent human artifact: a slamming metal door.

To generate our stimuli, we recorded a heavy metal door (of a sound-attenuating booth) being slammed shut. A spectrogram of this sound can be seen in Figure 3a. To produce our 'chirp,' we high-pass filtered the signal above 3000 Hz. To produce a 'base,' we low-pass filtered the original signal, also at 3000 Hz (see bottom panels of Figure 3). To us, the high-passed 'chirp' sounded like a can of rice being shaken, while the low-pass-filtered base sounded like a wooden door being slammed shut. (That is, the clanging of the metal door was largely absent.)
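(The filtering step can be reproduced with standard tools. A minimal sketch follows; the Butterworth design, the fourth-order choice, the mono-recording assumption, and the filename are our assumptions, since the text specifies only the 3000-Hz cutoff.)

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

# Load the recorded slam (hypothetical filename, assumed mono);
# scale the samples to floats in [-1, 1].
fs, door = wavfile.read("metal_door_slam.wav")
door = door.astype(np.float64) / np.max(np.abs(door))

# Fourth-order Butterworth filters split the signal at 3000 Hz.
sos_lo = butter(4, 3000, btype="lowpass", fs=fs, output="sos")
sos_hi = butter(4, 3000, btype="highpass", fs=fs, output="sos")

base = sosfiltfilt(sos_lo, door)   # the 'wooden door' base (< 3 kHz)
chirp = sosfiltfilt(sos_hi, door)  # the shaking-sound 'chirp' (> 3 kHz)
```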

We asked sixteen listeners to identify the original metal door, the base, and the chirp. The modal identifications of the metal door and the base included mention of a door; however, less than half the subjects reported hearing a door slam. Even so, essentially all of the identifications involved hard collisions of some sort (e.g., boots clomping on stairs, a shovel banged on a sidewalk). In contrast, no subject identified the chirp as a door sound, and no identifications described hard collisions. Most identifications of the chirp referred to an event involving shaking (tambourine, maracas, castanets, keys).

Figure 3. Display of stimuli used to obtain duplex perception of closing-door sounds (spectrogram of the original slam, with spectra of the low-pass 'base' and high-pass 'chirp' below).

Given our metal-door chirp and a high-pass-filtered wooden door slam, subjects could not identify which was a filtered metal door slam and which a filtered wooden door slam. Subjects were consistent in their labeling judgments, identifying one of the chirps as a metal door and the other as a wooden door. However, overall, more of them were consistently wrong than right. On average, they identified the metal-door chirp as the sound of a metal door 31% of the time and the wooden-door chirp as a metal door sound 79% of the time.9

To test for duplex perception and preemptiveness, we first trained subjects to identify the unfiltered door sound as a "metal door," the base as a "wooden door," and the upper frequencies of the door as a shaking sound. Next we tested them on stimuli created from the base and the chirp. We created 15 different diotic stimuli. All included the base, and almost all included the metal-door chirp. The stimuli differed in the intensity of the chirp. The chirp was attenuated or amplified by multiplying its digitized voltages by the following values: 0, .05, .1, .15, .2, .9, .95, 1, 1.05, 1.1, 4, 4.5, 5, 5.5, and 6. That is, there were 15 different intensities falling into three ranges: five were well below the natural intensity relationship of the chirp to the base, five were in the range of the natural intensity relation, and five were well above it. Three tokens of each of these stimuli were presented to subjects diotically in a randomized order. Listeners were told that they might hear one of the stimuli (metal door, wooden door, or shaking sound), or sometimes two of them simultaneously, on each trial. They were to indicate what they heard on each trial by writing an identifying letter or pair of letters on their answer sheets.
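Continuing the sketch above, the fifteen test signals can be produced by scaling the chirp's samples and re-adding them to the unmodified base; a gain of 1 corresponds to the chirp's natural intensity relation to the base (the variable names are ours).

    # Build the 15 diotic stimuli by scaling the chirp and mixing with the base
    # (continues the earlier sketch; 'base' and 'chirp' are the filtered signals).
    GAINS = [0, .05, .1, .15, .2,    # well below the natural intensity relation
             .9, .95, 1, 1.05, 1.1,  # in the range of the natural relation
             4, 4.5, 5, 5.5, 6]      # well above it

    stimuli = [base + g * chirp for g in GAINS]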

In our analyses, we have grouped responses to the 15 stimuli into three blocks of five. In Figure 4, we have labeled these intensity conditions Low, Middle, and High. Figure 4 presents the results as percentages of responses in the various response categories across the three intensity conditions. We show only the three most interesting (and most frequent) responses. The figure shows that the most frequent response for the low-intensity condition is 'wooden door,' the label we asked subjects to use when they heard the base. The most frequent response for the middle condition is 'metal door,' the label we asked subjects to use when they heard the metal door slam. The preferred response for the high-intensity block of stimuli is overwhelmingly 'metal door + chirp,' the response that indicates a duplex percept. The changes in response frequency over the three intensity conditions for each response type are highly significant.
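The blocking of responses is simple to state in code. The sketch below continues under the same assumptions; the 'responses' list and its labels are placeholders standing in for the answer sheets, not the authors' actual records.

    # Tally response percentages within each intensity block of five stimuli.
    from collections import Counter

    # Placeholder data: responses[i] holds the labels collected for stimulus i
    # (16 listeners x 3 tokens = 48 labels per stimulus in the study).
    responses = [["wooden door"] * 48 for _ in range(15)]

    blocks = {"low": range(0, 5), "middle": range(5, 10), "high": range(10, 15)}
    for name, indices in blocks.items():
        tally = Counter(label for i in indices for label in responses[i])
        total = sum(tally.values())
        print(name, {lab: round(100 * n / total) for lab, n in tally.items()})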



[Figure 4 graphic: bar chart of response frequency (%) in the categories 'Wooden Door,' 'Metal Door,' and 'Metal Door + Chirp' across the Low, Middle, and High intensity conditions.]

Figure 4. Percentage of responses falling in the three response categories, "wooden door," "metal door," and "metal door plus shaking sound," across three intensity ranges of the shaking sound.

Our results can be summarized as follows. First, at very low intensities of the upper frequencies of the door, subjects hear the base only. When the 'chirp' is amplified to an intensity at or near its natural intensity relation to the base, subjects report hearing a metal door the majority of the time. Further amplification of the 'chirp' leads to reports of the metal door and a separate shaking sound. The percept is duplex, and the metal door slam is preemptive.

There are several additional tests that we must run to determine whether our door slams are, in fact, perceived analogously to speech syllables in procedures revealing duplex perception. If we can show that they are, then we will conclude that an account of our findings that invokes a closed module is inappropriate. Evolution is unlikely to have anticipated metal door slams, and metal-door slams aren't profoundly biologically significant. We suggest, alternatively, that preemptiveness occurs when a chirp fills a "hole" in a simultaneously presented acoustic signal so that, together, the two parts of the signal specify some sound-producing distal event. If anything is left over after the hole is filled, the remainder is heard as separate.

Summary and concluding remarks

We have raised three challenges to the motor theory. We challenge the motor theorists' inference, from evidence that phonetic gestures are perceived, that speech perception involves access to the talker's own motor system. The basis for our challenge is a claim that dimensions of percepts always conform to those of distal events, even in cases where access to an internal synthesizer for the events is unlikely. A second, related, challenge is to the idea that only some percepts are heteromorphic, namely those for which we have evolved closed modules. When Liberman and Mattingly write that speech perception is heteromorphic, they mean heteromorphic with respect to structure in proximal stimulation, but they always mean as well that the percept is homomorphic with respect to dimensions of the distal source of the proximal stimulation. We argue that percepts are generally heteromorphic with respect to structure in proximal stimulation, but, whether they are or not, they are always homomorphic with respect to dimensions of distal events. Finally, we challenge the interpretation of duplex perception that ascribes it to simultaneous processing of one part of an acoustic signal by two modules. We suggest, instead, that duplex perception reflects the listener's parsing of acoustic structure into disjoint parts that specify, insofar as the acoustic structure permits, coherent distal events.

Where (in our view) does this leave the motor theory? It is fundamentally right in its claim that listeners perceive phonetic gestures, and also, possibly, in its claim that humans have evolved neural systems specialized for perception and production of phonetic gestures. It is wrong, we believe, specifically in its claims about what those specialized systems do, and generally in the view that closed modules must be invoked to explain why distal events are perceived.

Obviously, we prefer our own, direct-realist, theory, not so much because it handles the data better, but because, in our view, it fits better in a universal theory of perception. But however our theory may be judged in relation to the motor theory, we recognize that we would not have developed it at all in the absence of the important discoveries of the motor theorists that gestures are perceived.

REFERENCES

Abramson, A., & Lisker, L. (1985). Relative power of cues: F0 shift versus voice timing. In V. Fromkin (Ed.), Phonetic linguistics: Essays in honor of Peter Ladefoged (pp. 25-32). Orlando, FL: Academic Press.
Breckenridge, J. (1977). Declination as a phonological process. Bell Laboratories Technical Memo, Murray Hill, NJ.
Bregman, A. (1987). The meaning of duplex perception. In M. E. H. Schouten (Ed.), The psychophysics of speech perception (pp. 95-111). Dordrecht: Martinus Nijhoff.
Browman, C., & Goldstein, L. (1985). Dynamic modeling of phonetic structure. In V. Fromkin (Ed.), Phonetic linguistics: Essays in honor of Peter Ladefoged (pp. 35-53). Orlando, FL: Academic Press.
Browman, C., & Goldstein, L. (1986). Towards an articulatory phonology. In C. Ewan & J. Anderson (Eds.), Phonology Yearbook, 3 (pp. 219-254). Cambridge: Cambridge University Press.
Browman, C., & Goldstein, L. (in press a). Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston & M. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and the physics of speech. Cambridge: Cambridge University Press.
Browman, C., & Goldstein, L. (in press b). Gestural structures and phonological patterns. In I. G. Mattingly & M. Studdert-Kennedy (Eds.), Modularity and the motor theory of speech perception. Hillsdale, NJ: Lawrence Erlbaum Associates.
Carney, P., & Moll, K. (1971). A cinefluorographic investigation of fricative consonant-vowel coarticulation. Phonetica, 23, 193-201.
Collins, S. (1985). Duplex perception with musical stimuli: A further investigation. Perception and Psychophysics, 38, 172-177.
Cooper, F., Delattre, P., Liberman, A., Borst, J., & Gerstman, L. (1952). Some experiments on the perception of synthetic speech sounds. Journal of the Acoustical Society of America, 24, 597-606.
Daniloff, R., & Hammarberg, R. (1973). On defining coarticulation. Journal of Phonetics, 1, 239-248.
Fitch, H., Halwes, T., Erickson, D., & Liberman, A. (1980). Perceptual equivalence of two acoustic cues for stop consonant manner. Perception and Psychophysics, 27, 343-350.
Fodor, J. (1983). The modularity of mind. Cambridge, MA: MIT Press.
Folkins, J., & Abbs, J. (1975). Lip and jaw motor control during speech: Responses to resistive loading of the jaw. Journal of Speech and Hearing Research, 18, 207-220.
Folkins, J., & Abbs, J. (1976). Additional observations on responses to resistive loading of the jaw. Journal of Speech and Hearing Research, 19, 820-821.
Fowler, C. (1981). Production and perception of coarticulation among stressed and unstressed vowels. Journal of Speech and Hearing Research, 46, 127-139.
Fowler, C. (1984). Segmentation of coarticulated speech in perception. Perception and Psychophysics, 36, 359-368.
Fowler, C. (1986a). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3-28.
Fowler, C. (1986b). Reply to commentators. Journal of Phonetics, 14, 149-170.
Fowler, C., & Rosenblum, L. D. (in press). Duplex perception: A comparison of monosyllables and slamming of doors. Journal of Experimental Psychology: Human Perception and Performance.
Fowler, C., & Smith, M. (1986). Speech perception as "vector analysis": An approach to the problems of segmentation and invariance. In J. Perkell & D. Klatt (Eds.), Invariance and variability of speech processes (pp. 123-135). Hillsdale, NJ: Lawrence Erlbaum Associates.
Fujimura, O. (1971). Remarks on stop consonants: Synthesis experiments and acoustic cues. In L. Hammerich, R. Jakobson, & E. Zwirner (Eds.), Form and substance (pp. 221-232). Copenhagen: Akademisk Forlag.
Gelfer, C., Harris, K., Collier, R., & Baer, T. (1985). Is declination actively controlled? In I. Titze (Ed.), Vocal-fold physiology: Physiology and biophysics of the voice. Iowa City: Iowa University Press.
Gelfer, C., Harris, K., & Baer, T. (1987). Controlled variables in sentence intonation. In T. Baer, C. Sasaki, & K. Harris (Eds.), Laryngeal function in phonation and respiration (pp. 422-432). Boston: College-Hill Press.
Gibson, J. J. (1966). The senses considered as perceptual systems. Boston: Houghton Mifflin.
Gibson, J. J. (1979). The ecological approach to visual perception. Boston: Houghton Mifflin.
Haggard, M., Ambler, S., & Callow, M. (1970). Pitch as a voicing cue. Journal of the Acoustical Society of America, 47, 613-617.
Hockett, C. (1955). Manual of phonology (Publications in anthropology and linguistics, No. 11). Bloomington: Indiana University Press.
Honda, K. (1981). Relationship between pitch control and vowel articulation. In D. Bless & J. Abbs (Eds.), Vocal-fold physiology (pp. 286-297). San Diego: College-Hill Press.
Jakobson, R., Fant, C. G. M., & Halle, M. (1951). Preliminaries to speech analysis: The distinctive features and their correlates. Cambridge, MA: MIT Press.
Johansson, G. (1973). Visual perception of biological motion. Perception and Psychophysics, 14, 201-211.
Kelso, J. A. S., Tuller, B., Vatikiotis-Bateson, E., & Fowler, C. (1984). Functionally specific articulatory cooperation following jaw perturbations during speech: Evidence for coordinative structures. Journal of Experimental Psychology: Human Perception and Performance, 10, 812-832.
Kelso, J. A. S., Saltzman, E., & Tuller, B. (1986). The dynamical perspective on speech production: Data and theory. Journal of Phonetics, 14, 29-59.
Kimura, D. (1961). Cerebral dominance and the perception of verbal stimuli. Canadian Journal of Psychology, 15, 166-171.
Lehiste, I. (1982). Some phonetic characteristics of discourse. Studia Linguistica, 36, 117-130.
Lehiste, I., & Peterson, G. (1961). Some basic considerations in the analysis of intonation. Journal of the Acoustical Society of America, 33, 419-425.
Liberman, A. (1974). The specialization of the language hemisphere. In F. O. Schmitt & F. G. Worden (Eds.), The neurosciences: Third study program (pp. 43-56). Cambridge, MA: MIT Press.
Liberman, A. (1982). On finding that speech is special. American Psychologist, 37, 148-167.
Liberman, A., Cooper, F., Harris, K., & MacNeilage, P. (1963). A motor theory of speech perception. Proceedings of the Speech Communication Seminar, paper D3. Stockholm: Royal Institute of Technology.
Liberman, A., Cooper, F., Shankweiler, D., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Liberman, A., Isenberg, D., & Rakerd, B. (1981). Duplex perception of cues for stop consonants: Evidence for a phonetic mode. Perception and Psychophysics, 30, 133-143.
Liberman, A., & Mattingly, I. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.
Löfqvist, A., Baer, T., McGarr, N. S., & Seider Story, R. (in press). The cricothyroid muscle in voicing control. Journal of the Acoustical Society of America.
MacDonald, J., & McGurk, H. (1978). Visual influences on speech perception. Perception and Psychophysics, 24, 253-257.
Maeda, S. (1976). A characterization of American English intonation. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Mann, V. (1980). Influence of preceding liquid on stop-consonant perception. Perception and Psychophysics, 28, 407-412.
Mann, V., & Liberman, A. (1983). Some differences between phonetic and auditory modes of perception. Cognition, 14, 211-235.
Massaro, D. (1987). Speech perception by ear and eye: A paradigm for psychological inquiry. Hillsdale, NJ: Lawrence Erlbaum Associates.
Mattingly, I. G., & Liberman, A. M. (1988). Specialized perceiving systems for speech and other biologically significant sounds. In G. Edelman, W. Gall, & W. Cowan (Eds.), Auditory function: The neurobiological bases of hearing (pp. 775-793). New York: Wiley.
Mattingly, I. G., & Liberman, A. M. (in press). Speech and other auditory modules. In G. Edelman, W. Gall, & W. Cowan (Eds.), Signal and sense: Local and global order in perceptual maps. New York: Wiley.
Millikan, R. (1984). Language, thought, and other biological categories. Cambridge, MA: MIT Press.
Ohala, J. (1978). Production of tone. In V. Fromkin (Ed.), Tone: A linguistic survey (pp. 5-39). New York: Academic Press.
Ohala, J. (1981). The listener as a source of sound change. In C. Masek, R. Hendrick, & M. Miller (Eds.), Papers from the parasession on language and behavior (pp. 178-203). Chicago: Chicago Linguistic Society.
Ohde, R. (1984). Fundamental frequency as an acoustic correlate of stop consonant voicing. Journal of the Acoustical Society of America, 75, 224-240.
Öhman, S. (1966). Coarticulation in VCV utterances: Spectrographic measurements. Journal of the Acoustical Society of America, 39, 151-168.
Pastore, R., Schmuckler, M., Rosenblum, L. D., & Szczesiul, R. (1983). Duplex perception with musical stimuli. Perception and Psychophysics, 33, 323-332.
Pierrehumbert, J. (1979). The perception of fundamental frequency declination. Journal of the Acoustical Society of America, 66, 363-369.
Porter, R. (1978). Rapid shadowing of syllables: Evidence for symmetry of speech perceptual and motor systems. Paper presented at the meeting of the Psychonomic Society, San Antonio.
Porter, R., & Lubker, J. (1980). Rapid reproduction of vowel-vowel sequences: Evidence for a fast and direct acoustic-motoric linkage in speech. Journal of Speech and Hearing Research, 23, 593-602.
Rand, T. (1974). Dichotic release from masking for speech. Journal of the Acoustical Society of America, 55, 678-680.
Reed, E., & Jones, R. (1982). Reasons for realism: Selected essays of James J. Gibson. Hillsdale, NJ: Lawrence Erlbaum Associates.
Reinholt Petersen, N. (1986). Perceptual compensation for segmentally-conditioned fundamental-frequency perturbations. Phonetica, 43, 31-42.
Remez, R., Rubin, P., Pisoni, D., & Carrell, T. (1981). Speech perception without traditional speech cues. Science, 212, 947-950.
Repp, B. (1987). The sound of two hands clapping: An exploratory study. Journal of the Acoustical Society of America, 81, 1100-1109.
Repp, B., Milburn, C., & Ashkenas, J. (1983). Duplex perception: Confirmation of fusion. Perception and Psychophysics, 33, 333-338.
Rosenblum, L. D. (1987). Towards an ecological alternative to the motor theory of speech perception. PAW Review (Technical Report of the Center for the Ecological Study of Perception and Action, University of Connecticut), 2, 25-29.
Saltzman, E. (1986). Task dynamic coordination of the speech articulators. In H. Heuer & C. Fromm (Eds.), Generation and modulation of action patterns (pp. 129-144) (Experimental Brain Research Series 15). New York: Springer-Verlag.
Saltzman, E., & Kelso, J. A. S. (1987). Skilled actions: A task-dynamic approach. Psychological Review, 94, 84-106.
Schiff, W. (1965). Perception of impending collision. Psychological Monographs, 79, No. 604.
Schiff, W., Caviness, J., & Gibson, J. (1962). Persistent fear responses in rhesus monkeys to the optical stimulus of "looming." Science, 136, 982-983.
Shapiro, B., Grossman, M., & Gardner, H. (1981). Selective musical processing deficits in brain-damaged populations. Neuropsychologia, 19, 161-169.
Silverman, K. (1986). F0 segmental cues depend on intonation: The case of the rise after voiced stops. Phonetica, 43, 76-92.
Silverman, K. (1987). The structure and processing of fundamental frequency contours. Unpublished doctoral dissertation, Cambridge University.
Stevens, K. (1960). Toward a model for speech recognition. Journal of the Acoustical Society of America, 32, 47-55.
Stevens, K., & Blumstein, S. (1981). The search for invariant acoustic correlates of phonetic features. In P. Eimas & J. Miller (Eds.), Perspectives on the study of speech (pp. 1-38). Hillsdale, NJ: Lawrence Erlbaum Associates.
Stevens, K., & Halle, M. (1967). Remarks on analysis by synthesis and distinctive features. In W. Wathen-Dunn (Ed.), Models for the perception of speech and visual form (pp. 88-102). Cambridge, MA: MIT Press.
Stoll, G. (1984). Pitch of vowels: Experimental and theoretical investigation of its dependence on vowel quality. Speech Communication, 3, 137-150.
Studdert-Kennedy, M. (1986). Two cheers for direct realism. Journal of Phonetics, 14, 99-104.
Studdert-Kennedy, M., & Shankweiler, D. (1970). Hemispheric specialization for speech perception. Journal of the Acoustical Society of America, 48, 579-594.
VanDerVeer, N. (1979). Ecological acoustics: Human perception of environmental sounds. Unpublished doctoral dissertation, Cornell University.
Warren, W., & Verbrugge, R. (1984). Auditory perception of breaking and bouncing events: A case study in ecological acoustics. Journal of Experimental Psychology: Human Perception and Performance, 10, 704-712.
Whalen, D. (1984). Subcategorical mismatches slow phonetic judgments. Perception and Psychophysics, 35, 49-64.
Whalen, D., & Liberman, A. (1987). Speech perception takes precedence over nonspeech perception. Science, 237, 169-171.

FOOTNOTES

*In I. G. Mattingly & M. Studdert-Kennedy (Eds.), Modularity and the motor theory of speech perception. Hillsdale, NJ: Lawrence Erlbaum Associates, in press.
†Also Dartmouth College, Hanover, New Hampshire.
††Also University of Connecticut, Storrs. Now at the University of California at Riverside, Department of Psychology.
1There is a small qualification to the claim that listeners cannot tell what contributions visible and audible information each have to their perceptual experience in the McGurk effect. Massaro (1987) has shown that effects of the video display can be reduced, but not eliminated, by instructing subjects to look at, but to ignore, the display.
2Liberman et al. identify a cipher as a system in which each unique unit of the message maps onto a unique symbol. In contrast, in a code, the correspondence between message unit and symbol is not 1:1.
3Liberman et al. propose to replace the more conventional view of the features of a phoneme (for example, that of Jakobson, Fant, & Halle, 1951) with one of features as "implicit instructions to separate and independent parts of the motor machinery" (p. 446).
4With one apparent slip on page 2: "The objects of speech perception are the intended phonetic gestures of the speaker, represented in the brain as invariant motor commands...."
5One can certainly challenge the idea that listeners recover the very gestures that occurred to produce a speech signal. Obviously there are no gestures at all responsible for most synthetic speech or for "sine-wave speech" (e.g., Remez, Rubin, Pisoni, & Carrell, 1981), and quite different behaviors underlie a parrot's or mynah bird's mimicking of speech. The claim that we argue is incontrovertible is that listeners recover gestures from speech-like signals, even those generated in some other way. (We direct realists [Fowler, 1986a,b] would also argue that "misperceptions" (hearing phonetic gestures where there are none) can occur in only a limited variety of ways, the most notable being signals produced by certain mirage-producing human artifacts, such as speech synthesizers, or by mirage-producing birds. Another possibility, however, involves signals produced to mimic those of normal speakers by speakers with pathologies of the vocal tract that prevent normal realization of gestures.)
6There are two almost orthogonal perspectives from which perception can be studied. On the one hand, investigators can focus on processes inside the perceiver that take place from the time that a sense organ is stimulated until a percept is achieved or a response is made to the input. On the other hand, they can look outside the perceiver and ask what, in the environment, the organism under study perceives, what information in stimulation to the sense organs allows perception of the things perceived, and, finally, whether the organisms in fact use the postulated information. Here we focus on this latter perspective, most closely associated with the work of James Gibson (e.g., 1966; 1979; Reed & Jones, 1982).
7It is easy to find examples in which perception is heteromorphic with respect to the proximal stimulation and homomorphic with respect to distal events (looming, for example). We can also think of some examples in which perception appears homomorphic with respect to proximal stimulation, but in the examples we have come up with, they are homomorphic with respect to the distal event as well (perception of a line drawn by a pencil, for example), and so there is no way to decide whether perception is of the proximal stimulation or of the distal event. We challenge the motor theorists to provide an example in which perception is homomorphic with structure in proximal stimulation that is not also homomorphic with distal event structure. These would provide convincing cases of proximal-stimulation perception.
8Bregman (1987) considers duplex perception to disconfirm his "rule of disjoint allocation" in acoustic scene analysis by listeners. According to the rule, each acoustic fragment is assigned in perception to one and only one environmental source. It seems, however, that duplex perception does not disconfirm the rule.
9Using a more sensitive, AXB, test, however, we have found that listeners can match the metal-door chirp, rather than a wooden-door chirp, to the metal-door slam at performance levels considerably better than chance.

