
Haskins Laboratories Status Report on Speech Research 1990, SR-103 / 104, 125-136

The Haskins Laboratories' Pulse Code Modulation (PCM) System

D. H. Whalen, E. R. Wiley, Philip E. Rubin, and Franklin S. Cooper

The Pulse Code Modulation (PCM) method of digitizing analog signals has become a standard both in digital audio and in speech research, the focus of this paper. The solutions to some problems encountered in earlier systems at Haskins Laboratories are outlined, along with general properties of A/D conversion. Specialized features of the current Haskins Laboratories system, which has also been installed at more than a dozen other laboratories, are also detailed: the Nyquist filter response; the high frequency pre-emphasis filter characteristics; the dynamic range; the timing resolution, for single and (synchronized) dual channel signals; and the form of the digitized speech files (header information, data, and label structure). While the solutions adopted in this system are not intended to be considered a standard, the design principles involved are of interest to users and creators of other PCM systems.

INTRODUCTION

The Pulse Code Modulation (PCM) system of digitizing analog waveforms, in which amplitude samples are taken at frequent, regular intervals, can accurately represent continuously varying signals as binary digital numbers (cf. Goodall, 1947). In the years since its introduction, PCM has become the standard technique for the digital sampling of analog signals for research purposes (in preference to such alternatives as delta modulation or predictive coding of various sorts; cf. Heute, 1988). PCM systems are now available for almost any computer, and the recording industry's digital CDs have surpassed analog formats in sales.

Although PCM systems are now commonplace, this has not always been the case. When Haskins Laboratories needed an interactive, multi-channel system in the mid 1960's, such systems simply were not available. A design was devised, and implemented in an unconventional way, to meet the needs of our researchers. Much of our speech research at that time was concerned with perceptual responses to different words or syllables arriving at the two ears simultaneously or with small temporal offsets. Stimulus tapes for such experiments could be made by tape splicing (a separate tape for each ear) and rerecording the signals onto a dual-track tape, but the method was both error-prone and laborious. Moreover, each change in stimulus condition (different pairings of the overlapping words, differences in relative onset time, or in relative level) required doing the whole job over again. Hence, the design objective was to store all the stimulus words in the computer, then convert them back to analog form, and bring them out in real time to a listener or, in the usual case, to a dual-track recorder in whatever combination of stimuli, offsets and levels the experimenter might choose.

The writing of this article was supported by NIH contract N01-HD-5-2910 to Haskins Laboratories. We thank Michael D'Angelo, Vincent Gulisano, Mark Tiede, Ignatius G. Mattingly, Patrick W. Nye, Tom Carrell, David B. Pisoni, and two anonymous reviewers for helpful comments. We also thank Leonard Szubowicz for the time and care spent designing and implementing the original version of the Haskins PCM software.

The system that resulted is still in use, but its very singularity makes it mostly of historical interest. Certain aspects of that system, however, are incorporated into newer systems based on current, commercially available hardware. These newer systems are in place at Haskins Laboratories and at more than a dozen other sites in the United States and abroad. These will be described in detail so that current and future users of Haskins-based systems can have easy reference to them, and so that designers of other systems can see the reasoning that went into the choices made. The basic principles of A/D conversion will be outlined along the way.

Early Problems and Solutions

In 1964, when the earliest PCM system at Haskins Laboratories was begun, the challenge for our designers was simply to create a system where none could be bought. Although PCM was common in telephony, there were no commercial systems available for programmable computers. We therefore designed a system to be interfaced with a Honeywell DDP 24 computer (and later on a DDP 224) with 8K of memory. Although only brief stretches of speech could be digitized directly into core memory, double buffering allowed the system to deal with continuous speech input; that is, the incoming digital stream was stored alternately in one of two buffer areas of memory while earlier samples were being read out from the other buffer and written to digital tape. For output, 2.8 seconds of speech could be called up at will, directly from core memory. Longer sequences could be compiled onto digital tape, and then read off from the tape in near-real time. For two-channel synchronized output, the samples stored on the tape alternated between the two speech channels. Later, faster disks became available, so that long, one-channel sequences could be output without going to tape. The same might have happened for the two-channel output, except that technology passed this system by, and it disappeared when the DDP 224 was liquidated for its gold content in 1982.
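The double-buffering idea in the preceding paragraph can be illustrated with a short sketch. This is a sequential Python simulation under stated assumptions, not the original DDP 24 code: the function name, the buffer size, and the list standing in for the digital tape are all hypothetical, and on the real hardware the filling of one buffer overlapped in time with the flushing of the other.

```python
def double_buffered_capture(samples, buffer_size):
    """Simulate ping-pong buffering: fill one buffer while the other is flushed.

    `samples` stands in for the incoming digital stream and the returned
    list stands in for the digital tape (both are illustrative assumptions).
    """
    buffers = [[], []]
    active = 0          # index of the buffer currently being filled
    tape = []
    for sample in samples:
        buffers[active].append(sample)
        if len(buffers[active]) == buffer_size:
            full = active
            active = 1 - active            # new samples now go to the other buffer
            tape.extend(buffers[full])     # "write" the full buffer to tape
            buffers[full].clear()          # the flushed buffer is free for reuse
    tape.extend(buffers[active])           # flush whatever remains at the end
    return tape


if __name__ == "__main__":
    print(double_buffered_capture(list(range(10)), buffer_size=4))
```

Because the two buffers alternate, every sample reaches the simulated tape in its original order regardless of how the input length relates to the buffer size.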

The next challenge was to meet the growing demands of an increasing research field by adding more channels which could access a set of common disks, avoiding both the recording on digital tape and the limitation to one user at a time. The result was a multi-channel PCM system, which was designed by Leonard Szubowicz, Rod McGuire and E. R. Wiley and implemented with the collaboration of Richard Sharkany. It consists (it is still in use) of four output channels and two input channels, controlling DMA (Direct Memory Access) boards and filled continuously in FIFO (First In, First Out) circuits. Memory is dynamically allocated to each active channel; the amount is trimmed back as other requests come in, or expanded as other channels become inactive. The advantage of this memory management is that large memory areas make the rare FIFO shutdown (i.e., data did not arrive in time) even rarer. The advantage of FIFO organization is that buffers can be filled with less concern for time-critical disk accesses. A drawback is that the system does not know exactly where in the output it is, since only the DMA has that information, so that the controlling computer cannot receive an exact reading of how far the sequence has gone.

While the speech waveform is the primary signal of interest at this laboratory, other analog signals such as the output of transducers measuring the speech articulators and muscles (electromyographic (EMG) signals) are also used. Many such signals are more restricted in the frequency domain, and thus can be represented adequately at slower sampling rates. The lower the rate, the less disk space is used. Even for speech, some purposes are well-served by the 10 kHz rate, while others need the information between 5 kHz and 10 kHz which is preserved at the 20 kHz rate. Each of the six channels can be used at a 10 kHz or 20 kHz sampling rate. One input channel and one output channel also support the rates of 100, 200, 500, 1000, 2000, 4000, 5000, 8000, and 16000 samples per second. If necessary, these two channels can be connected to an external clock which can be run at any rate up to 50 kHz.

When the system was designed, computer memory was quite limited, so the simplest, memory-intensive solutions to real-time output were not available. To obtain a large throughput from a small system, our design undid the major advance in computation, von Neumann's use of data and instructions in the same area. Given the small address area of our platform, the PDP 11/04, there was very little room to write a program and extremely little left over for data. To overcome this limitation, additional memory was attached, even though the processor could not access it. However, the DMA's could, and the program was capable of telling them how to do so. In this way, an adequate amount of memory was available to sustain a throughput of about 40,000 data samples per second, divided among up to four channels.

A continuing concern was the synchronization of any two PCM channels. This was accomplished by setting any two channels to wait for the same clock. When the clock is started, the two channels begin at exactly the same time. The primary goal of this feature was the easy creation of stimulus tapes for dichotic listening procedures (Cooper & Mattingly, 1969). It also allowed the simultaneous input of two analog channels, e.g., speech and laryngograph. Further, an output and input channel could also be synchronized, allowing for such features as resampling a file with different characteristics (such as sampling rate).

Sometimes, it is convenient to have an arbitrary audio signal which marks the passage of a certain portion of the main signal. An example is the use of a tone to start a clock for a timed response from a subject. While this could be accomplished by having a second, synchronized channel outputting such a "mark tone," the lack of variation in the signal allows for a simpler solution, and one which would allow marktones to accompany two-channel output. Each output channel is thus associated with a marktone channel, which allows the output of an unvarying audio signal (a 1 kHz tone, in this case) without any increase in processing load. Whenever a sample is output which has the second highest bit set, a 1 kHz tone 4 ms in duration is simultaneously output on the marktone channel. This tone can be recorded along with the main signal, allowing (for example) the synchronization of the main signal with other devices, such as a reaction timer. Since the marktone is essentially part of the data stream, it does not impose any further load on the system: the second highest bit is part of the 16-bit word that is stored in the computer, but not part of the 12 bits of data. Thus, marktones can be freely intermixed with either or both channels of synchronized output.
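As an illustration of how a marktone rides along in the data stream, the sketch below scans a sequence of 16-bit words and reports the times at which the marktone bit is set. It is a minimal Python sketch, not the Haskins software: the function and constant names are hypothetical, and it assumes the "second highest bit" is bit 15 when bits are counted from 1 at the low-order end (mask 0x4000).

```python
MARK_TONE_MASK = 1 << 14   # assumed: second highest bit of a 16-bit word


def mark_tone_times_ms(words, sample_rate_hz):
    """Return the onset time (in ms) of every sample whose marktone bit is set."""
    return [
        1000.0 * index / sample_rate_hz
        for index, word in enumerate(words)
        if word & MARK_TONE_MASK
    ]


if __name__ == "__main__":
    # Four samples at an illustrative 1000 samples per second;
    # the second and fourth carry the marktone flag.
    words = [2048, 2048 | MARK_TONE_MASK, 100, 5 | MARK_TONE_MASK]
    print(mark_tone_times_ms(words, sample_rate_hz=1000))
```

Since the flag occupies a bit outside the 12 data bits, setting it leaves the sample value itself untouched.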

While the PCM system just described is still in use at Haskins Laboratories, it is no longer the only system in use there. Input and output (A/D and D/A) boards from Data Translation, Inc., have been added to several VAXstations (from Digital Equipment Corp., or DEC) and made compatible with the file and data formats from the older system. Such features as the file format, the synchronization of channels, and the characteristics of the filters have been maintained. So while the convenient features of the old system can be included in the new systems, these systems, unlike the original, can be duplicated at other laboratories.

Computer Environments

The main Haskins PCM system, with its four output channels and two input channels, consists of a PDP 11/04 (Digital Equipment Corp.) which shares disks with a VAX 11/780 (DEC) via a Local Area VAX Cluster. These disks contain the computer files which store the digitized samples of the PCM system. The VAX and the 11/04 communicate via two 16-bit parallel programmable I/O interfaces. Control parameters, such as disk addresses and start or stop signals, are passed from the VAX to the 11/04, and status words are passed back to the VAX. When input or output is being performed, the 11/04 has priority on the disks, allowing it the best chance of completing its time-sensitive tasks. For both input and output, the disk files must be contiguous, rather than being spread across several segments as an ordinary file would be. If the file were not contiguous, computing an address for a file extension and repositioning the heads would often take longer than the amount of time used to output the data obtained on the previous disk access.

The newer systems use Data Translation A/D and D/A boards installed in MicroVAXes or VAXstations. In contrast with the older system, the PCM data must pass through the main CPU. This requires the process performing the input or output to be set to real-time priority, but does not automatically exclude other jobs from running on the computer. Having only the PCM job, however, reduces the chance that the data cannot be read off the disk within the time allowed. Also unlike the older system, the new systems support only a single user. And though there are two output boards on most of the new systems, they both demand the same CPU resources, so only one signal, or two synchronized signals, can be processed at a time.

Dynamic Range

Dynamic range is the ratio of the maximum to minimum amplitude difference in the signal which can be accurately represented. Thus, the primary limitation on this is the number of bits of resolution used for representing the data. The Haskins PCM format for data consists of 12 bits of digitization, which can represent 4096 distinct values. These are stored in 2 byte (16 bit) words; the upper four bits, the ones not used for data, contain output control information (see § 6.2). 16 bit systems are quite common, and form the basis of digital audio systems. 8 bit systems, which can represent 256 distinct values, are used in many personal computers, but they do not have adequate resolution for many research purposes. The coding itself is simply a binary representation of the quantized voltage. Most systems, including the Haskins one, avoid having a sign bit by adding a dc offset half as large as the dynamic range. For a 12 bit system, this means that the original representations of -10 V to +10 V as -2048 to 2047 will be stored machine-internally as values ranging from 0 to 4095. (Thus the dynamic range is, more accurately, -10 V to +9.995 V, since one value of the coding scheme must be used for zero, thus leaving the range one value off center; for the rest of this paper, the value +10 V will be used, even though 9.995 V is meant.) In the Haskins system, each value is represented as a 16-bit number.

With a 12 bit system, the theoretical dynamic range is 72.2 dB. This is calculated from the formula $20 \log_{10}(2^n)$, where $n$ is the number of bits in the system. Conveniently, this reduces to $6.0206n$. Machine-internal noise effectively reduces this by one bit, yielding a more realistic estimate of 66.2 dB. By contrast, a 16 bit system has a theoretical range of 96.3 dB, and an 8 bit system, 48.2 dB.
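The figures quoted in this paragraph can be checked with a few lines of Python. This is only an arithmetic sketch; the function name is an illustrative choice, and the one-bit reduction for machine-internal noise is taken directly from the text.

```python
import math


def dynamic_range_db(bits):
    """Theoretical dynamic range, 20 * log10(2 ** bits), in dB."""
    return 20 * math.log10(2 ** bits)


if __name__ == "__main__":
    for bits in (8, 12, 16):
        print(bits, "bits:", round(dynamic_range_db(bits), 1), "dB")
    # One bit lost to machine-internal noise in the 12 bit system:
    print("12 bits, less one:", round(dynamic_range_db(11), 1), "dB")
```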

When digitizing, the system cannot differentiate between signals which reach the upper or lower quantization limits and those which exceed them and thus fall outside the dynamic range. Any signal which exceeds either of the limits will therefore be truncated to the limiting value, resulting in "peak clipping." While the clipping of a single sample will have relatively benign consequences, many successive peak clipped samples will result in an obnoxious noise and unreliable frequency analysis of the clipped region. The only remedy for peak clipping is avoiding it by re-inputting the signal at a lower level.

Any PCM system has inherent limits on the size of differences in the input voltage that can be represented accurately. Analog values which fall within the range of one bit will be given a single digital value. The divergence from the original signal due to these limits is called "quantization error." Since the voltages of -10 V to +10 V are covered by 12 bits in the Haskins system, the quantization error is 4.88 mV (or 0.0244%) for signals using the entire dynamic range. For low amplitude sounds using less of the dynamic range, the quantization error will be larger, in terms of percent.
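Peak clipping and quantization error can both be seen in a small model of a 12 bit converter. The Python sketch below is an idealized illustration, not a description of the Haskins hardware: the function names are hypothetical, and flooring each voltage to the bottom of its step is an assumption (a real converter may round differently).

```python
def quantization_step(v_min=-10.0, v_max=10.0, bits=12):
    """Voltage width covered by one digital value."""
    return (v_max - v_min) / 2 ** bits


def quantize(voltage, v_min=-10.0, v_max=10.0, bits=12):
    """Map a voltage to an offset code (0 .. 2**bits - 1), clipping at the limits."""
    step = quantization_step(v_min, v_max, bits)
    code = int((voltage - v_min) // step)      # assumed: floor to the step below
    return max(0, min(2 ** bits - 1, code))    # out-of-range input is peak clipped


def error_percent(signal_span_volts, v_min=-10.0, v_max=10.0, bits=12):
    """Quantization step as a percentage of the voltage span the signal uses."""
    return 100.0 * quantization_step(v_min, v_max, bits) / signal_span_volts


if __name__ == "__main__":
    print(round(quantization_step() * 1000, 2), "mV per step")
    print(quantize(-10.0), quantize(0.0), quantize(10.0), quantize(25.0))
    print(round(error_percent(20.0), 4), "% at full range")
    print(round(error_percent(2.0), 4), "% for a signal using a tenth of the range")
```

A signal that uses only a tenth of the range sees the same 4.88 mV step, which is why its error is ten times larger in percentage terms.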

Timing Resolution

The frequency at which the system examines the analog signal and codes it into a digital number is the "sampling rate." This rate imposes a limit on the frequencies within the original signal which can be accurately represented. If there is an input signal which has a frequency higher than half of the sampling rate, its samples will be indistinguishable from those of a lower frequency signal. This shift in apparent frequency is called "aliasing," and the frequency above which the effect occurs is called the Nyquist frequency (see § 6.1).
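The apparent frequency of an aliased component follows from folding the input frequency around multiples of the sampling rate. The Python sketch below uses that standard relation for an ideal, unfiltered sampler; the function name is an illustrative choice.

```python
def apparent_frequency(frequency_hz, sample_rate_hz):
    """Frequency that a sampled sinusoid appears to have after aliasing."""
    nearest_multiple = round(frequency_hz / sample_rate_hz)
    return abs(frequency_hz - nearest_multiple * sample_rate_hz)


if __name__ == "__main__":
    # At a 10 kHz sampling rate the Nyquist frequency is 5 kHz.
    for f in (3000, 7000, 12000):
        print(f, "Hz appears as", apparent_frequency(f, 10000), "Hz")
```

Components below the Nyquist frequency are returned unchanged, while a 7 kHz component sampled at 10 kHz is indistinguishable from one at 3 kHz.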

The sampling rate also imposes limits on the accuracy of frequency measurements for some aspects of the speech signal: formants and, most noticeably, the fundamental frequency (F0). For a file sampled at 10 kHz, an F0 of 100 Hz will be limited in accuracy to +/- 0.5%. While this is usually quite acceptable, there are times when greater accuracy is desirable. For higher F0's, however, the error due to temporal quantization is much larger. For a typical female F0 of 200 Hz, the accuracy is +/- 1%, and for a high (but not exceptional) child's F0 of 500 Hz, it goes to +/- 2.5%. All these figures can be cut in half for files sampled at 20 kHz, but even +/- 1.3% is variable enough to obscure some effects. The most clear-cut instance in which these differences become important is in the measurement of vocal jitter, that is, the difference in F0 between adjacent pitch periods (e.g., Baken, 1987, pp. 166-188). Here, the differences add up, because a half sample excluded from one period will be added into the next, increasing apparent jitter, when there may in fact be none. The cost of higher accuracy, in this case, is the larger storage space required. Doubling the sampling rate doubles the amount of disk storage needed.
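The accuracy figures above follow from treating a pitch period as uncertain by half a sample in either direction. The Python sketch below reproduces them under that assumption; the function name is hypothetical, and the result is the uncertainty in the period expressed as a percentage, which for small errors is approximately the uncertainty in F0.

```python
def f0_uncertainty_percent(f0_hz, sample_rate_hz):
    """Half-sample timing uncertainty as a percentage of one pitch period."""
    samples_per_period = sample_rate_hz / f0_hz
    return 100 * 0.5 / samples_per_period


if __name__ == "__main__":
    for f0 in (100, 200, 500):
        print(f0, "Hz:",
              f0_uncertainty_percent(f0, 10000), "% at 10 kHz,",
              f0_uncertainty_percent(f0, 20000), "% at 20 kHz")
```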

Another timing relationship is that between two channels which are started at the same time. For synchronized channels in the Haskins system, whether on input or output, the time difference between the two channels is nonexistent. Both channels read the same clock, and thus they both start at exactly the time that the clock starts. When digitizing, there is a minuscule amount of amplitude decay for the second channel, since the signals will be read off the sample-and-hold circuits after the 20 microseconds it takes for the first channel to perform its coding. However, since the decay for these circuits is measured in seconds, and the coding occurs at a delay which is considerably less than half of the sampling period, the reduction in amplitude is truly negligible. The important fact is that the two channels are triggered at exactly the same time, rather than half a sample apart.

The absolute simultaneity of the two channels has been preserved in our more recent systems based on commercially available boards. The input and output boards from Data Translation, Inc. have two channels available on each, but our system ignores the second channel and uses a second board instead. One consequence is that the two channels are completely simultaneous rather than slightly offset, as they are when the two channels of one board are used. A more practical consequence is that the samples from the two files do not have to be interleaved as they are read into memory. This saves a considerable amount of overhead for the system, allowing a much more flexible approach to the capture and presentation of simultaneous signals. Files of any length can be played together in any combination with no more processing time than for a single file.

Filter Characteristics

Every analog signal that is to be digitized, and every conversion of a digital signal into an analog one, benefits from the use of filters. Unfiltered digital output can produce severe "digitization noise," due to the sharp edges of the pulses that are produced by the digital samples. On input, frequencies which cannot be accurately represented must be filtered out so that they do not contaminate the signal with aliased sounds (see the end of the previous section). (Even if we are not interested in the nature of the signals above the Nyquist frequency, they must be filtered out to avoid contaminating the spectral content below the Nyquist frequency.) Since the limit is called the Nyquist frequency, the filters are called Nyquist filters.

A more specialized filter, which aids in the representation and analysis of high frequency sounds, is the high frequency pre-emphasis filter.

In creating a PCM file, the combination of filters to be used is specified in the program, and that combination is stored in the header of the new file.

For outputting a PCM file, the program determines the appropriate filters based on information in the file header. Once these are selected, they cannot be changed. Resetting the filters usually results in an audible click, which would be unacceptable in the midst of an output.

Nyquist Filters

The filters that Haskins systems use to eliminate frequencies above the Nyquist frequency are hardware filters with the response shown in Figure 1. Components below 4.8 kHz (or 9.6 kHz for the 20 kHz system) emerge with only minor reduction in amplitude, while those above are severely attenuated. At 5 kHz (or 10 kHz), the attenuation is at a maximum, approximately 50 dB. Most filters are described in terms of the number of dB per octave that the attenuation attains. Since the attenuation here is accomplished in much less than an octave, it is misleading to describe this cutoff in a dB/octave formula. Stated in those terms, these filters have a 1200 dB/octave attenuation, which is over 16 times larger than the entire dynamic range. Since it is theoretically impossible to attenuate a signal more than the dynamic range allows, this number is impossibly large. Instead, the filters should be described as sharply tuned and reaching the 3 dB attenuation level at 4.8 (or 9.6) kHz. In any event, the sounds above the Nyquist frequency have virtually no chance of affecting the signal any more than the background noise does.

[Figure 1 plot not reproduced: amplitude (0 to -50) against Frequency (Hz), 100 to 10000, with one curve for the 10 kHz rate and one for the 20 kHz rate.]

Figure 1. Resultant amplitude of 0 dBM test signals of differing frequencies after passing through the Nyquist filter. Measurements shown are for one system, but similar results obtain for other Haskins systems.


High Frequency Pre-emphasis Filters

For signals such as speech which are primarily driven by low frequency sources, the high frequency components generally have lower amplitude than the low frequency ones. Of course, high frequency signals of a given amplitude, being more intense, will sound louder than low frequency signals of the same amplitude, so that in a sense the high frequency signals are more perceptually salient than their amplitude would suggest. Nonetheless, early researchers found that the high frequencies, especially of speech, were difficult to measure or even detect when input at their natural level. In order to rectify this situation, a hardware filter was selected which could boost the high frequencies (before digitization) by a reliable and known amount. A complementary filter could then reduce their amplitudes by the same amount when the digitized signal was played out. There is a slight gain in accuracy of the digitization, since the quantization error will be a smaller proportion for a signal which uses more of the dynamic range. For the /f/ noise to be examined in Figure 3, for example, the quantization error is about 0.488% for the non-pre-emphasized signal while it is about 0.029% for the pre-emphasized signal. Although this difference is sizable, the improvement in quality may not be very noticeable to the naked ear [though see Whalen (1984) for a demonstration of perceptual effects of differences that are not consciously detectable].

Figure 2 shows the pre-emphasis function used with the 20 kHz sampling rate. The response is fairly linear up to 1 kHz, then rises exponentially, shown as a straight line in Figure 2, where frequency is represented in a log scale. On output, a filter with exactly the reverse characteristics is used. Thus if the amplitude value is read as a decrement, this figure can be used to represent the de-emphasis filter as well. The same filter is actually used for the 10 kHz rate, but since the Nyquist filter (which in this case functions as an anti-digitization noise or "anti-imaging" filter) follows it, there will be nothing left above 5 kHz.

[Figure 2 plot not reproduced: amplitude (0 to 30) against Frequency (Hz), 100 to 10000.]

Figure 2. Resultant amplitude of 0 dBM test signals of differing frequencies after passing through the high frequency pre-emphasis and Nyquist filters. Symbols represent measurements for one system, and the line is a fitted polynomial. Because of the Nyquist filter, the output level drops steeply at 10 kHz (not shown).


Ideally, the pre-emphasis filter should equalize the long term speech spectrum so that the maximum use of the dynamic range is achieved for each frequency region. Clearly, no one filter shape can serve this function, since different speakers, and even the same speaker at different times, will generate different long-term spectra. The shape of the pre-emphasis function is a compromise based on the sorts of long term spectra encountered in the early research. The function is not based on properties of the human auditory system, though it bears a superficial resemblance to the ear's increase in sensitivity between 1500 and 4500 Hz (e.g., Robinson & Dadson, 1956). There is also some resemblance to the historically later Dolby noise reduction systems. Although Dolby systems have become standard in the recording industry, there are good reasons not to use them as part of a PCM system. While the Dolby system greatly increases the separation of low intensity, high frequency signals from the noise encountered on playback from audio tape, it would be inappropriate to use it as a front-end to a digitizer, since digitized signals are not subject to media noise. (Even for signals which are simply recorded on audio tape for later digitization with a PCM system, Dolby noise reduction may be inappropriate. The net effects of the Dolby filters may be benign in terms of intelligibility, but finer acoustic measurements, e.g., the bandwidths of formants which happen to lie at the edge of one of the four Dolby bands, may be affected. In addition, having the tape noise at a constant level makes it easier to take into account when comparing the amplitude of speech sounds. Reducing the tape noise for high frequency sounds would reduce their amplitude compared with low frequency sounds which included the noise.) Similarly, there are digital techniques such as first-differencing which can have similar effects without requiring the hardware filters. However, such digital filters are neither sharp enough nor linear enough for many of the measurements that are made in the speech field. So, for consistency and reproducibility, the hardware filter approach has the most benefits. This system does have the drawback that the PCM representation of these signals cannot be played back faithfully on other systems unless the other systems have the same filter. (They can be played back without the de-emphasis filter, and the speech is usually quite recognizable, just distorted by the additional amplitude in the high frequencies.) For many purposes, such representations are adequate.
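For readers unfamiliar with the digital alternative mentioned above, first-differencing replaces each sample with its difference from the preceding one. The Python sketch below is a generic illustration of that technique and is not the Haskins hardware filter; the function name and the `coefficient` parameter are illustrative assumptions (a coefficient of 1.0 gives a pure first difference).

```python
def first_difference(samples, coefficient=1.0):
    """Digital pre-emphasis: y[n] = x[n] - coefficient * x[n - 1]."""
    output = []
    previous = 0.0
    for sample in samples:
        output.append(sample - coefficient * previous)
        previous = sample
    return output


if __name__ == "__main__":
    # A constant (zero frequency) input is suppressed after the first sample,
    # while an input alternating at the Nyquist frequency is doubled.
    print(first_difference([1, 1, 1, 1]))
    print(first_difference([1, -1, 1, -1]))
```

The two test inputs show the high frequency boost at its extremes; the gentle slope between them is one reason the text describes such filters as not sharp enough for some measurements.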

Figure 3 shows the effect of this pre-emphasis filtering system. In the top panel is the waveform of the word "fast," with the high frequencies pre-emphasized. The characteristically weak /f/ fricative noise is easy to discern in the first 100 ms. In the bottom panel, exactly the same signal (input synchronously on the second input channel) is shown in its non-pre-emphasized version. The onset of the /f/ noise is very difficult to discern at this level of resolution. The middle panel of Figure 3 shows the result of magnifying the display of the bottom panel by a factor of three. The shape of the fricative noise is now somewhat clearer, though the gradualness of the beginning of the noise is still somewhat hard to make out, but the vocalic segment (/æ/) is now (visually) peak-clipped. Along with the fricative noise, the low-frequency, dc air flow noise can also be seen. Such information is useful for recognizing less than optimal recordings, but it is not part of the speech signal. With pre-emphasis, the shape of both the fricative noise and the vocalic segment are evident, and there is no need to use separate magnifications to make them so. While the /f/ noise could have tolerated much greater pre-emphasis, the /s/ noise (around 375-450 ms), which also contains high frequencies, could not.

Pre-emphasis is not without its cost in other regards, however. Although the frequency analysis of the high frequencies is more accurate, the amplitude values of those frequencies relative to low frequency components are inflated. While the amount of change is predictable, it is not terribly convenient for humans looking at the display to calculate. When many comparisons of, say, the amplitude of F4 to that of F1 are to be made, pre-emphasis is definitely a drawback. If F5 is in question, however, it may be that the structure of the formant itself is not discernible without the pre-emphasis, so that the translation of the amplitude is a necessary evil. Such comparisons are relatively rare, however, and most researchers take advantage of the greater resolution in pre-emphasized digitization.

[Figure 3 waveforms not reproduced: three panels, "With pre-emphasis" (top), "Without, magnified by 3" (middle), and "Without pre-emphasis" (bottom), each with amplitude from -10 to +10 against Time (in ms), 0 to 500.]

Figure 3. Waveforms of the word "fast" under two sampling and two display conditions. The top and bottom panels represent the syllable with and without pre-emphasis, respectively, at original amplitude. The middle panel is the non-pre-emphasized signal magnified by a factor of 3.

One other cost deserves mention, since it has already caused a certain amount of confusion in the literature (Fowler, Whalen, & Cooper, 1988; Howell, 1988; Tuller & Fowler, 1981). In that work, the amplitude of various speech signals was equated without the complete destruction of the speech information by a technique called infinite-peak-clipping (Licklider & Pollack, 1948). For each sample of the signal, positive values are amplified to the maximum level and negative values to the minimum. The result is an irritatingly noisy, though usually recognizable, utterance. If the original file was pre-emphasized, however, it would normally go through the de-emphasis filter. When output through the de-emphasis filter, the high frequencies are lowered in amplitude, so that signals with different frequency components would once again have different amplitudes, despite the infinite-peak-clipping. If the de-emphasis filter is avoided (which can be done by changing the PCM file header), the intended result is obtained even for pre-emphasized files. (The pre-emphasis filter rarely changes the sign of a sample, though it can happen when an intense high frequency sound occurs with a simple low frequency sound.)
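Infinite-peak-clipping as described above is simple to express on signed sample values (that is, after the midline has been subtracted). The Python sketch below is a minimal illustration rather than the code used in the cited studies; the function name is hypothetical, the default limits are those of a 12 bit system, and leaving zero-valued samples at zero is an assumption the text does not address.

```python
def infinite_peak_clip(samples, maximum=2047, minimum=-2048):
    """Push positive samples to the maximum and negative samples to the minimum."""
    clipped = []
    for sample in samples:
        if sample > 0:
            clipped.append(maximum)
        elif sample < 0:
            clipped.append(minimum)
        else:
            clipped.append(0)      # assumed: exact zeros are left unchanged
    return clipped


if __name__ == "__main__":
    print(infinite_peak_clip([5, -3, 0, 2047, -1]))
```

Only the sign of each sample survives, which is why the zero crossings, and with them much of the speech information, are preserved.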

Another technique, which results in a sound called "signal-correlated noise" (Schroeder, 1968), interacts with the pre-emphasis function. Signal-correlated noise retains the amplitude contour of the source sound but has a flat spectrum. The samples of approximately half the digitized source have their signs changed at random while the magnitude remains the same. The overall energy remains the same, since the same amount of deviation from the baseline is present. But, since the direction the wave takes is randomly related to its original direction, the spectrum of the signal is flat. For a pre-emphasized original signal, however, the spectrum of the signal-correlated noise is flat only machine-internally. If the noise passes through the de-emphasis filter, the high frequencies will fall off by the amount specified in Figure 2. This does not restore any of the spectral structure of the original, but the spectrum is not perfectly flat either. Avoiding the de-emphasis filter will not salvage the noise, since that would maintain the flat spectrum but change the amplitude contour. For sounds which are going to have signal-correlated noise stimuli created from them, a non-pre-emphasized original is preferable. Alternatively, a brief description of the deviation from a flat spectrum (the high frequency roll-off) is necessary.
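The sign-flipping procedure for signal-correlated noise can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Schroeder's or the Haskins implementation: the function name and the `seed` parameter are hypothetical, each sample is flipped independently with probability one half, and the input is taken to be signed values measured from the baseline.

```python
import random


def signal_correlated_noise(samples, seed=None):
    """Reverse the sign of roughly half the samples, chosen at random."""
    rng = random.Random(seed)          # seed only makes the example repeatable
    return [-s if rng.random() < 0.5 else s for s in samples]


if __name__ == "__main__":
    source = [100, -200, 300, -400, 500]
    noise = signal_correlated_noise(source, seed=1)
    print(noise)
    # Magnitudes, and therefore the overall energy, are unchanged.
    print(sum(s * s for s in source) == sum(n * n for n in noise))
```

Because each magnitude is kept, the amplitude contour and the overall energy match the source exactly; only the direction of the waveform is randomized.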

Haskins PCM File Formats

The information in this section is quite detailed, and will be of interest primarily to users of the Haskins system. The kinds of information included, though, may be of interest to users of other PCM systems. The format of digitized files takes advantage of the special features of the Haskins PCM hardware (such as marktones) and of in-house programs (such as the labels of the waveform editor WENDY). For third party software, modifications are required. For example, the ILS package of Signal Technology Inc. is a large set of programs for doing signal analysis. By default, these programs expect a header format in PCM files that contains some of the same information as Haskins headers but puts it in different locations. The input and output routines have been changed so that ILS can put its information at an otherwise unused part of the header, leaving the rest in the Haskins format. Another alternative that is employed by some newer Haskins programs is to translate from one header format to the other, and create two versions of a file if needed.

These features will be discussed in the order in which they appear in the computer file. The first component of the file is a header block of 512 bytes, which contains information about the characteristics of the data. The next is the data itself, taking up as many 512 byte blocks as are needed to accommodate the number of samples in the file. The final, optional portion is a section of up to four trailer blocks containing labels of locations within the file. (This label format is in the process of being superseded by separate label files.)

The conventions presented here are not intended as a standard (cf. Mertus, 1989), since there are many concerns which are not adequately addressed by this format. Just to give one example, there is currently a word in the header to indicate the number of bits of resolution (always 12 for current Haskins systems), but this format may not be optimal for a more broadly defined standard. The present discussion is intended to make the information more accessible for those laboratories which do use the format already, and to bring the Haskins conventions to the attention of those devising their own systems.

PCM Headers

The initial portion of each PCM file consists of a "Header" which contains attributes of the sampled data within the file. For some files, especially those from the Haskins Physiological Speech Processing (PSP) system, the header also establishes a correspondence between time and sample position within the file. The first file block of the PCM file (512 bytes on DEC systems) is used, though for speech files much of it is simply zero-filled. Physiological files contain more information (see below).

PCM Data

The PCM data begin in the first block immediately following the header block. Samples are stored as fixed length 128 byte records of 64 words, and are usually input into contiguous files, though the files do not have to be contiguous for analysis programs which do not do real-time output. To output a section of a sampled data file with the older system, it must be contiguous. The newer systems can read noncontiguous files into memory sufficiently fast to keep the real-time output going.

One 12-bit sample is stored in the low order bits of each 16-bit word. This 12-bit sample represents a bipolar analog voltage that ranges from endpoints set near -10 and 10 volts. The four high order bits in each 16-bit word form a control field that is utilized by the audio output system. When samples are read for analysis within the computer, this control field must be cleared before subtracting the midline. That is, if one of the control bits is set, it will appear to the general computer as a legitimate part of a number, even though it would be far outside the dynamic range. Normally, these bits should also be cleared when samples are written out to a PCM file. Programs that generate speech files must truncate the samples to avoid overflow into the control field.

The following is the format of the data word:

bit position    description
1 - 12          data field
13              if set, data field is an inter-stimulus-interval value
14              if set, something is wrong
15              if set, a mark tone will be generated at that sample
16              if set, something is wrong

To conform with the conventions used by the A/D and D/A converters at Haskins Laboratories, the signal voltage levels are encoded digitally in excess-2048 form, that is:

-10 volts is encoded to 0
  0 volts is encoded to 2048
 10 volts is encoded to 4095

Thus, a 16-bit bipolar digital value that ranges from -2048 to 2047 can be obtained by subtracting 2048 from the 12-bit encoded sample value.
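The data word handling described above (clear the control field, then subtract the midline) might be sketched as follows. Several points are assumptions made for this illustration and are not stated in the text: that bit position 1 is the least significant bit, that words are stored in little-endian byte order, and that out-of-range samples are clamped when writing. All function and constant names are hypothetical.

```python
import struct

DATA_MASK = 0x0FFF        # bit positions 1-12: the 12-bit sample
ISI_BIT = 1 << 12         # bit position 13 (assumed LSB = position 1)
MARK_TONE_BIT = 1 << 14   # bit position 15 (assumed LSB = position 1)
MIDLINE = 2048            # excess-2048 encoding: 0 volts is stored as 2048


def decode_word(word):
    """Return (signed sample, mark tone flag, ISI flag) for one 16-bit word.

    The control field is masked off before the midline is subtracted,
    so a set control bit cannot be mistaken for part of the sample.
    """
    sample = (word & DATA_MASK) - MIDLINE
    return sample, bool(word & MARK_TONE_BIT), bool(word & ISI_BIT)


def encode_sample(value):
    """Clamp a signed value to -2048..2047 and store it with no control bits."""
    clamped = max(-2048, min(2047, value))
    return (clamped + MIDLINE) & DATA_MASK


def decode_record(record):
    """Decode one 128-byte record of 64 words (little-endian is assumed)."""
    words = struct.unpack("<64H", record)
    return [decode_word(w)[0] for w in words]
```

With these definitions, an encoded value of 0 decodes to -2048, 2048 decodes to 0, and 4095 decodes to 2047, matching the voltage mapping given in the text.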


Haskins PCM Labels

Labels are used to record the position, and optionally the range, of user-defined portions of the PCM file. Each label consists of a string of alphanumerics (beginning, by convention, with a letter) which is a file-unique name for the label; a location, given in milliseconds from the beginning of the file; a left range and a right range, which can be set in terms of milliseconds in relation to the label; and a code to determine whether there is a mark tone or not.

The length of a single label is 32 bytes. The older style maximum number of labels was 64. (In the older style of programs, labels were stored in trailer block(s) of the PCM file immediately following the data blocks within the file.) If there are old-style labels stored in the file, the number is contained in a field in the header block (word 7). Many of our own programs currently change automatically from old to new style any time a PCM file is accessed.

The old format for labels in a Haskins PCM file:

byte position   length   description
1               4        label left range
5               4        label right range
9               4        label location (time value of label)
13              1        label mark tone flag
14              19       name of label

The unit for time representation is one 20,000th of a second, and the scope of a label is defined to extend from its time value minus its left range, to its time value plus its right range.
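A reader for one old-style label could be sketched as below. The field offsets follow the layout given above (byte positions counted from 1), but several details are assumptions for this example: that the 4-byte fields are unsigned little-endian integers, and that the name is ASCII padded with nulls or spaces. The function name and the dictionary keys are hypothetical.

```python
import struct

LABEL_SIZE = 32            # bytes per old-style label
TICKS_PER_SECOND = 20000   # time unit is one 20,000th of a second


def parse_label(raw):
    """Unpack one 32-byte old-style label into a dictionary.

    Assumes unsigned little-endian 4-byte fields and an ASCII name
    padded with nulls or spaces; neither is stated in the text.
    """
    if len(raw) != LABEL_SIZE:
        raise ValueError("an old-style label is exactly 32 bytes")
    left, right, location, mark_flag = struct.unpack("<3IB", raw[:13])
    name = raw[13:32].rstrip(b"\x00 ").decode("ascii")
    return {
        "name": name,
        "mark_tone_flag": mark_flag,
        "location_s": location / TICKS_PER_SECOND,
        # Scope runs from (time value - left range) to (time value + right range).
        "start_s": (location - left) / TICKS_PER_SECOND,
        "end_s": (location + right) / TICKS_PER_SECOND,
    }
```

For example, a label at time value 20000 with a left range of 100 and a right range of 200 spans from 0.995 s to 1.01 s.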

The new format consists of separate ASCII files containing label information coded by keywords, of which many are common but some are specific to one program. This allows for greater flexibility in the number of labels that can be maintained, convenient correction or even creation of labels with a text editor, and compact sharing of labels across several related files (such as physiological measurements of one event which might end up in a dozen different files). The implementation of this system is in progress, and eventually it will be the only one used by Haskins programs.

Summary

The Haskins PCM system is a combination of standard techniques and unique features. Copies have been built with custom-made hardware and, more recently, with commercially available boards. Some salient features are: convenient input and output of signals of any length (dependent on the system's disk rather than on the PCM system constraints); exactly simultaneous synchronization of two channels (either two output, two input, or an input and an output) without the need for interleaving the samples; consistent pre-emphasis of high frequencies for easier analysis, and converse de-emphasis for accurate reproduction; and the capability of having any number of marktones associated with a file without any added load on the system. This system has been used in generating the data for dozens of papers over the last twenty years, and will continue to be used both at Haskins Laboratories itself and at the growing number of laboratories which are using the system.

REFERENCES

Baken, R. J. (1987). Clinical measurement of speech and voice. College Hill: Boston.

Cooper, F. S., & Mattingly, I. G. (1969). A computer-controlled PCM system for the investigation of dichotic speech perception. Journal of the Acoustical Society of America, 46, S115 (A).

Fowler, C. A., Whalen, D. H., & Cooper, A. M. (1988). Perceived timing is produced timing: A reply to Howell. Perception & Psychophysics, 43, 94-98.

Goodall, W. M. (1947). Telephony by pulse code modulation. The Bell System Technical Journal, 26, 395-409.

Heute, U. (1988). Medium-rate speech coding - trial of a review. Speech Communication, 7, 125-149.

Howell, P. (1988). Prediction of the P-center location from the distribution of energy in the amplitude envelope: I. Perception & Psychophysics, 43, 90-93.

Licklider, J. C. R., & Pollack, I. (1948). Effects of differentiation, integration and infinite peak clipping upon the intelligibility of speech. Journal of the Acoustical Society of America, 25, 375-388.

Mertus, J. (1989). Standards for PCM files. Behavior Research Methods, Instruments, & Computers, 21, 126-129.

Robinson, D. W., & Dadson, R. S. (1956). A redetermination of the equal-loudness relations for pure tones. British Journal of Applied Physics, 7, 166-181.

Schroeder, M. R. (1968). Reference signal for signal quality studies. Journal of the Acoustical Society of America, 44, 1735-1736.

Tuller, B., & Fowler, C. A. (1981). The contribution of amplitude to the perception of isochrony. Haskins Laboratories Status Report on Speech Research, SR-65, 245-250.

Whalen, D. H. (1984). Subcategorical phonetic mismatches slow phonetic judgments. Perception & Psychophysics, 35, 49-64.


APPENDIX


Information Stored in the Haskins PCM File Headers

There are seven main header entries which occupy the first eight words of the header block. They are:

start   # of
word    words   description
1       1       DATA TYPE INDICATOR: a 1 in this field indicates a sampled data format file that will be recognized as such by Haskins software.
2       2       SAMPLED DATA SIZE: double precision integer representation of the size of the file (number of samples). The first word is the low order part of the count.
4       1       SAMPLING RATE: expressed as samples taken per second.
5       1       ATTRIBUTES: format of word:
                - bit 0 is the pre-emphasis flag: if 0, the data were pre-emphasized during sampling (the level of higher frequencies was boosted) and should be de-emphasized when output; if 1, the data were not pre-emphasized.
                - bit 1 is the filtering flag: if 0, the data were filtered during sampling at the Nyquist frequency; if 1, the data were not Nyquist filtered.
                - the remainder of the word (14 bits) is unused.
6       1       NUMBER OF ADDITIONAL HEADER BLOCKS: no longer implemented.
7       1       NUMBER OF LABELS: if greater than zero, then the file contains labels that are stored in the trailer blocks of the file; each label is of 32 byte length.

The remaining 249 words of the header block code the following:

start   # of
word    words   description
8       1       Revision level (indicates which version of the arrangement of information in the header is used).
9       2       Virtual block number of first trailer block (where old style labels are kept).
11      1       Number of trailer blocks (for old style labels).
12      1       Data source (currently, either VAX (1) or unknown (0)).
13      1       Number of bits of resolution (only 12 is implemented).
14      1       Source (no longer implemented).
15      50      Filler words.

PSP (Physiological Signal Processing) information:

start   # of
word    words   description
65      1       Datel hardware input mode: 0 = EMG data, which is already filtered and integrated; 1 = speech, which must be at 10 kHz to be synchronized with physiological measurements; 2 = LED (usually movement) data, in which the x and y values each take up a channel; 3 = Electropalatograph data, where each word represents the on/off state of the 63 contact points in the false palate.
66      5       Filler words.
71      1       PSP header version number.
72      1       Samples per frame.
73      1       Channel map (a sixteen bit word which serves as a bitmap representation of which of the sixteen possible input channels are actually being used).
74      1       Data file record size.
75      2       M calibration constant: Together with the B constant, this allows the machine units in the file to be interpreted as physical units. The physical value = M*(sample value) + B. So M is a scaling factor and B is an offset.
77      2       B calibration constant.
79      6       Calibration units (12 characters): A description of the units that result from the application of the calibration constants (e.g. "millimeters").
85      16      Index file name (32 characters): Name of a file which contains a catalog of the number of samples associated with each octal code (a time marker on the analog tape) for all the other PCM files which were created in the same input pass as this one. This information allows for the compensation for minor speed changes in the analog tape system.
101     2       Smoothing constant: If the file was smoothed (as is usual for EMG signals), this is the size (in milliseconds) of the base of the triangular averaging filter.
103     2       Line up point: location of an event chosen by the experimenter to coordinate the displays across PCM files. If PSP header version number = 0, then the line up point is in samples; if PSP header version number is not 0, then the line up point is in 1/20,000th of a second.
105     2       Graphics scaling - Y min.
107     2       Graphics scaling - Y max.
109     20      Filler words.
129     128     Filler words.

Note that the set of filler words of the header may contain the ILS header information if the file has been analyzed with the Haskins-modified version of ILS.
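As a rough illustration of how the main header entries listed in this appendix might be read, the sketch below unpacks words 1 through 7 of the header block. It assumes little-endian 16-bit words and that bit 0 of the attributes word is the least significant bit; neither is stated explicitly here. The function names and dictionary keys are hypothetical.

```python
import struct

BLOCK_SIZE = 512  # bytes per file block


def parse_main_header(block):
    """Read words 1-7 of a header block (little-endian words assumed)."""
    if len(block) < 14:
        raise ValueError("need at least the first seven header words")
    words = struct.unpack("<7H", block[:14])
    attributes = words[4]
    return {
        "is_sampled_data": words[0] == 1,
        # Two-word count: the first word is the low order part.
        "n_samples": words[1] | (words[2] << 16),
        "sampling_rate_hz": words[3],
        # A cleared flag bit means the operation WAS applied.
        "pre_emphasized": (attributes & 0x1) == 0,
        "nyquist_filtered": (attributes & 0x2) == 0,
        "n_labels": words[6],
    }


def data_block_count(n_samples):
    """Number of 512-byte blocks needed for n_samples 16-bit words."""
    return (n_samples * 2 + BLOCK_SIZE - 1) // BLOCK_SIZE
```

Note the inverted sense of the two attribute flags: a zero-filled attributes word describes a file that was both pre-emphasized and Nyquist filtered.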
