Three New Speech Coders


ABSTRACT

Many new speech coding standards have been created in the 10-year period 1987-1996. In this article the author reviews the key attributes that determine what coder to select for different applications. The article then focuses on three new speech coding recommendations from the ITU-T, namely G.723.1, G.729, and Annex A of G.729. They provide good coverage for a wide range of applications that have low bit rate requirements (i.e., from 5.3 to 8 kb/s). In addition to bit rate, the article reviews their delay, complexity, and performance. Also briefly reviewed are the history of these standards and the considerations that influenced the requirements each of these coders had to meet.

Three New Speech Coders from the ITU Cover a Range of Applications

Richard V. Cox, AT&T Labs Research

Prior to the era of digital communications, speech was transmitted and stored as an analog signal. Today it can be represented digitally, which allows us to store and transmit it more efficiently. The first speech coder was created almost 60 years ago in the form of Homer Dudley's vocoder. It was put to use for the first secure telephone during World War II. From then until the early 1970s, it seemed like only the military was interested in speech coding. All of this changed in the next two decades.

The telephone networks of the world became digital. 64 kb/s pulse code modulation (PCM), designed to transmit telephone bandwidth speech, made it possible to maintain uniform quality for long distance connections. Network operators soon realized that by using 32 kb/s adaptive differential PCM (ADPCM), they could double the capacity of important narrow bandwidth links, such as undersea cables. By the 1980s we were entering the era of the PC and the cellular phone.

Many new applications, such as digital cellular, voice messaging, videophones, multimedia documents, and Internet telephony, need digital speech coders. Each of these applications has its own requirements; consequently, many new coders were standardized in the 10-year period 1987-1996. Many additional proprietary coders are in use in applications that do not require standardization, such as digital telephone answering machines.

This article is about the standard coders and how an applications engineer, confronted with having to select a speech coder, might go about this. In particular, we focus on three new standards from the International Telecommunications Union (ITU): 8 kb/s G.729, G.729 Annex A, and 6.4 and 5.3 kb/s G.723.1. We begin with a discussion of speech coder attributes. Attributes are the most important key to selecting a speech coder. The type of application determines which attributes are most important and in which areas the engineer has flexibility to choose. Throughout this section we compare and contrast the attributes of these standardized coders. We also include a brief history of the three new ITU standards.

SPEECH CODER ATTRIBUTES

Speech coding attributes can be divided into four categories: bit rate, delay, complexity, and quality. The applications engineer determines which attributes are most important. It is possible to relax requirements for the less important attributes so that the more important ones can be met.

BIT RATE

Bit rate is an attribute that often comes to mind first when thinking of speech coders. The range of bit rates that have been standardized is from 2.4 kb/s for secure telephony to 64 kb/s for G.711 PCM and the G.722 wideband (7 kHz) speech coder. The three new ITU coders have the lowest bit rates yet standardized by the ITU. Table 1 is a list of standardized speech coders. It includes their bit rates, delays, and date of standardization.¹ Secure telephony bit rates are among the lowest because they were created for an environment where communications above 2.4 and 4.8 kb/s are not reliable. Operation at a fixed bit rate matched to the intended communication channel was the first and most important requirement in creating these standards. In the midrange we find digital cellular speech coders. The first generation had rates ranging from 6.7 kb/s for the Japanese Personal Digital Cellular (PDC) system to 13 kb/s for the Global System for Mobile Communications (GSM) system. These rates could support cellular systems with three to five times the capacity of the analog systems they would replace. Once again, bit rate was among the most important factors in selecting these coders. However, in each of these instances it was the wireless channel bit rate that was fixed and then divided between the speech coder bit rate and channel coding for error correction and detection. At the upper range we find the earlier ITU standards, 16 to 64 kb/s.

¹ For more information on these coders and speech coding in general, the reader is referred to [1].

0163-6804/97/$10.00 © 1997 IEEE. IEEE Communications Magazine, September 1997.


Figure 1. Relative speech coder quality.

than 0.2 ms. In contrast, the Japanese and North American time-division multiple access (TDMA) digital cellular systems share a single channel among three users. Thus, the transmission delay is 6.67 ms/frame. However, they also interleave two frames to combat fading; this adds an additional 20 ms to the transmission delay. The multiplexing delay is time spent waiting to transmit the bits, or to process the speech samples or the bitstream, at the encoder or decoder, respectively. It could be anywhere from 0 to 20 ms at both the encoder and decoder in either system. In this example, we will consider the ideal case and assume no multiplexing delay. Excluding multiplex delay, the circuit multiplication system has a delay of 30.2 ms, and the digital cellular system has a delay of 71.67 ms, even though both use the same speech coder.

Most speech coders are implemented on DSPs or other special-purpose hardware. However, recent multimedia speech coders have been implemented on the host CPU of personal computers and workstations. The measures of complexity for a DSP and a CPU are somewhat different, due to the natures of these two systems. At the heart of complexity is always the raw number of computational instructions required to implement the speech coder. DSPs from different vendors have different architectures, and consequently different efficiencies in implementing the same coder. The measure used to indicate the computational complexity is the number of instructions per second required for implementation. This is usually expressed in millions of instructions per second (MIPS). (For some of the coders in Table 1, here are some typical MIPS values: G.726 and G.727, 2 MIPS; G.723.1, 16 MIPS; G.729, 20 MIPS; G.729A, 11 MIPS; and G.728, 30 MIPS.) DSPs also have high-speed static random access memory (RAM) on-chip for storing variables and data. Usually this is from 1000 to 10,000 words of RAM. The amount of RAM required is a second measure of complexity. (For the coders in Table 1, G.726 and G.727 take less than 100 words of RAM; the other ITU coders range from 2000 to 3000.) Finally, DSPs have on-chip read-only memory (ROM) for storing constants and the program instructions. Required ROM storage is the third measure of complexity. (For the ITU coders in Table 1, G.726 and G.727 use about 1000 words of ROM, while the others are typically in the range of 10,000 words.)

For an implementation on a PC or workstation, the number of instructions per second is the only relevant measure. General-purpose computers have far more RAM (although it is usually slower-speed, less expensive dynamic RAM). Both the RAM and ROM from the DSP implementation will be stored in RAM to run the program on a computer. However, general-purpose computers tend to have less efficient processor architectures for implementing digital signal processing algorithms. Consequently, 10 MIPS on a DSP may require far more cycles on a computer.

Quality is the category that has the most dimensions [2]. When speech coding was only used for secure telephony, quality was synonymous with intelligibility. The first requirement for secure speech communications was that the decoded speech be intelligible; otherwise, its security was not going to help. The earliest telephone network speech coders operated on a sample-by-sample basis. Their quality was more or less directly related to their signal-to-noise ratio (SNR) in quantizing the signal samples. Intelligibility was not even an issue. An SNR high enough for speech was also high enough for other audio signals that might also be encountered, such as background noise or music. However, at the low bit rates used for secure telephony, the speech was coded based on a speech production model. This model was not capable of modeling music or the combination of speech plus some other signal. The result was that when the input speech contained background noise, the performance of the speech coder degraded significantly. This problem is known as robustness to background noise. Much of the research in the past decade has been on trying to make low-bit-rate speech coders more robust to extraneous noises and signals. This problem of robustness for noisy input speech has been observed for all speech coders below 16 kb/s.

In the previous paragraph we already touched on three aspects of quality: a single encoding of clean speech, noisy speech, and intelligibility. Others include performance for multiple encodings, for two different coders in tandem, for input signal levels that are either too high or too low, across a wide variety of voices and languages, and over noisy channels; the ability to carry network signaling tones; and the ability to carry voiceband modem and fax signals. The list could go on, but the reader should be able to tell that there are a great many dimensions to speech coder quality and performance that need to be investigated. One of the advantages of the ITU is that it includes the Speech Quality Experts Group. These are the foremost experts in the world at measuring speech quality and determining whether performance should be sufficient for a given application. This has allowed the ITU to conduct the most extensive speech quality testing of any standards body. For the most part it means that the ITU speech coding recommendations have had extensive testing, and their performance is the best understood.

A good example of the significance of comprehensive testing is what has happened with the first generation of digital cellular speech coding standards in Europe and North America. When these standards were first created in the late 1980s, they were considered a significant milestone for speech coding. Hundreds of billions of dollars would be spent on these systems, and the fact that they included speech coders was an indication of the maturity of the field. However, none of them were tested for robustness to noisy input speech. The first-generation systems were designed for car phones, but lightweight handheld phones quickly overtook car phones as the terminals of choice. These lightweight phones have the microphone significantly farther from the lips of the talker. This means a poorer input SNR for the speech signal. To compound this, portable phones and car phones are being used in far noisier environments than a traditional office or home telephone. Consequently, the quality of the input speech signal is far from ideal in most instances. The first-generation digital cellular speech coders were not well received by the public because their speech quality was judged to be inferior to the analog frequency modulated (FM) systems they were replacing. In 1995, both the North American TDMA and European GSM systems replaced their first-generation coders. The North American code-division multiple access (CDMA) system held a contest in 1995 to replace its first-generation coder as well. The new generation of coders all handle background noise conditions much better, reflecting the progress of the past six to eight years.

Having said that quality is very multidimensional, the author is still confronted with how to convey the relative quality of different speech coders. For this purpose we return to single encoding quality of clean input speech, since this is the test in which peak quality is measured and its results can be most readily understood. Figure 1 shows the relative quality of a variety of speech coders. It is not based on a real test; rather, it is based on the results of many tests, which the author has melded together to produce this figure. It is a useful device to indicate the relative quality of different coders. Of particular note is the fact that the ITU has demanded that the quality of low-bit-rate speech coders match that of ITU-T Recommendation G.726 32 kb/s ADPCM for several operating conditions. This was first a requirement for G.728, and subsequently for the other low-rate coders.

Within a standards effort, there are two more attributes of speech coders that are important to mention: specification and schedule. How a speech coder is specified varies depending on the intention of the standards body. At one extreme are the secure telephony standards. Only the bitstream is actually specified. Implementers may use any coder that is compatible with the bitstream for either the encoder or the decoder. In practice, they use a coder that has been matched to the arithmetic and computational capability of the processor. The difficulty with this type of specification is that there is no ensured level of quality. Implementation shortcuts may diminish the performance of the coder drastically while still allowing it to comply with the specification. It is very difficult to verify compliance for a bitstream specification. Most often it is done via subjective testing, comparing the implementation under test with a known version of the coder, perhaps including interworking with the bitstream of the known version. Such testing is both long and expensive.

At the other extreme is a bit-exact specification in which not only the algorithm is fully specified, but also the arithmetic precision to be used for every single computational step. This type of specification fully ensures that coders implemented by different vendors will still give a level of performance equivalent to the coder originally tested during the standards-making effort. Two methods of verification have been used for bit-exact specifications. A set of test vectors may be created for both the encoder and the decoder. Each test vector consists of both an input and output sequence. In order to pass the test, the part under test should give an output signal that exactly matches the output signal of the given test vector. This method was used by the ITU (starting with G.721 32 kb/s ADPCM) and in European standards. The difficulty with this method is determining the test vectors for coders that are sometimes quite complex. It is impossible to create a set of test vectors that can verify 100 percent compliance. The second method is to specify the coder in a bit-exact computer program. Most commonly the language used is American National Standards Institute (ANSI) C. The virtue of this method is that the specification can be compiled to create a working model of the coder. All inputs to an implementation should give the same outputs as the C simulation model. A set of input test vectors may also be included, but other inputs can also be used for the testing. This method of specification has been adopted for the most recent ITU speech coders.
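The test-vector method described above amounts to an exact, sample-by-sample comparison between the output of the part under test and the reference output. The following is a minimal sketch in Python, not the procedure of any particular Recommendation. The names `CoderTestVector` and `passes_bit_exact` are hypothetical, and the two toy "coders" are assumptions that stand in for a real reference and a real implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CoderTestVector:
    """An input sequence paired with the output the reference coder gave."""
    input_samples: List[int]
    expected_output: List[int]


def passes_bit_exact(process: Callable[[Sequence[int]], List[int]],
                     vectors: Sequence[CoderTestVector]) -> bool:
    """Return True only if every output matches its reference sample for sample."""
    return all(list(process(v.input_samples)) == v.expected_output
               for v in vectors)


def reference_coder(samples: Sequence[int]) -> List[int]:
    """Hypothetical stand-in for the reference: arithmetic right shift by one."""
    return [s >> 1 for s in samples]


def shortcut_coder(samples: Sequence[int]) -> List[int]:
    """Hypothetical implementation that truncates toward zero instead."""
    return [int(s / 2) for s in samples]


# Reference outputs are generated once and then distributed as test vectors.
inputs = [[0, 2, 100, 7], [-3, -1, 5]]
vectors = [CoderTestVector(x, reference_coder(x)) for x in inputs]

print(passes_bit_exact(reference_coder, vectors))  # True
print(passes_bit_exact(shortcut_coder, vectors))   # False
```

In this toy case the two implementations agree on the first vector and differ only on the negative samples in the second, which illustrates why the choice of test vectors matters: a set without negative inputs would not have exposed the difference.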

The schedule of the standards body is an important determining factor in how the standards work proceeds. Historically, if the ITU has had more time to develop a recommendation, the speech coding algorithms that resulted contained more novel features and were more extensively tested. If the schedule was shorter, the coders were more likely to be derivative from previous coders, and the testing was more limited. ITU-T Recommendations G.721 32 kb/s ADPCM, G.728 16 kb/s LD-CELP, and G.729 8 kb/s CS-ACELP fall in the category of relatively long schedules. By contrast, the original Recommendation G.723 for 24 and 40 kb/s ADPCM, the new Recommendation G.723.1, and G.729 Annex A are examples of coders that had relatively short schedules and consequently less testing.

G.729

The process to create G.729 began in July 1990 when a preliminary set of performance requirements was drafted. The principal intended application was to create a toll-quality 8 kb/s speech coder for wireless applications. Considerable debate followed over the complexity and delay objectives for the coder. There were liaison exchanges with the International Consultative Committee for Radio (CCIR) (now the ITU Radiocommunications Standardization Sector, ITU-R) Task Group 8/1 about the intended targets. At the heart of the debate was the CCIR's desire for a coder that simultaneously offered low delay, low complexity, and high quality while operating at 8 kb/s over a noisy radio channel. The G.728 16 kb/s speech coder, already in the standardization process, had very low delay, but its complexity worried the CCIR. (To them it appeared that the complexity would make the coder use too much power and cost too much.) At a joint meeting in November 1991 of CCIR Task Group 8/1, International Consultative Committee for Telephone and Telegraph (CCITT, now ITU-T) Study Group XV (SG XV), and the Speech Quality Experts Group (SQEG, SG XII), a set of requirements and objectives was finally agreed upon.²

The schedule called for submission of candidate coders in September 1992. Two coders were submitted and their hardware implementations tested in 1993, but neither candidate met all the requirements. At a meeting in March 1994, it was decided to encourage the creation of an optimized coder combining the best aspects from both of the previous candidates and any additional contributions that would lead to meeting all requirements. The rest of 1994 was used to create the optimized coder. A test in early 1995 showed that the optimized coder met all requirements, and a second test in late 1995 provided additional characterization information. The results of the first test were sufficient that G.729 was put forward for determination in February 1995. (This is sort of a first ballot on the merits of a recommendation. It means that a stable draft of the recommendation will be circulated three months in advance of the next study group meeting.) In November 1995, more than five years after it was initiated, SG15 put G.729 forward for decision. (This is a second formal ballot of the national members of the ITU to ratify a recommendation. The process usually takes about three months before the ratification is completed and the new recommendation is placed on the list of current ITU recommendations.)

² Prior to 1993, CCITT Study Groups used Roman numerals. After the reorganization of the ITU and the creation of the ITU-T, the use of Roman numerals was discontinued. To be historically correct, both are used in this article as appropriate.

As mentioned, G.729 was created primarily for wireless applications. The bit rate was fixed at 8 kb/s from the outset, and it was agreed that this would not include channel coding. One of the critical trade-offs was between delay and complexity. Initially there was much sentiment for a 5 ms frame size, but it was felt this might result in too much complexity and could also jeopardize meeting the performance requirements. There was also sentiment for a 20 ms frame size, but it was felt that this might not sufficiently distinguish the coder from existing digital cellular standards. Ultimately, the frame size was set at 10 ms. This allowed the optimization group to meet the quality objectives while still delivering a speech coder with low enough complexity for the CCIR. The lookahead delay is 5 ms. Assuming 10 ms for computation processing, the total one-way codec delay is 25 ms. Some of the initial implementations of G.729 used about 20 MIPS and 3000 words of RAM. This is about a 30-35 percent reduction in MIPS compared with G.728, but about 1000 words more RAM.
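The 25 ms figure follows from adding the three contributions named in the text. Here is a minimal sketch in Python, assuming that one-way codec delay is simply frame size plus lookahead plus processing time; the function name `one_way_codec_delay_ms` is illustrative only.

```python
def one_way_codec_delay_ms(frame_ms: float, lookahead_ms: float,
                           processing_ms: float) -> float:
    """Sum the three contributions to one-way codec delay, in milliseconds."""
    return frame_ms + lookahead_ms + processing_ms


# G.729 figures quoted in the text: 10 ms frame, 5 ms lookahead,
# and an assumed 10 ms of computation time.
g729_delay = one_way_codec_delay_ms(frame_ms=10, lookahead_ms=5, processing_ms=10)
print(g729_delay)  # 25
```

The same function can be used to see how a different frame size would change the total, provided the lookahead and processing assumptions are stated alongside it.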

Table 2 shows some of the performance requirements and objectives that were defined in the Terms of Reference for G.729. The requirements not met initially were those for frame erasures and noisy input speech. Because the coder was created for wireless channels, it needed to exhibit robustness for both random bit errors and frame erasures. It was tested for 0.1 percent random bit error rate and performed as well as 32 kb/s G.726 ADPCM. It was also tested with 1, 3, and 5 percent random and bursty frame erasures. It degraded more than half a mean opinion score point from the clear channel quality for the 3 percent bursty case. More complete details on the testing results of G.729 are included elsewhere in this issue [3]. For much more information on the process of creating G.729, see [4], also in this issue.

Table 2. Speech quality performance requirements, objectives, and test results for G.729.

G.723.1

The entire process for G.723.1 was carried out in a more urgent mode than G.729. The first commercial very-low-bit-rate videophones were announced in 1992. Within the ITU there was considerable pressure to define a common family of standards for the entire phone, including the speech coder. In November 1992, SG15 authorized a special experts group to hold a series of meetings to determine the requirements for such standards. Meetings were held in February, April, and June 1993 for this group. SG15 gave authorization to commence the entire very-low-bit-rate videophone standardization effort in September 1993. The basic philosophy adopted for the speech coder was that it was to be a relatively fast standardization process, and that the speech coder should be immediately available upon selection. (Interestingly, in September 1993, the G.729 effort was not selected for the videophone coder because it was felt that an "off-the-shelf" coder could be standardized more quickly. In actuality, the two coders were completed simultaneously by SG15.)

In December 1993 a total of 10 coders were submitted as candidates for this effort. Since the effort was supposed to be a fast one, the speech quality testing program was created by the Experts Group with some guidance from SQEG members, but it was not as formal a testing program as that of G.729. The proponents for the coders were to submit MS-DOS executable files for their encoder and decoder. All of the processing performed by the host laboratory was to be done via software simulation. However, the coders were supposed to be simulated using integer processing rather than floating point. By March 1994 when the test was held, the number of candidate speech coders had dwindled to five. The test was administered by AT&T, which also served as the host laboratory.³ (Testing was performed using only American English, since SQEG was not able to mobilize enough resources to conduct an internationally coordinated testing effort, which is usually conducted in at least two languages per experiment.) The results indicated that two coders appeared to achieve the best quality. A second selection phase test was performed in French by France Telecom/CNET in July 1994. For this second test, the candidates supplied a lower-bit-rate coder as well. Results from this test were also inconclusive in deciding between the two coders. In October 1994 the proponents agreed to create an optimized dual-rate coder. This coder was tested in late December 1994, and the results were reported at an Experts Group meeting in January 1995. The first version of the Draft Recommendation was available for the February 1995 meeting of SG15. Bit-exact, fixed-point C source code was made available to ITU members in March 1995. The coder was put forward for decision by SG15 in November 1995.

Table 3. Speech quality performance and objectives for G.723.1.

G.723.1 is a good example to illustrate a different set of trade-offs among bit rate, quality, complexity, and delay. Low-bit-rate videophones only send a few frames per second over a telephone bandwidth modem; consequently, they have relatively high delays. The speech coding experts felt that a frame size of 30 ms would be acceptable. Since most of the bits in the bitstream are to be used for video, it was important to use as few bits for speech as possible. 8 kb/s was set as an upper limit for the speech coder, but the highest rate of any of the five candidates tested was 6.8 kb/s. As in the case of G.729, lower complexity than that of G.728 was desired and achieved. About 16 MIPS and 2200 words of RAM are currently needed to implement G.723.1.
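One way to make the bit-rate side of this trade-off concrete is to count how many bits each 30 ms frame may carry at a given rate. The sketch below is plain arithmetic on the rates quoted above, not a figure taken from the Recommendation itself, and the helper name `bits_per_frame` is hypothetical.

```python
def bits_per_frame(bit_rate_bps: int, frame_ms: int) -> float:
    """Number of bits available to the speech coder in one frame."""
    return bit_rate_bps * frame_ms / 1000


# Upper limit set for the coder and the highest-rate candidate tested,
# both evaluated for the 30 ms frame size discussed in the text.
upper_limit_bits = bits_per_frame(8000, 30)
highest_candidate_bits = bits_per_frame(6800, 30)

print(upper_limit_bits)        # 240.0
print(highest_candidate_bits)  # 204.0
```

Every bit not spent on speech in a frame is, under this simple view, a bit left over for video.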

Due to the nature of the standards process for G.723.1, formal Terms of Reference were never agreed upon. However, Table 3 shows the performance requirements tested for G.723.1 and the actual results achieved at 6.3 kb/s. We see that performance for two encodings, random frame erasures, input levels, noisy speech, and signaling tones were considered. Performance for random bit errors and bursty frame erasures was not tested because these are not typical of the communications channel provided by a wireline telephone bandwidth modem. The coder met all objectives except for music and office noise backgrounds.

³ The role of the host laboratory is to do all of the processing for formal subjective tests. Its role is described in greater detail in [4].

Comparing the overall performance of G.723.1 at 6.3 kb/s and G.729, there is not a great deal of difference; nor is there much difference in complexity. The real differences are in bit rate and delay. G.723.1, with its 30 ms frame size and 7.5 ms lookahead, has a one-way total codec delay of 67.5 ms compared with 25 ms for G.729. If an additional frame is added for transmission delay, this becomes 97.5 ms versus 35 ms.

G.723.1 and G.729 share the distinction of being the first ITU speech coder recommendations to be specified in bit-exact fixed-point C code. In fact, they share some of the same basic operator functions that describe standard arithmetic most fixed-point DSP chips can easily implement. This means of specification greatly reduces the burden on DSP implementers from the standpoint of both writing their software and verifying its correctness.
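The delay figures quoted above for G.723.1 and G.729 can be reproduced with a simple rule of thumb: one frame to buffer the input, one frame budgeted for processing, plus the lookahead. The sketch below is illustrative only. The rule itself is an assumption that happens to match the quoted numbers, the names `FrameTiming`, `codec_delay_ms`, and `delay_with_transmission_ms` are hypothetical, and the 10 ms frame and 5 ms lookahead used for G.729 come from the G.729 Recommendation rather than from this passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameTiming:
    """Frame size and lookahead of a frame-based speech coder, in ms."""
    name: str
    frame_ms: float
    lookahead_ms: float


def codec_delay_ms(t: FrameTiming) -> float:
    # Assumed rule: one frame of buffering + one frame of processing + lookahead.
    return 2 * t.frame_ms + t.lookahead_ms


def delay_with_transmission_ms(t: FrameTiming) -> float:
    # Adds one more frame to account for transmission, as in the text.
    return codec_delay_ms(t) + t.frame_ms


# Frame size and lookahead for G.723.1 are quoted in the article.
G723_1 = FrameTiming("G.723.1", frame_ms=30.0, lookahead_ms=7.5)
# Assumption: G.729 values are taken from the Recommendation, not this passage.
G729 = FrameTiming("G.729", frame_ms=10.0, lookahead_ms=5.0)

for timing in (G723_1, G729):
    print(
        f"{timing.name}: codec delay {codec_delay_ms(timing):.1f} ms, "
        f"with one transmission frame {delay_with_transmission_ms(timing):.1f} ms"
    )
```

Under these assumptions the sketch gives 67.5 ms and 97.5 ms for G.723.1, and 25 ms and 35 ms for G.729, in agreement with the figures in the text.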

G.723.1 also has three Annexes that are significant milestones for the ITU. Annex A describes a single-user silence compression scheme that can be used with G.723.1 in order to reduce the bit rate further so that bits can be released to the digital channel for other uses, such as video coding or transmitting data. This annex describes both a voice activity detector and a comfort noise generator. This is the first ITU speech coder to include these modules. (Their inclusion brings the bit rates to 6.4 and 5.33 kb/s, respectively.) Annex B describes an interoperable floating point version of the coder. This code can be used to implement G.723.1 on a host processor, such as a PC. Intel has recently announced an “Internet telephone” based on G.723.1 and using other ITU Recommendations which were part of the ITU low-bit-rate videophone effort. Finally, Annex C describes a scalable channel coder that can be used with G.723.1 for transmission over noisy channels, such as wireless. Scalable means that the channel coding rate can be adjusted to match the channel bandwidth available. It uses punctured convolutional coding to provide unequal error protection: the most protection for the most sensitive bits. This is also the first time the ITU has specified a channel error protection scheme for a speech coder.

IEEE Communications Magazine September 1997 45

Table 4. Comparison of attributes of G.723.1, G.729, and G.729 Annex A.

G.729 ANNEX A

If G.723.1 was considered to have a fast-track schedule, the schedule for the digital simultaneous voice and data (DSVD) speech coder was even more hurried. SG14 had been following the progress of G.723.1 and was considering it for use in simultaneous voice and data modems. However, they reached the conclusion that the delay of G.723.1 was too great for general use, especially when the application would involve bridging. (ITU-T Recommendation V.34 modems typically introduce about 35 ms delay in transmitting data; this would bring the total one-way delay for G.723.1 to over 130 ms.) In November 1994 SG14 decided to request that SG15 create a DSVD coder. The liaison statement to SG15 was drafted and sent in January 1995. SG15 agreed to comply because the effort could be based on an existing ITU standard. SG15 finalized Terms of Reference at its February 1995 meeting. No reduction in the performance requirements of G.729 was permitted. The complexity was limited to 2000 words of RAM and 10 MIPS. In order to reduce complexity, the frame size was permitted to be as large as 20 ms if there was no lookahead. The Terms of Reference also included a very aggressive schedule: having a Draft Recommendation ready for determination by the November 1995 SG15 meeting, and the Recommendation ready for decision at the May 1996 SG15 meeting.

In the schedule, there was time for one selection phase test carried out in two languages, English and Japanese. NTT served as host laboratory and one of the testing laboratories. COMSAT Laboratories was the other test laboratory. Rather than actual hardware (as in G.729), the coders were submitted as PC executables (as in G.723.1). As in the case of G.723.1, the results showed that two coders were most promising and delivered approximately equivalent performance. One was based on a low-delay version of the same algorithms used in G.723.1. The other was bitstream-compatible with G.729. At the SG15 meeting in November 1995, the consensus of the delegates was that the G.729-based coder should be used for the DSVD coder. It was subsequently agreed that the G.729 modifications contained in the Draft Recommendation would be submitted for determination as Annex A to G.729. Annex A of G.729 was put forward for decision in May 1996.

Although the Annex A coder had the most aggressive schedule of all (6 months from finalizing the Terms of Reference to forwarding the Recommendation for final ballot), it did not require many new algorithmic contributions or much new testing because it was a derivative of G.729. This was an example of a standards group making a trade-off to favor shortening the schedule as much as possible without compromising quality. The ITU-T’s predecessor, the CCITT, behaved similarly about a decade earlier in creating the original Recommendation G.723, which described 24 and 40 kb/s ADPCM coders based on its original 32 kb/s ADPCM Recommendation G.721. As such this represented a good trade-off from the standpoint of meeting marketplace needs while still maintaining its own reputation for well-tested, well-written, verifiable standards.

The G.729 Annex A coder contains a number of shortcuts compared to the original G.729 coder. These were made to save both MIPS and RAM. They do cause a small loss in performance. It is measurable, but in general is not statistically significant. G.729 Annex A is intended for applications where the additional complexity of G.729 or G.723.1 would cause a significant burden. For example, G.729 Annex A will use fewer cycles and less RAM on a DSP than G.729. For the simultaneous voice and data modems that SG14 was interested in standardizing, this is a significant consideration. (If, in the future, complexity is no longer an issue, G.729 Annex A provides a clear upgrade path to using G.729 for both users and network operators.)

There is also a G.729 Annex B. It describes a voice activity detector and comfort noise generator to be used for silence compression with either G.729 or G.729 Annex A. Annex B was put forward for determination by SG15 in June 1996 and was approved in October 1996 by the World Telecommunication Standardization Conference.

When it is completed, Annex C of G.729 will describe an interoperable floating point version of both G.729 and G.729 Annex A. This could permit implementation of a lower-delay Internet telephone than that based on G.723.1. Annexes D and E will deal with extension of the G.729 algorithm to higher and lower bit rates. This work is still in its preliminary steps.

COMPARING G.729, G.723.1, AND G.729 ANNEX A

Table 4 is a comparison of the three coders on the basis of the attributes discussed earlier. Since quality is an attribute with so many dimensions, we discuss it briefly here. The clean input and clear channel quality of the three coders are very similar for a single encoding. This judgment is based on their relative scores compared to 32 kb/s G.726. No marked differences have been observed for tandems, noisy backgrounds, input level, or speaker dependency. However, only G.729 has been extensively tested by the ITU-T. It is the feeling of many speech experts that the performance of G.729 is marginally better in several of these cases, particularly noisy backgrounds and tandems. Generally, though, the applications engineer has the luxury of choosing between these coders based on bit rate, complexity, and delay.
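As a rough illustration of choosing a coder on bit rate, complexity, and delay, the sketch below filters candidates against an application's limits. It uses only figures quoted in this article for G.723.1 (6.4 kb/s, 67.5 ms codec delay, 16 MIPS, 2200 words of RAM) and G.729 Annex A (10 MIPS, 2000 words of RAM). The 8 kb/s rate and 25 ms codec delay assigned to Annex A are assumptions carried over from G.729, with which it is bitstream-compatible. G.729 itself is left out because its MIPS and RAM figures are not given in this part of the text. The names `CoderAttributes` and `candidates` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CoderAttributes:
    name: str
    bit_rate_kbps: float
    codec_delay_ms: float
    mips: float
    ram_words: int


CODERS = [
    # Figures quoted in the article.
    CoderAttributes("G.723.1", 6.4, 67.5, 16.0, 2200),
    # Bit rate and delay assumed equal to G.729 (bitstream-compatible).
    CoderAttributes("G.729 Annex A", 8.0, 25.0, 10.0, 2000),
]


def candidates(max_bit_rate_kbps: float, max_delay_ms: float,
               max_mips: float, max_ram_words: int) -> List[str]:
    """Return the names of coders that fit within all four limits."""
    return [
        c.name
        for c in CODERS
        if c.bit_rate_kbps <= max_bit_rate_kbps
        and c.codec_delay_ms <= max_delay_ms
        and c.mips <= max_mips
        and c.ram_words <= max_ram_words
    ]


# A DSVD-style budget: 10 MIPS and 2000 words of RAM.
print(candidates(max_bit_rate_kbps=8.0, max_delay_ms=100.0,
                 max_mips=10.0, max_ram_words=2000))
# A videophone-style budget: speech limited to 6.8 kb/s, delay less critical.
print(candidates(max_bit_rate_kbps=6.8, max_delay_ms=100.0,
                 max_mips=20.0, max_ram_words=4000))
```

With these limits the first query selects only G.729 Annex A and the second only G.723.1, which mirrors the applications each coder was designed for.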

G.723.1 has several advantages with respect to bit rate: it has the lowest bit rate (6.4 kb/s), and it has a lower-bit-rate option (5.33 kb/s). G.729 and G.729 Annex A have the advantage of lower delay compared to G.723.1. Since the inherent delay of G.723.1 is relatively large, any significant amount of additional delay will begin to adversely affect conversational dynamics. In terms of complexity, G.729 Annex A has the advantage, requiring only 10 MIPS and 2000 words of RAM. Finally, in terms of availability, G.723.1 has a small advantage in that its floating point version is already a standard. Its silence compression scheme was completed and available first, and it has a channel error protection scheme.
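The 6.4 and 5.33 kb/s figures for G.723.1 can be related to its 30 ms frame size. The sketch below assumes that each frame is packed into a whole number of octets, 24 for the higher rate and 20 for the lower rate. These octet counts come from the G.723.1 frame format and are not stated in this article. The function name `channel_rate_kbps` is hypothetical.

```python
FRAME_MS = 30  # G.723.1 frame size quoted in the article


def channel_rate_kbps(octets_per_frame: int, frame_ms: int = FRAME_MS) -> float:
    """Bits per millisecond, which is numerically the rate in kb/s."""
    return octets_per_frame * 8 / frame_ms


# Assumed octet counts per 30 ms frame (from the G.723.1 frame format).
HIGH_RATE_OCTETS = 24
LOW_RATE_OCTETS = 20

print(f"high rate: {channel_rate_kbps(HIGH_RATE_OCTETS):.2f} kb/s")
print(f"low rate:  {channel_rate_kbps(LOW_RATE_OCTETS):.2f} kb/s")
```

Under that assumption, 192 bits every 30 ms gives 6.4 kb/s and 160 bits every 30 ms gives about 5.33 kb/s, slightly above the nominal 6.3 and 5.3 kb/s coder rates.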

In the marketplace, G.723.1 is used in the Intel Internet phone. G.729 was modified for digital cellular telephony by Nokia and became the basis for IS-641, the enhanced full-rate TDMA digital cellular coder to be used in North America and elsewhere. G.729 and G.729 Annex A are being implemented by modem manufacturers for use in simultaneous voice and data modems. Just how widely used these coders will become will only be known in the future.

IMPLICATIONS FOR FUTURE STANDARDS

Besides finishing Annex C of G.729, the ITU has two new speech coding questions currently in process. The experience it gained with these three coders has had an impact on these future standards. The new questions under study are for a toll-quality 4 kb/s telephone bandwidth speech coder and a multiple-bit-rate 7 kHz bandwidth speech coder. The very first impact is that both of these coders will be specified in bit-exact fixed-point C code. Everyone agrees that this is the most preferable way to specify speech coders currently.

The Terms of Reference for the 4 kb/s speech coder are very similar to those of G.729. The requirements involve matching G.729 performance for background noise. The requirement for music was relaxed from “no annoying artifacts” to “should sound like music.” The Terms of Reference for the 7 kHz speech coder were discussed for several SG15 meetings. Ultimately, they ended in a compromise: there will be up to three bit rates. At 32 kb/s the performance of the coder should meet or exceed that of G.722 operating at 64 kb/s. At 24 kb/s the performance should match that of G.722 at 56 kb/s. At 16 kb/s the requirement is to match that of G.722 at 48 kb/s with a band-limited 5 kHz signal. A high-quality, low-bit-rate 7 kHz speech coder would be best featured in hands-free operation. This is because a listener needs to hear a wideband signal with both ears in order to fully appreciate the increased bandwidth. A telephone handset held to one ear does not convey the richness of the extra bandwidth. In this context, hands-free operation means a speakerphone. One of the problems with speakerphones is that they have a lower input signal-to-noise ratio for speech. This means that background noise performance is very important for these coders. Work on both these questions is in progress within the ITU-T.

REFERENCES

[1] W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis, Elsevier Science, 1995.

[2] N. O. Johannesson, “The ETSI Computation Model: A Tool for Transmission Planning of Telephone Networks,” IEEE Commun. Mag., Jan. 1997, pp. 70-79.

[3] M. E. Perkins et al., “Characterizing the Subjective Performance of the ITU-T 8 kb/s Speech Coding Algorithm (ITU-T Rec. G.729),” this issue.

[4] G. Schroeder and M. H. Sherif, “The Road to G.729: The ITU 8 kb/s Speech Coding Algorithm with Wireline Quality,” this issue.

BIOGRAPHY

RICHARD V. COX [F] ([email protected]) received the B.S. degree from Rutgers University, and the M.A. and Ph.D. from Princeton University, all in electrical engineering. He joined Bell Laboratories in 1979 and has worked in various aspects of speech and audio coding, speech privacy, digital signal processing, combined speech and channel coding for noisy channels, and real-time signal processing implementations. He has been active in the creation of speech coding standards for digital cellular telephony and the toll network. He was one of the creators of ITU-T Recommendation G.728 and the editor for ITU-T Rec. G.723.1. In 1987 he was appointed supervisor of the Digital Principles Research Group, and in 1992 he was appointed head of the Speech Coding Research Department. In AT&T Labs Research he is currently division manager of the Speech Processing Software and Technology Research Department with responsibility for speech and audio coding, text-to-speech synthesis, and human hearing research. He is active in the IEEE Signal Processing Society. He is a past chairman of the Speech Technical Committee, served on the AdCom and Board of Governors for six years, and as treasurer/vice president finance. He is currently vice president for publications.
