4. INTRODUCTION TO DIFFERENT SPEECH CODERS

4.1 INTRODUCTION TO PROPERTIES AND PRODUCTION OF SPEECH

Speech is an excellent way to communicate with other people. Speech is an acoustic sound wave that travels from the speaker's vocal organs to the listener's ears. The smallest posited structural unit of speech is the phoneme. Phonemes can be divided into two groups, i.e. voiced and unvoiced phonemes. Voiced phonemes are periodic in the time domain and harmonic in the frequency domain, whereas unvoiced phonemes are noise-like, without any periodicity. The peaks in the spectrum of a voiced phoneme are the resonance frequencies of the vocal tract and are called formants. It is desirable to have the higher formants included while transmitting the speech signal.

Speech is produced by a filtering operation formed by the larynx, pharynx, and the oral and nasal cavities [24]. The vocal tract is a physiological filter that shapes the stimulus coming from the lungs; different sounds are formed by changing the filtering characteristics, which depend on the position of the tongue and lips. Voiced sound production starts in the lungs: the diaphragm muscles compress the lungs and cause overpressure in the trachea, and the vocal cords start to vibrate at the so-called fundamental frequency, about 100-110 Hz for males and about 200 Hz for females. The main difference between voiced and unvoiced sounds is that voiced sounds have greater amplitude, while unvoiced sounds are formed in a narrow or closed part of the vocal tract and resemble random noise.

One way to represent speech production is the simplified source-filter model of speech shown in Figure 4.1. Such a model is used to produce synthetic speech. Voiced sounds are produced from the glottal stimulus and unvoiced sounds from the noise source. Both stimuli are connected to a binary switch, whose output feeds a linear filter representing the vocal tract [24]. The gain G balances the speech signal energy for every stimulus and filter combination. In voiced speech the vocal tract is excited by a periodic pulse train whose period is called the pitch period. The short-time spectrum of voiced speech is characterized by its fine structure and its formant structure; the formant structure is due to the interaction of the source and the vocal tract. The spectral envelope is characterized by a set of peaks, the formants. The first three formants usually occur below 3 kHz; their locations are important in speech perception and can determine which type of sound was produced.

Fig. 4.1: Source-filter model. [A voiced/unvoiced (V/UV) switch selects between a pitch-period impulse train and a white-noise innovations source; the selected excitation U(n), scaled by the gain G, drives the LPC vocal-tract filter H(z) to produce the speech signal S(n).]
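To make the model concrete, the following sketch (Python with NumPy/SciPy; the 8 kHz rate, the single-formant filter and all names are illustrative assumptions, not taken from the text) switches between an impulse-train and a noise excitation and passes the gain-scaled result through an all-pole filter H(z) = 1/A(z):

```python
import numpy as np
from scipy.signal import lfilter

FS = 8000  # sampling rate (Hz); narrow-band speech is typically sampled at 8 kHz

def excitation(voiced, n_samples, pitch_hz=100.0):
    """Source of the source-filter model: impulse train (voiced) or white noise (unvoiced)."""
    if voiced:
        u = np.zeros(n_samples)
        period = int(FS / pitch_hz)          # pitch period in samples
        u[::period] = 1.0                    # periodic glottal impulses
        return u
    return np.random.randn(n_samples)        # noise-like unvoiced stimulus

def synthesize(voiced, a, gain=1.0, n_samples=1600):
    """Filter the gain-scaled excitation with the all-pole vocal-tract model H(z) = 1/A(z).

    `a` holds the coefficients a_1..a_p of A(z) = 1 - sum_i a_i z^-i,
    so the filter denominator is [1, -a_1, ..., -a_p].
    """
    u = excitation(voiced, n_samples)
    return lfilter([gain], np.concatenate(([1.0], -a)), u)

# Illustrative 2nd-order vocal-tract filter with one resonance (formant) near 500 Hz.
r, f0 = 0.97, 500.0
theta = 2 * np.pi * f0 / FS
a = np.array([2 * r * np.cos(theta), -r * r])   # poles at r * exp(+/- j*theta)

voiced_seg = synthesize(True, a, gain=0.1)      # buzzy, periodic segment
unvoiced_seg = synthesize(False, a, gain=0.01)  # hiss-like segment
```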

4.2 EVALUATION OF SPEECH CODERS

Speech communication is at present the most dominant and common service in telecommunication networks. The attractions of digitally encoded speech are obvious: since digitally encoded speech ultimately condenses down to a binary sequence, all the advantages offered by digital systems are available for exploitation. Digitally encoded speech signals are easy to regenerate, easy to signal, highly flexible and secure, and can be integrated into the integrated services digital network [ISDN]. Digitally encoded speech has many advantages over its analog counterpart, but the digital signal requires extra bandwidth; this disadvantage can be overcome using speech compression techniques [25]. Speech encoding is defined as a digital representation of the speech sound that provides efficient storage, transmission, recovery and faithful reconstruction of the speech signal. Speech coding has become an intensive area of research. All speech coding systems involve lossy compression, where the reconstructed speech signal is not an exact replica of the original and hence suffers some degradation in quality. As the complexity of the algorithm increases, the implementation cost increases; hence the designer of a communication system must strike a balance between cost and quality.

Speech coding techniques are evaluated by their transmission rate, implementation complexity, coding delay, robustness to channel noise and implementation cost. The most important criterion is the quality of the reconstructed signal. Speech quality can be measured using subjective measures or objective measures [8].

Subjective measurements are obtained from listening tests. Speech quality is the result of a subjective perception-and-judgment process, during which a listener compares the perceptual event (the speech signal heard) to an internal reference of what is judged to be good quality. Subjective assessment plays a key role in characterizing the quality of emerging telecommunications products and services, as it attempts to quantify the end user's experience with the system under test. Typical subjective measures are:

Mean opinion score [MOS],

Diagnostic rhyme test [DRT], and

Paired comparison test [PCT].

The MOS is the most widely used measure. Listeners are asked to rate the quality of a speech signal on a five-point scale, with 1 corresponding to unsatisfactory speech quality and 5 corresponding to excellent speech quality, and the results are averaged over the group of listeners. This average of the listener scores is termed the subjective listening MOS or, as suggested by the ITU-T Recommendation, MOS-LQS (listening quality subjective). Formal subjective tests, however, are expensive and time consuming, and thus unsuitable for "on-the-fly" applications.

The MOS grades are interpreted as follows:

If MOS is 5, the obtained speech quality is excellent.

If MOS is 4, the obtained speech quality is good.

If MOS is 3, the obtained speech quality is fair, with noticeable impairments.

If MOS is 2, the obtained speech quality is poor, with strong impairments.

If MOS is 1, the obtained speech quality is bad and highly degraded.

In the DRT, experienced listeners are asked to distinguish between pairs of single-syllable words such as meat and beat. The DRT is a widely used method that provides valuable diagnostic information on how well the initial consonant is recognized, and it is very useful as a development tool. However, it does not test vowels or prosodic features, so it is not suitable for overall quality evaluation. Another deficiency is that the test material is quite limited and the test items do not occur with equal probability, so it does not test all possible confusions between consonants; the confusions, presented as matrices, are therefore hard to evaluate [8]. In the PCT, the listener is asked to choose a coder output from a pair of coders.

Objective speech quality measurement replaces the listener panel with a computational algorithm, thus facilitating automated real-time quality measurement. Indeed, for the purpose of real-time quality monitoring and control on a network-wide scale, objective speech quality measurement is the only viable option. Objective measurement methods aim to deliver quality estimates that are highly correlated with those obtained from subjective listening experiments. Objective quality measurement can be classified as either signal based or parameter based. The widely used objective measures are based on the mean squared error, and the most popular is the signal to noise ratio (SNR):

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=0}^{M-1} S^{2}(n)}{\sum_{n=0}^{M-1} \left[ S(n) - \hat{S}(n) \right]^{2}} \qquad (4.1)$$

where S(n) is the original speech data, Ŝ(n) is the coded speech data, and M is the number of samples. The SNR is a measure of the accuracy of the reconstructed speech signal. The segmental SNR (SEGSNR) is defined as the dB average of the short-time SNRs.
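As an illustration, a minimal computation of Eq. 4.1 and of the segmental SNR might look as follows (Python/NumPy; the 160-sample frame, i.e. 20 ms at 8 kHz, is an assumed choice):

```python
import numpy as np

def snr_db(s, s_hat):
    """Global SNR of Eq. 4.1: signal energy over coding-error energy, in dB."""
    err = s - s_hat
    # Small epsilon guards against division by zero for a perfect reconstruction.
    return 10.0 * np.log10(np.sum(s ** 2) / (np.sum(err ** 2) + 1e-12))

def segsnr_db(s, s_hat, frame=160):
    """Segmental SNR: the dB average of short-time SNRs over fixed-length frames."""
    snrs = [snr_db(s[k:k + frame], s_hat[k:k + frame])
            for k in range(0, len(s) - frame + 1, frame)]
    return float(np.mean(snrs))

# Example: a lightly corrupted sinusoid scores a finite, positive SNR.
s = np.sin(2 * np.pi * 440 / 8000 * np.arange(1600))
s_hat = s + 0.01 * np.random.randn(len(s))
print(snr_db(s, s_hat), segsnr_db(s, s_hat))
```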

4.3 SPEECH CODING METHODOLOGY

Speech coders represent an analog signal by a sequence of binary digits. The simplest codec consists of a sampler and a quantizer, wherein each sample is represented by a digital code. Coding algorithms seek to minimize the bit rate in the digital representation of a signal without an objectionable loss of signal quality in the process. High quality is attained at low bit rates by exploiting signal redundancy as well as the knowledge that certain types of coding distortion are imperceptible because they are masked by the signal. Speech coding schemes can be broadly classified into three main classes:

Waveform coders

Hybrid coders

Vocoders

4.3.1 Waveform Coders

Waveform coders are low-complexity codecs. They are signal independent and work well with both speech and non-speech signals. Waveform coders are characterized by their attempt to preserve the general shape of the signal waveform, and they can work well on any input waveform bounded by certain limits in amplitude and bandwidth. These coders produce high-quality speech at rates above 16 kbits/s; when the data rate is lowered below this level, the reconstructed speech quality degrades rapidly. The different types of waveform coders are described below.

4.3.1.1 Pulse code modulation [PCM]: This merely involves sampling and quantization of the input speech signal. Narrow-band speech is typically band-limited to 4 kHz and sampled at 8 kHz. If linear quantization is used, around twelve bits per sample are needed to give good-quality speech, giving a bit rate of 96 kbits/s. This bit rate can be reduced by using non-uniform quantization of the samples; in speech coding an approximation to a logarithmic quantizer is often used. Such quantizers give a signal to noise ratio that is almost constant over a wide range of input levels and, at a rate of eight bits/sample (i.e. 64 kbits/s), give a reconstructed signal that is almost indistinguishable from the original.
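A minimal sketch of such logarithmic quantization, assuming the μ-law characteristic with μ = 255 (this illustrates only the companding idea, not the exact G.711 bit layout):

```python
import numpy as np

MU = 255.0  # mu-law parameter conventionally used with 8-bit PCM

def mulaw_encode(x):
    """Compress a signal in [-1, 1] with the mu-law characteristic, then quantize to 8 bits."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)   # logarithmic companding
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)  # 256 uniform levels

def mulaw_decode(codes):
    """Invert the 8-bit quantization and the mu-law compression."""
    y = codes.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

x = 0.5 * np.sin(2 * np.pi * 300 / 8000 * np.arange(800))
x_hat = mulaw_decode(mulaw_encode(x))   # 8 kHz * 8 bits/sample = 64 kbits/s
```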

4.3.1.2 Adaptive pulse code modulation [APCM]: A PCM scheme in which the quantizer adapts to the signal, typically by scaling its step size to follow the short-term level of the input, so that both loud and quiet passages are coded with adequate relative precision.

4.3.1.3 Delta modulation [DM]: This is the simplest form of DPCM. In this codec the difference between successive samples is encoded using only one bit of quantization; with one-bit quantization the differences are coded into two levels only. The quantizer in DM is realized with a comparator whose output is 1 or 0, and the demodulator is a simple integrator. The two sources of noise in delta modulation are "slope overload", which occurs when the steps are too small to track the original waveform, and "granularity", which occurs when the steps are too large to track the original waveform.
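A minimal delta-modulation sketch (the step size is an illustrative assumption; shrinking it provokes slope overload, enlarging it provokes granular noise):

```python
import numpy as np

def dm_encode(x, step=0.05):
    """1-bit delta modulation: transmit the sign of (input - running approximation)."""
    approx, bits = 0.0, []
    for sample in x:
        bit = 1 if sample >= approx else 0      # two-level quantizer (comparator)
        bits.append(bit)
        approx += step if bit else -step        # integrator tracks the input
    return bits

def dm_decode(bits, step=0.05):
    """The decoder is a simple integrator driven by the received bit stream."""
    approx, out = 0.0, []
    for bit in bits:
        approx += step if bit else -step
        out.append(approx)
    return np.array(out)

x = np.sin(2 * np.pi * 200 / 8000 * np.arange(400))
x_hat = dm_decode(dm_encode(x))
```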

4.3.1.4 Adaptive differential pulse code modulation [ADPCM]: This codec quantizes the difference between the speech signal and a prediction that has been made of the speech signal. If the prediction is accurate, the difference between the real and predicted speech samples has a lower variance than the real speech samples, and this difference signal can be accurately quantized with fewer bits than would be needed to quantize the original speech samples. At the decoder the quantized difference signal is added to the predicted signal to give the reconstructed speech signal. The performance of the codec is aided by using adaptive prediction and quantization, so that the predictor and difference quantizer adapt to the changing characteristics of the speech being coded. This codec is standardized as G.721 and gives very good quality speech at 32 kbits/s.

4.3.1.5 Differential pulse code modulation [DPCM]: When a signal is sampled at the Nyquist rate the obtained samples are still correlated and therefore carry redundant information. DPCM is specifically designed to take advantage of these sample-to-sample redundancies in typical speech waveforms: it predicts the next sample from the previously decoded samples and codes only the prediction error. Good prediction results in a reduction in the dynamic range needed to code the prediction residual and hence a reduction in the bit rate.
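The following sketch illustrates the DPCM idea with a first-order predictor and a uniform residual quantizer (the predictor weight and step size are assumptions for illustration; real codecs adapt both, as in ADPCM above):

```python
import numpy as np

def dpcm(x, a1=0.9, qstep=0.02):
    """First-order DPCM: quantize the prediction residual, not the sample itself.

    The encoder predicts each sample from the previously *decoded* sample so
    that encoder and decoder stay in lockstep (a1 is an illustrative weight).
    """
    pred, decoded = 0.0, []
    for sample in x:
        residual = sample - pred                    # small when prediction is good
        q_res = qstep * np.round(residual / qstep)  # uniform residual quantizer
        recon = pred + q_res                        # what the decoder reproduces
        decoded.append(recon)
        pred = a1 * recon                           # prediction of the next sample
    return np.array(decoded)

x = np.sin(2 * np.pi * 150 / 8000 * np.arange(800))
x_hat = dpcm(x)   # the residual has lower variance than x, so fewer bits suffice
```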

4.3.2 Vocoder

Vocoders are also known as voice coders. These devices take natural speech as their input and from it generate various acoustic parameters which usually take up less transmission bandwidth than the original speech. These parameters are then transmitted to a re-synthesis device that regenerates the speech. Vocoders are speech specific in their principles, as no attempt is made to preserve the original speech waveform. A vocoder consists of an analyzer and a synthesizer: the analyzer at the transmitter extracts a small set of parameters from the speech signal to be transmitted, and at the receiver the speech is synthesized from these parameters. The speech signal produced is often crude, with less than toll quality. The different types of vocoders include:

LPC

Homomorphic

MBE

Channel, formant, phase and sinusoidal

RELP

4.3.3 Hybrid Coders

Hybrid coders attempt to fill the gap between waveform coders and vocoders. Waveform coders are capable of providing good quality speech at higher bit rates, but the signal deteriorates as the bit rate is reduced. Vocoders, on the other hand, can provide intelligible speech at 2.4 kbits/s and below, but cannot provide natural-sounding speech at any bit rate. To overcome the disadvantages of both, hybrid coding methods have been developed which incorporate the advantages offered by each of the above schemes. Hybrid coders are broadly classified into two sub-categories:

Frequency domain hybrid coders

Time domain hybrid coders

4.3.3.1 Frequency domain hybrid coders: The basic concept in frequency domain coding is to divide the speech spectrum into frequency bands or components using either a filter bank or a block transform analysis. After encoding and decoding, these frequency components are used to regenerate a replica of the input waveform by either filter-bank summation or the inverse transform method. A primary assumption in frequency domain coding is that the signal to be coded is slowly time varying, so that it can be locally modeled by its short-time spectrum. In frequency domain coding a block of speech can thus be represented by a filter bank or by a block transformation. The two well-known frequency domain speech coding techniques are:

Subband coding technique

Adaptive transform coding technique

4.3.3.1.1 Subband coding technique [SBC]:


Subband coding breaks the signal into a number of different frequency bands and encodes each of them independently. It is generally viewed as a waveform coding technique which uses wide-band short-time analysis and synthesis. After partitioning the speech spectrum into a number of bands, each band is low-pass translated to zero frequency, sampled at its Nyquist rate, quantized, encoded, multiplexed and transmitted. At the receiver the subbands are demultiplexed, decoded and translated back to their original frequency positions, and the resulting subband signals are summed to give an approximation of the original speech signal. SBC exploits a deficiency of the human auditory system: the ear is normally sensitive to a wide range of frequencies, but when a sufficiently loud signal is present at one frequency, it will not hear weaker signals at nearby frequencies. We say that the louder signal masks the softer ones; the louder signal is called the masker, and the point at which masking occurs is known as the masking threshold. The basic idea of SBC is to enable a data reduction by discarding information about frequencies which are masked. The result differs from the original signal, but if the discarded information is chosen carefully, the difference will not be noticeable or, more importantly, objectionable. The speech spectrum can be split into the desired number of bands using several techniques (a minimal two-band illustration follows the list):

Integer band sampling

Tree-structured quadrature mirror filters

Discrete cosine transform

Parallel filter banks
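As a minimal illustration of the subband idea, the sketch below uses a simple Haar-style two-band split (not the quadrature mirror filters named above) and spends a finer quantizer step on the perceptually dominant low band:

```python
import numpy as np

def analysis(x):
    """Split into a low band (pairwise averages) and a high band (differences), each at half rate."""
    x = x[: len(x) // 2 * 2]                 # even length for pairwise processing
    low = (x[0::2] + x[1::2]) / 2.0
    high = (x[0::2] - x[1::2]) / 2.0
    return low, high

def synthesis(low, high):
    """Recombine the two decimated bands into a full-rate signal (perfect reconstruction)."""
    x = np.empty(2 * len(low))
    x[0::2] = low + high
    x[1::2] = low - high
    return x

def quantize(band, qstep):
    return qstep * np.round(band / qstep)

x = np.sin(2 * np.pi * 300 / 8000 * np.arange(800))
low, high = analysis(x)
# More precision for the low band, less for the high band.
x_hat = synthesis(quantize(low, 0.01), quantize(high, 0.05))
```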

4.3.3.1.2 Adaptive transform coder [ATC]:


This is a complex frequency-analysis technique involving the block transformation of windowed segments of the input speech. Each segment is represented by a set of transform coefficients, which are inverse transformed at the receiver to produce a replica of the original speech; adjacent segments are joined together to form the synthesized speech. ATC has better frequency resolution than the subband coding technique, but the number of bits required to encode the data is larger.
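A minimal transform-coding sketch along these lines (DCT per block; keeping only the strongest coefficients stands in for the adaptive bit allocation, and the block length and quantizer step are assumptions):

```python
import numpy as np
from scipy.fft import dct, idct

def atc_frame(frame, keep=32, qstep=0.01):
    """Transform-code one segment: DCT, keep/quantize the strongest bins, inverse DCT."""
    coeffs = dct(frame, type=2, norm='ortho')
    weakest = np.argsort(np.abs(coeffs))[:-keep]   # indices of the weakest coefficients
    coeffs[weakest] = 0.0                          # crude stand-in for bit allocation
    coeffs = qstep * np.round(coeffs / qstep)      # uniform quantization of survivors
    return idct(coeffs, type=2, norm='ortho')

x = np.sin(2 * np.pi * 440 / 8000 * np.arange(256))
# Adjacent coded segments are simply joined to form the synthesized speech.
x_hat = np.concatenate([atc_frame(x[i:i + 128]) for i in range(0, 256, 128)])
```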

The advantage of frequency domain coders is their exploitation of the non-flat spectral density of the speech signal, which allows unequal quantization to be applied to the different frequency bands.

4.3.3.2 Time domain hybrid coders: Time domain hybrid coders are dominated by schemes employing linear predictors. The statistical characteristics of speech signals can be accurately modeled by a source-filter model, which assumes that speech results from exciting a linear time-varying filter with a periodic pulse train for voiced speech or a random noise source for unvoiced speech. These coders can be classified as analysis-by-synthesis (AbS) LPC coders, in which the system parameters are determined by linear prediction and the excitation sequence is determined by a closed-loop or open-loop optimization. The optimization process determines an excitation sequence which minimizes a measure of the weighted difference between the input speech and the coded speech, where the weighting or filtering function is chosen so that the coder is optimized for the human ear. The most commonly used excitation models for AbS LPC are multipulse, regular pulse excitation, and vector or code excitation. Since these methods combine the features of model-based vocoders, by representing the formant and pitch structure of speech, with the properties of waveform coders, they are called hybrid. Although other forms of hybrid codec exist, the most successful and commonly used are time domain analysis-by-synthesis (AbS) codecs. Such coders use the same linear prediction filter model of the vocal tract as found in LPC vocoders; however, instead of applying a simple two-state voiced/unvoiced model to find the necessary input to this filter, the excitation signal is chosen by attempting to match the reconstructed speech waveform as closely as possible to the original speech waveform. A general model for AbS codecs is shown in Figure 4.2 below. AbS codecs work by splitting the input speech into frames, typically about 20 ms long. For each frame, parameters are determined for a synthesis filter, and then the excitation to this filter is determined by finding the excitation signal which, when passed through the given synthesis filter, minimizes the error between the input speech and the reconstructed speech. Hence the name analysis-by-synthesis: the encoder analyses the input speech by synthesizing many different approximations to it.

Fig. 4.2: Analysis-by-synthesis codec structure. [In the encoder, an excitation generator produces U(n), which drives the synthesis filter to give the reconstructed speech Ŝ(n); the difference between the input speech S(n) and Ŝ(n) is passed through an error-weighting filter to give the weighted error e_w(n), which an error-minimization block uses to select the excitation. The decoder repeats the excitation generation and synthesis filtering to output the reproduced speech.]


Finally, for each frame, the encoder transmits information representing the synthesis filter parameters and the excitation to the decoder, where the given excitation is passed through the synthesis filter to give the reconstructed speech. The synthesis filter is usually an all-pole, short-term, linear filter of the form

$$H(z) = \frac{1}{A(z)} \qquad (4.2)$$

where

$$A(z) = 1 - \sum_{i=1}^{p} a_i \, z^{-i} \qquad (4.3)$$

In the above equation A(z) is the prediction error filter, determined by minimizing the energy of the residual signal produced when the original speech segment is passed through it. The order p of the filter is typically around ten. This filter is intended to model the correlations introduced into the speech by the action of the vocal tract. The synthesis filter may also include a pitch filter to model the long-term periodicities present in voiced speech. Alternatively, these long-term periodicities may be exploited by including an adaptive codebook in the excitation generator, so that the excitation signal U(n) includes a component of the form Gu(n-α), where α is the estimated pitch period. Generally, MPE and RPE codecs will work without a pitch filter, although their performance is improved if one is included; for CELP codecs, however, a pitch filter is extremely important, for reasons discussed below.

The error-weighting block is used to shape the spectrum of the error signal in order to reduce the subjective loudness of this error. This is possible because the error signal in frequency regions where the speech has high energy will be at least partially masked by the speech. The weighting filter emphasizes the error in the frequency regions where the speech content is low, so that minimizing the weighted error concentrates the energy of the error signal in the frequency regions where the speech has high energy. The error signal is therefore at least partially masked by the speech, and its subjective importance is reduced. Such weighting is found to produce a significant improvement in the subjective quality of the reconstructed speech for AbS codecs.
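One common realization of such a weighting filter is the bandwidth-expanded form W(z) = A(z/γ₁)/A(z/γ₂); this specific form and the γ values below are a conventional choice, not taken from the text, so the sketch is illustrative only:

```python
import numpy as np
from scipy.signal import lfilter

def weighting_filter(a, gamma1=0.9, gamma2=0.6):
    """Bandwidth-expanded weighting filter W(z) = A(z/g1) / A(z/g2).

    `a` = [1, -a_1, ..., -a_p] holds the prediction-error filter A(z).
    Scaling the i-th coefficient by gamma^i pulls the roots toward the origin,
    which de-emphasizes the error near the formant peaks of the speech.
    """
    powers = np.arange(len(a))
    return a * gamma1 ** powers, a * gamma2 ** powers   # numerator, denominator

# All-pole synthesis H(z) = 1/A(z) (Eq. 4.2) and the matching weighting filter.
a = np.array([1.0, -1.6, 0.8])                 # illustrative 2nd-order A(z)
speech = lfilter([1.0], a, np.random.randn(400))
b_w, a_w = weighting_filter(a)
toy_error = 0.05 * speech                      # stand-in for a coding error signal
weighted_error = lfilter(b_w, a_w, toy_error)  # what the AbS loop would minimize
```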

The distinguishing feature of AbS codecs is how the excitation waveform U(n) for the synthesis filter is chosen. Conceptually, every possible waveform is passed through the filter to see what reconstructed speech signal that excitation would produce; the excitation signal which gives the minimum weighted error between the original and the reconstructed speech is then chosen by the encoder and used to drive the synthesis filter at the decoder. It is this 'closed-loop' determination of the excitation signal which allows AbS codecs to produce good quality speech at low bit rates. However, the numerical complexity involved in passing every possible excitation signal through the synthesis filter is huge, so some means of reducing this complexity without compromising the performance of the codec is usually needed.

The time domain coders can be classified as:

Adaptive predictive coding [APC]

Residual excited linear predictive coding [RELP]

Multipulse linear predictive coding [MP-LPC]

Code excited linear predictive coding [CELP]

Vector sum excited linear predictive coding [VSELP]

4.3.3.2.1 Adaptive predictive coding [APC]


This coder employs both short-term and long-term linear predictors. The residual signal obtained after inverse filtering is scalar quantized on a sample-by-sample basis. The APC scheme was proposed for 16 kbits/s and below, with variations in the treatment of the residual signal.

4.3.3.2.2 Residual excited linear predictive coding [RELP]

This is basically APC in which only a portion of the low-frequency residual signal is transmitted. The motivation behind RELP is that the residual information is assumed to be concentrated in the low-frequency baseband, so encoding only this segment reduces the number of bits [25]. In RELP the basic LPC analysis yields the spectral coefficients, which are transmitted as side information and also used to inverse filter the speech signal to obtain the residual e(n). The baseband signal b(n), the low-frequency part of the residual, is extracted by a low-pass filter and waveform coded. The RELP receiver interpolates b(n) back to the original sampling rate Fs and attempts to reconstruct the original signal. Because the low-frequency signals are waveform coded, a very good quality speech signal is obtained in the baseband, but the high-frequency signals are artificially reconstructed, so the excitation there is very poor. RELP [28] is noisier than APC but sounds more natural. The main advantage of RELP is its ability to operate in a noisy environment, but its performance is limited at lower bit rates.
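A rough sketch of the RELP signal path described above (the decimation factor, filter order and LPC coefficients are illustrative assumptions, and the waveform coding of b(n) itself is omitted):

```python
import numpy as np
from scipy.signal import lfilter, butter, filtfilt

def relp_roundtrip(speech, a, decim=4):
    """Inverse filter, keep only the low-frequency baseband of the residual,
    then resynthesize from the interpolated baseband.

    `a` = [1, -a_1, ..., -a_p] is the LPC prediction-error filter A(z);
    `decim` = 4 keeps roughly the lowest quarter of the residual spectrum.
    """
    e = lfilter(a, [1.0], speech)                # residual e(n) via inverse filtering
    b_lp, a_lp = butter(4, 0.9 / decim)          # low-pass near the new Nyquist edge
    baseband = filtfilt(b_lp, a_lp, e)[::decim]  # b(n): decimated baseband residual
    # Receiver: zero-stuff back to the original rate Fs and smooth (interpolation).
    up = np.zeros(len(baseband) * decim)
    up[::decim] = baseband * decim
    excitation = filtfilt(b_lp, a_lp, up)
    return lfilter([1.0], a, excitation)         # resynthesize with 1/A(z)

x = np.sin(2 * np.pi * 200 / 8000 * np.arange(800))
x_hat = relp_roundtrip(x, np.array([1.0, -1.2, 0.5]))
```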

4.3.3.2.3 Multipulse linear predictive coding [MP-LPC]

This is an alternative method of reducing the bit rate for the LPC residual, operating in the time domain. In MP-LPC the residual signal is represented by a small number of pulses per frame; the multipulse residual signal is constructed by choosing the pulse amplitudes and positions so as to minimize the perceptually weighted spectral error [28]. The analysis of each frame of speech is done by considering the multipulse residual from the prior frame, which continues to excite the LPC synthesizer to yield the output speech for the current frame. The so-called Skyphone service employs a 9.6 kbits/s MP-LPC with half-rate convolutional FEC, which was chosen as an international standard for aeronautical mobile satellite telecommunication. In subjective tests it was found to best satisfy all the requirements, such as tolerance of burst and random errors and robustness to background noise [25]. The disadvantage of MP-LPC is its relatively high computational load.
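The following sketch shows a common greedy form of the multipulse search, placing one pulse at a time where it most reduces the synthesis error (perceptual weighting is omitted for brevity, and all parameters are illustrative):

```python
import numpy as np
from scipy.signal import lfilter

def multipulse_search(target, a, n_pulses=8):
    """Greedy multipulse excitation: add pulses one at a time, each at the
    position (and with the amplitude) that best cancels the remaining error.
    """
    L = len(target)
    # Truncated impulse response of the synthesis filter 1/A(z).
    h = lfilter([1.0], a, np.concatenate(([1.0], np.zeros(L - 1))))
    excitation = np.zeros(L)
    residual = target.copy()
    for _ in range(n_pulses):
        corr = np.array([np.dot(residual[m:], h[: L - m]) for m in range(L)])
        energy = np.array([np.dot(h[: L - m], h[: L - m]) for m in range(L)])
        m = int(np.argmax(corr ** 2 / energy))   # best pulse position
        g = corr[m] / energy[m]                  # optimal amplitude at that position
        excitation[m] += g
        residual[m:] -= g * h[: L - m]           # remove this pulse's contribution
    return excitation

target = np.sin(2 * np.pi * 250 / 8000 * np.arange(80))   # one 10 ms subframe
exc = multipulse_search(target, np.array([1.0, -1.3, 0.6]))
```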

4.3.3.2.4 Code excited linear predictive coding [CELP]

CELP is most suited to lower bit rates. Here linear time-varying filters are used to represent the coarse and the fine spectral information [25]. The CELP algorithm is based on four main ideas: speech production is modeled by a source-filter model through linear prediction; the codec uses an adaptive and/or fixed codebook; the input excitation signal to the LP model is taken from the codebook; and the codebook search is performed in closed loop in a perceptually weighted domain. The modeling of the vocal tract is done using the source-filter model [27], as explained below. One of the main principles behind CELP is analysis-by-synthesis [AbS]: at the encoder, analysis is performed by perceptually optimizing the decoded (synthesized) signal in a closed loop. In theory, the best CELP stream would be produced by trying all possible bit combinations and selecting the one that produces the best-sounding decoded signal. This is obviously not possible in practice, for two reasons: first, the required complexity is beyond any currently available hardware, and second, the "best sounding" selection criterion implies a human listener.


The working of CELP is as follows:

1. The original speech signal x(n) is first partitioned into analysis frames of around 20-30 ms. LPC analysis is performed on each frame of x(n) to get the set of LPC coefficients, which are used in the short-term predictor [STP] to model the spectral envelope of the speech.

2. The STP used is assumed to be of memoryless type, and hence stores only the present value.

Fig. 4.3: CELP coder block schematic. [The original speech is compared against synthesis paths built from the weighted LPC synthesis filter 1/A(z), the long-term predictor 1/P(z) and the weighting filter W(z), starting from zero excitation: one loop selects the LTP delay D and gain β for minimum error, and a second loop selects the codebook index and gain for minimum error.]

3. Once the LPC coefficients are found, they are further given to the long-term predictor [LTP]. The LTP analysis is performed on sub-multiples of the LPC frame, of 5-10 ms. Both analysis methods introduce a delay D and associated scaling factors β_i, with i indexing the filter taps. The LTP introduces voice periodicity into the synthesized speech.

4. Once the parameters of the filters are found, the excitation signal y(n) is selected from the codebook: the codebook vector giving the minimum squared objective error, together with the corresponding scaling factor, is selected.

The block diagram of the standard CELP algorithm is shown in Figure 4.3. The overall computation can be broken into three blocks (a sketch of the codebook search follows the list):

LPC analysis

LTP analysis

Codebook search
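A minimal sketch of the fixed-codebook search (exhaustive over a small stochastic codebook, with the perceptual weighting omitted for brevity; the codebook size and subframe length are assumptions):

```python
import numpy as np
from scipy.signal import lfilter

def celp_search(target, a, codebook):
    """Closed-loop search: synthesize every candidate through 1/A(z) and keep
    the (index, gain) pair with minimum squared error against the target.
    """
    best = (None, 0.0, np.inf)
    for idx, code in enumerate(codebook):
        synth = lfilter([1.0], a, code)                    # candidate through 1/A(z)
        g = np.dot(target, synth) / np.dot(synth, synth)   # optimal gain for this entry
        err = np.sum((target - g * synth) ** 2)
        if err < best[2]:
            best = (idx, g, err)
    return best   # transmit the index plus the quantized gain

rng = np.random.default_rng(0)
codebook = rng.standard_normal((64, 40))   # 64 stochastic 5 ms codewords at 8 kHz
target = np.sin(2 * np.pi * 300 / 8000 * np.arange(40))
idx, gain, err = celp_search(target, np.array([1.0, -1.3, 0.6]), codebook)
```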

Short term prediction [STP]: The role of the STP is to represent the general spectral shape of the speech signal. The STP coefficients are calculated on a frame-by-frame basis; the important problems that can occur are delay and inaccuracy. The idea behind the CELP [24] concept is to predict the signal x(n) using a linear combination of its past samples:

$$y(n) = \sum_{i=1}^{N} a_i \, x(n-i) \qquad (4.4)$$

where y(n) is the linear prediction of x(n). The prediction error e(n) is thus given by

$$e(n) = x(n) - y(n) = x(n) - \sum_{i=1}^{N} a_i \, x(n-i) \qquad (4.5)$$

The goal of the LPC analysis is to find the prediction coefficients a_i which minimize the quadratic error function


$$E = \sum_{n=0}^{L-1} e^{2}(n) = \sum_{n=0}^{L-1} \left[ x(n) - y(n) \right]^{2} = \sum_{n=0}^{L-1} \left[ x(n) - \sum_{i=1}^{N} a_i \, x(n-i) \right]^{2} \qquad (4.6)$$

This can be done by setting all the partial derivatives to zero:

$$\frac{\partial E}{\partial a_i} = \frac{\partial}{\partial a_i} \sum_{n=0}^{L-1} \left[ x(n) - \sum_{j=1}^{N} a_j \, x(n-j) \right]^{2} = 0, \qquad i = 1, \ldots, N \qquad (4.7)$$

The coefficients a_i of an Nth-order filter can then be found by solving the N×N linear system Ra = r, where

$$R = \begin{bmatrix} R(0) & R(1) & \cdots & R(N-1) \\ R(1) & R(0) & \cdots & R(N-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(N-1) & R(N-2) & \cdots & R(0) \end{bmatrix}, \qquad r = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(N) \end{bmatrix}$$

Here R(m) is the autocorrelation function of the signal x(n), computed as

$$R(m) = \sum_{i=0}^{N-1} x(i) \, x(i-m) \qquad (4.8)$$

The system can be solved efficiently by exploiting the Toeplitz, Hermitian structure of R, for example with the Levinson-Durbin algorithm. Theoretically this yields a stable A(z), with all its roots inside the unit circle; in practice, because of finite precision, R(0) is first multiplied by a number slightly above one, which is equivalent to adding a small amount of noise, and the autocorrelation function is then used in a way that acts as a filter in the frequency domain, reducing sharp resonances.
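A compact sketch of this computation, pairing Eq. 4.8 with the Levinson-Durbin recursion (the order-10 filter and the R(0) scaling factor are illustrative choices):

```python
import numpy as np

def autocorr(x, order):
    """R(m) of Eq. 4.8 for m = 0..order."""
    return np.array([np.dot(x[: len(x) - m], x[m:]) for m in range(order + 1)])

def levinson_durbin(R, order):
    """Solve the Toeplitz system Ra = r recursively for the predictor coefficients.

    The returned array is [1, -a_1, ..., -a_p] in the sign convention of
    Eq. 4.3, so it can be used directly as the polynomial A(z) for filtering.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = R[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], R[i:0:-1]) / err   # reflection coefficient
        a[:i + 1] += k * a[i::-1]             # symmetric polynomial update
        err *= (1.0 - k * k)                  # prediction-error energy shrinks
    return a, err

x = np.sin(2 * np.pi * 200 / 8000 * np.arange(240)) + 0.01 * np.random.randn(240)
R = autocorr(x, 10)
R[0] *= 1.0001                                # the slight R(0) inflation noted above
a, err = levinson_durbin(R, 10)               # A(z) coefficients, order p = 10
```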


The forward LPC analysis cannot proceed until a whole frame (or more) of samples is available for computation, so a delay of at least one frame is introduced. To overcome this disadvantage, backward LPC may be preferred; however, backward LPC only operates successfully above 10 kbits/s.

The main drawback of LPC analysis occurs when transition regions, which are believed to be perceptually more important, fall within a frame. This drawback can be overcome using the popular technique known as frame interpolation: an improved spectrum representation is achieved by evaluating intermediate sets of parameters between the frames, so that transitions are introduced more smoothly at the frame edges without any increase in the coding capacity, though at the cost of increased delay.

Long Term Prediction [LTP]: The LTP has a small number of coefficients compared to the STP and is given by the general form

$$P(z) = 1 - \sum_{i=1}^{N} \beta_i \, z^{-(D+i)} \qquad (4.9)$$

The LTP used in CELP models the long-term correlation of the signal, which mainly depends on the pitch excitation; hence the long-term predictor can be replaced by a pitch predictor. There are two types of LTP:

Open-loop LTP [OLM]

Closed-loop LTP [CLM]

In the open-loop LTP, a residual signal is obtained by inverse filtering the original speech with the LPC coefficients, and the delay D and the gain G are found from it. Usually the delay D will be much greater than the length of the frame L; otherwise the effectiveness of the LTP is reduced, as D would not be able to adapt quickly to the onset of voiced speech [25]. The disadvantage of the OLM is the error between the original speech and the quantized speech; this drawback can be overcome using the CLM.


In the closed-loop method the main aim is to reduce the error between the synthesized speech signal and the original speech signal by finding the parameters gain G, delay D and scaling factor β. This is done in two steps: first, G is assumed to be zero and the LTP parameters (D and β) are found so as to minimize the error; second, the LTP is held constant and G is found.
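A sketch of such a closed-loop pitch search over an adaptive-codebook-style excitation buffer (the delay range and subframe length are assumed values; the gain quantization and the subsequent G search are omitted):

```python
import numpy as np
from scipy.signal import lfilter

def pitch_search(target, past_excitation, a, d_min=20, d_max=147):
    """For each candidate delay D, take the excitation segment D samples back,
    pass it through 1/A(z), and keep the (D, beta) minimizing the error.
    """
    L = len(target)
    best = (None, 0.0, np.inf)
    for D in range(d_min, d_max + 1):
        if D > L:
            candidate = past_excitation[-D: -D + L]
        else:  # repeat short lags to fill the whole subframe
            candidate = np.tile(past_excitation[-D:], L // D + 1)[:L]
        synth = lfilter([1.0], a, candidate)
        beta = np.dot(target, synth) / np.dot(synth, synth)  # optimal LTP gain
        err = np.sum((target - beta * synth) ** 2)
        if err < best[2]:
            best = (D, beta, err)
    return best   # (delay D, gain beta, error)

rng = np.random.default_rng(1)
past = rng.standard_normal(200)                  # previously decoded excitation
target = np.sin(2 * np.pi * 120 / 8000 * np.arange(40))
D, beta, err = pitch_search(target, past, np.array([1.0, -1.2, 0.5]))
```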

4.3.3.2.5 Vector sum excited linear predictive coding [VSELP]:

In the normal CELP coding technique the main disadvantage is the exhaustive search of the codebook needed to find the best match for synthesizing the speech signal with minimum error. This drawback can be overcome using VSELP, in which codewords are formed as vector combinations of codebook basis entries so as to minimize the error of the synthesized signal. For the majority of the speech coding analysis in VSELP the mean-square approximation is used, and it is very important to construct the basis vectors in a perceptually meaningful way [25]. In VSELP the LTP is treated as an adaptive codebook; for LTP lag values less than the subframe size, only the effect of the STP filter is considered. The total STP excitation is obtained by adding the gain-scaled secondary excitation to the LTP excitation. The main drawback of VSELP is its limited ability to encode non-speech sounds, and its performance is reduced in the presence of background noise [29].
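A minimal sketch of the vector-sum construction (the number of basis vectors and their random initialization are illustrative assumptions; in a real coder the basis vectors are trained):

```python
import numpy as np

def vselp_codeword(basis, bits):
    """A VSELP codeword is a +/- combination of M shared basis vectors, so the
    2^M codebook entries never need to be stored or searched one by one.
    """
    signs = np.where(np.array(bits) == 1, 1.0, -1.0)
    return signs @ basis            # sum of sign-flipped basis vectors

M, L = 7, 40                        # 7 basis vectors over a 5 ms subframe: 2^7 = 128 codewords
rng = np.random.default_rng(2)
basis = rng.standard_normal((M, L))
code = vselp_codeword(basis, [1, 0, 1, 1, 0, 0, 1])
# Flipping one bit flips one basis vector's sign, so neighboring codewords are
# similar; this structure is what makes the efficient VSELP search possible.
```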

4.4 SUMMARY

This chapter deals with the basic properties of the speech signal and briefly explains how speech is produced. It then describes the methods used to evaluate speech quality, both subjective and objective measures. Different speech compression techniques are studied, namely the waveform coder, the vocoder and the hybrid coder, and a comparative analysis is made between these coders. The chapter explains in detail the CELP coder, which mainly uses the AbS technique to synthesize speech.

