Speech Recognition using Digital Signal Processing Recognition...International Journal of...

International Journal of Electronics, Communication & Soft Computing Science and Engineering, ISSN: 2277-9477, Volume 2, Issue 6


Speech Recognition using Digital Signal Processing

Mr. Maruti Saundade, Mr. Pandurang Kurle

Abstract: - Speech recognition methods can be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of a person's speech that show up irrespective of what is being said. In a text-dependent system, on the other hand, recognition of the speaker's identity is based on his or her speaking one or more specific phrases, such as passwords, card numbers, or PIN codes. This paper is based on a text-independent speaker recognition system; it uses Mel-frequency cepstrum coefficients (MFCC) to process the input signal and a vector quantization approach to identify the speaker. The task is implemented in MATLAB. The digital signal processor (DSP) is one of the most commonly used hardware platforms; it provides good development flexibility and requires a relatively short application development cycle. DSP techniques have been at the heart of progress in speech processing during the last 25 years. At the same time, speech processing has been an important catalyst for the development of DSP theory and practice. Today DSP methods are used in speech analysis, synthesis, coding, recognition, and enhancement, as well as in voice modification, speaker recognition, and language identification. Speech recognition is generally a computationally intensive task and involves many digital signal processing algorithms.

I. INTRODUCTION

The objective of human speech is not merely to transfer words from one person to another, but rather to communicate a thought, concept, or idea. The final product is not the words or phrases that are spoken and heard, but the information they convey. In computer speech recognition, a person speaks into a microphone or telephone and the computer listens. Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Automatic speech recognition technology has advanced rapidly in the past decades. Speech recognition is a vast topic of interest and is regarded as a complex problem. In a practical sense, speech recognition solves problems, improves productivity, and changes the way we run our lives. Reliable speech recognition is a hard problem that requires a combination of many techniques; however, modern methods have achieved an impressive degree of accuracy [1]. Real-time digital signal processing made considerable advances after the introduction of specialized DSP processors.

II. LITERATURE SURVEY

Every speech recognition application is designed to accomplish a specific task. Examples include recognizing the digits zero through nine and the words “yes” and “no” over the telephone, enabling bedridden patients to control the positioning of their beds, and implementing a VAT (voice-activated typewriter). Once a task is defined, a speech recognizer is chosen or designed for the task.

Recognizers fall into one of several categories depending on whether the system must be “trained” for each individual speaker, whether it requires words to be spoken in isolation or can deal with continuous speech, whether its vocabulary contains a small or a large number of words, and whether or not it operates with input received by telephone. Speaker-dependent systems can effectively recognize speech only for speakers who have previously been enrolled on the system. The aim of speaker-independent systems is to remove this restriction and recognize the speech of any talker without prior enrolment. When a speech recognition system requires words to be spoken individually, in isolation from other words, it is said to be an isolated-word system: it recognizes only discrete words, and only when they are separated from their neighbours by distinct interword pauses. Continuous speech recognizers, on the other hand, allow a more fluent form of talking. Large-vocabulary recognizers are defined as those that have more than one thousand words in their vocabularies; the others are considered small-vocabulary systems. Finally, recognizers designed to work with the lower-bandwidth waveforms imposed by the telephone network are differentiated from those that require a broader-bandwidth input [4].

Digital signal processors are special types of processors that differ from general-purpose processors (GPPs). Typical DSP features are high-speed DSP computations, a specialized instruction set, high-performance repetitive numeric calculations, fast and efficient memory accesses, special mechanisms for real-time I/O, low power consumption, and low cost in comparison with a GPP. The important DSP characteristics are the data path and internal architecture, the specialized instruction set, the external memory architecture, special addressing modes, specialized execution control, and specialized peripherals for DSP [6].

At the beginning of each implementation process is an important decision: the choice of an appropriate hardware platform on which the digital signal processing system will run. It is necessary to understand the hardware aspects in order to implement effective, optimized algorithms. These hardware aspects imply several criteria for choosing the appropriate platform. It is preferable to choose a signal processor rather than a general-purpose processor. The decisive factor may not be the processor's clock frequency but its effectiveness. DSP tasks require repetitive numeric calculations, attention to numeric fidelity, high memory bandwidth, and real-time processing. Processors must perform these tasks efficiently while minimizing cost, power consumption, memory use, and development time. To properly select a suitable architecture for DSP and speech recognition systems, it is necessary to examine the available offerings carefully and to become familiar with the hardware capabilities of the “candidates”. The decision must take into account some basic features in which processors from different manufacturers differ. Most DSPs use fixed-point arithmetic, because in real-world signal processing the additional range provided by floating point is not needed, and there is a large speed and cost benefit due to reduced hardware complexity. Floating-point DSPs may be invaluable in applications where a wide dynamic range is required.

Different algorithms, such as linear predictive coding (LPC) and MFCC (Mel Frequency Cepstrum Coefficients), can be used to implement speech recognition. The advantages of the MFCC method are that it captures the phonetically important characteristics of speech and that band limiting can easily be employed to make it suitable for telephone applications.
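The trade-off between fixed-point and floating-point arithmetic mentioned above can be made concrete with a small sketch. It is written in Python for illustration only (the paper's own implementation is in MATLAB) and assumes the common 16-bit Q15 format, which the paper does not name; the helper names `to_q15`, `from_q15`, and `q15_mul` are hypothetical.

```python
# Q15 fixed-point sketch: 16-bit signed integers represent values in [-1, 1).
# The Q15 format and all helper names are assumptions made for illustration.
Q15_ONE = 1 << 15  # scale factor 32768


def to_q15(x: float) -> int:
    """Quantize a float to a saturated 16-bit Q15 integer."""
    v = int(round(x * Q15_ONE))
    return max(-Q15_ONE, min(Q15_ONE - 1, v))


def from_q15(q: int) -> float:
    """Convert a Q15 integer back to a float."""
    return q / Q15_ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 numbers: wide product, round, shift right by 15, saturate."""
    prod = a * b + (1 << 14)  # add half an LSB so the shift rounds
    v = prod >> 15
    return max(-Q15_ONE, min(Q15_ONE - 1, v))


a = to_q15(0.5)        # stored as 16384
b = to_q15(-0.25)      # stored as -8192
p = q15_mul(a, b)      # integer-only multiply, no floating-point hardware needed
tiny = to_q15(1e-6)    # below the Q15 resolution, so it quantizes to 0
clipped = to_q15(1.0)  # 1.0 is not representable and saturates to 32767
```

The last two lines show the limited dynamic range of the format: very small values vanish and values at or beyond 1.0 saturate, which is the situation in which a floating-point DSP becomes attractive.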

III. FUNCTIONAL DESCRIPTION

Principles of Speaker Recognition

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems. At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching is the actual procedure for identifying the unknown speaker by comparing the features extracted from his or her voice input with those from a set of known speakers.

Figure 1. Basic structures of the speaker recognition process: (a) speaker identification, (b) speaker verification

All speaker recognition systems operate in two distinct phases. The first is referred to as the enrollment or training phase, while the second is referred to as the operation or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In speaker verification systems, a speaker-specific threshold is additionally computed from the training samples. During the testing phase (Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made.
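The testing phase for both structures can be sketched in a few lines. This is a minimal illustration in Python (the paper's own implementation is in MATLAB), assuming that each reference model is a vector-quantization codebook and that similarity is measured by average Euclidean distortion, so the "maximum similarity" choice becomes a minimum-distortion choice. The function names, toy feature vectors, and threshold value are all hypothetical.

```python
import math


def distortion(features, codebook):
    """Average distance from each feature vector to its nearest codeword."""
    total = 0.0
    for vec in features:
        total += min(math.dist(vec, code) for code in codebook)
    return total / len(features)


def identify(features, models):
    """Identification: pick the registered speaker whose codebook fits best."""
    return min(models, key=lambda name: distortion(features, models[name]))


def verify(features, claimed_codebook, threshold):
    """Verification: accept the claim if the distortion is within the threshold."""
    return distortion(features, claimed_codebook) <= threshold


# Toy 2-D "feature vectors" and codebooks; every value here is made up.
models = {
    "speaker_a": [(0.0, 0.0), (1.0, 1.0)],
    "speaker_b": [(5.0, 5.0), (6.0, 6.0)],
}
test_features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1)]

best = identify(test_features, models)
accepted = verify(test_features, models["speaker_a"], threshold=0.5)
rejected = verify(test_features, models["speaker_b"], threshold=0.5)
```

In a real system the threshold would be the speaker-specific value computed from the training samples rather than a fixed constant.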

Figure 1 block labels: (a) identification: input speech, feature extraction, similarity with each reference model, maximum selection, identification result; (b) verification: input speech, feature extraction, similarity with the reference model for the claimed speaker ID, decision against a threshold, verification result (accept/reject).

SPEECH FEATURE EXTRACTION:

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
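The idea of short-time spectral analysis can be illustrated with a toy signal whose frequency content changes over time. The Python sketch below (not the authors' MATLAB code) uses a naive DFT on two 20 ms frames; the 8 kHz sampling rate, the frame length, and the synthetic two-tone signal are assumptions chosen only to keep the example small.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 of one frame (naive O(N^2) transform)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(abs(s))
    return mags


def peak_bin(frame):
    """Index of the DFT bin with the largest magnitude."""
    mags = dft_magnitudes(frame)
    return max(range(len(mags)), key=mags.__getitem__)


FS = 8000        # sampling rate in Hz (assumed for this toy example)
FRAME_LEN = 160  # 20 ms at 8 kHz, inside the 5 to 100 ms range quoted above

# Toy signal: 500 Hz for the first 0.2 s, then 1500 Hz for the next 0.2 s.
signal = [
    math.sin(2 * math.pi * (500 if i < 1600 else 1500) * i / FS)
    for i in range(3200)
]

hz_per_bin = FS / FRAME_LEN  # 50 Hz per DFT bin
first_peak_hz = peak_bin(signal[:FRAME_LEN]) * hz_per_bin
last_peak_hz = peak_bin(signal[-FRAME_LEN:]) * hz_per_bin
```

Within each short frame the spectrum is stable, while a single transform over the whole 0.4 s would blur the two tones together; that is the motivation for analyzing speech frame by frame.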

Figure 2. An example of speech signal

Figure 3: Speech signal in time domain

MEL-FREQUENCY CEPSTRUM COEFFICIENTS PROCESSOR:

MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the Mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can capture all frequencies up to 5 kHz, which cover most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs have been shown to be less susceptible than the speech waveforms themselves to the variations mentioned.
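The Mel-frequency scale described above can be sketched numerically. The paper gives no formula, so the Python example below assumes the widely used approximation mel(f) = 2595 log10(1 + f/700); the function names and the choice of 20 filters are hypothetical, while the 5 kHz upper limit follows the bandwidth quoted in the text.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Common mel approximation (an assumption; the paper gives no formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_low: float, f_high: float):
    """Centre frequencies (Hz) of filters spaced uniformly on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# 20 filters (assumed count) covering 0 Hz to 5 kHz.
centers = mel_center_frequencies(20, 0.0, 5000.0)
spacings = [b - a for a, b in zip(centers, centers[1:])]
```

With this approximation 1000 Hz maps to roughly 1000 mel, and the spacing between neighbouring centre frequencies grows with frequency, which matches the narrow-then-wide filter layout described in the text.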

Figure 3. Block diagram of the MFCC processor: continuous speech -> frame blocking -> (frame) -> windowing -> FFT -> (spectrum) -> mel-frequency wrapping -> (mel spectrum) -> cepstrum -> (mel cepstrum)

FRAME BLOCKING:

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps the first frame by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 ms windowing and facilitates the fast radix-2 FFT) and M = 100.

WINDOWING:

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0
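The frame blocking and windowing steps can be sketched as follows, using the typical values N = 256 and M = 100 from the text. The example is in Python rather than the authors' MATLAB, the function names are hypothetical, and the Hamming window is an assumption: the surviving text does not say which window is used. A dummy ramp stands in for real speech samples.

```python
import math


def frame_blocking(signal, n=256, m=100):
    """Split a signal into frames of n samples, each starting m samples after
    the previous one. Trailing samples that do not fill a frame are dropped."""
    return [signal[start:start + n] for start in range(0, len(signal) - n + 1, m)]


def hamming(n):
    """Hamming window of length n (assumed; the text does not name a window)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [x * w for x, w in zip(frame, window)]


N, M = 256, 100
samples = [float(i) for i in range(1000)]  # dummy ramp standing in for speech
frames = frame_blocking(samples, N, M)
window = hamming(N)
windowed = [apply_window(f, window) for f in frames]
```

Consecutive frames share N - M = 156 samples, and the first and third frames share N - 2M = 56. Note that a Hamming window tapers each frame toward a small value (0.08 of the original amplitude) at the edges rather than exactly to zero.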
