The process of speech production and perception in Human Beings

Date post: 09-Jan-2016
Upload: rance
Transcript
Page 1: The process of speech production and perception in Human Beings

The process of speech production and perception in Human Beings

Speech Generation

• The production process (generation) begins when the talker formulates a message in his mind that he wants to transmit to the listener via speech.

• In the case of a machine:

– First step: message formation in terms of printed text.

– Next step: conversion of the message into a language code.

Speech Generation

• After the language code is chosen, the talker must execute a series of neuromuscular commands to cause the vocal cords to vibrate such that the proper sequence of speech sounds is created.

• The neuromuscular commands must simultaneously control the movement of lips, jaw, tongue, and velum.

Speech Perception

• Once the speech signal is generated and propagated to the listener, the speech perception (recognition) process begins.

• First the listener processes the acoustic signal along the basilar membrane in the inner ear, which provides a running spectral analysis of the incoming signal.

Sound perception

• The audible frequency range for humans is approximately 20 Hz to 20 kHz.

• The three distinct parts of the ear are the outer ear, the middle ear and the inner ear.

• Outer ear:

– The outer ear includes the pinna and the auditory canal.

– It helps to direct an incident sound wave into the middle ear.

– It filters and modifies the captured sound.

• Outer ear:

– The perceived sound is sensitive to the pinna's shape.

– Changing the pinna's shape alters the sound quality as well as the background noise.

– After passing through the ear canal, the sound wave strikes the eardrum, which is part of the middle ear.

• Middle ear

– Eardrum:

• It oscillates at the same frequency as the sound wave.

• Movements of this membrane are then transmitted through a system of small bones called the ossicular system.

• From the ossicular system the vibrations pass to the cochlea.

• The ossicular system achieves an efficient form of impedance matching between the air in the ear canal and the fluid in the cochlea.

• Inner ear

– It contains two membranes: Reissner's membrane and the basilar membrane.

– When vibrations enter the cochlea they stimulate 20 000 to 30 000 stiff hairs on the basilar membrane.

– These hairs in turn vibrate and generate electrical signals that travel to the brain, where they are perceived as sound.

Figure: parts of the ear (labels: pinna, auditory canal, tympanic membrane, ossicular system, cochlea, basilar membrane).

Speech Perception

• A neural transduction process converts the spectral signal into activity signals on the auditory nerve.

• The neural activity is converted into a language code in the brain.

• Finally the message comprehension (understanding of meaning) is achieved.

Any coding technique could be used to transmit the acoustic waveform from the talker to the listener.

The Speech-Production Process

Figure: longitudinal cross-section of the vocal tract (labels: vocal tract; opening of the vocal cords; lips, at the end of the vocal tract).

The Speech-Production Process

• The vocal tract consists of:

– Pharynx: the connection from the esophagus to the mouth

– Mouth, or oral cavity

• The total length is about 17 cm.

• The cross-sectional area, determined by the positions of the tongue, lips, jaw, and velum, varies from zero (complete closure) to about 20 cm².

• The nasal tract begins at the velum and ends at the nostrils.

The Speech-Production Process

• Velum: a trap-door like mechanism at the back of the mouth cavity.

• When the velum is lowered, the nasal tract is acoustically coupled to the vocal tract to produce nasal sounds of speech.

The Speech-Production Process

• The lungs and the associated muscles excite the vocal mechanism.

• The muscle force pushes air out of the lungs and through the bronchi and trachea.

• When the vocal cords are tensed, the air flow causes them to vibrate, producing so-called voiced speech sounds.

• When the vocal cords are relaxed, an unvoiced sound is produced.

• Speech is produced as a sequence of sounds.

Representing Speech In The Time and Frequency Domains

The speech signal is a slowly time-varying signal.

Representing Speech In The Time and Frequency Domains

Depending upon the state of the vocal cords, the events in speech are classified as:

– Silence (S): no speech is produced.

– Unvoiced (U): the vocal cords are not vibrating, so the resulting speech waveform is aperiodic or random in nature.

– Voiced (V): the vocal cords are tensed and therefore vibrate periodically when air flows from the lungs, so the resulting speech waveform is quasi-periodic (the pulses are not strictly periodic, but vary slightly from cycle to cycle).
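The three classes above can be told apart, roughly, with two simple per-frame measurements. The sketch below is an illustration only and is not taken from the slides: it assumes short-time energy separates silence from speech and that zero-crossing rate separates noise-like (unvoiced) frames from quasi-periodic (voiced) ones. The function names and both thresholds are hypothetical and would need tuning on real recordings.

```python
import math
import random


def frame_features(frame):
    """Return (mean energy per sample, zero-crossing rate) for one frame."""
    energy = sum(s * s for s in frame) / len(frame)
    # Count sign changes between consecutive samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return energy, zcr


def classify_frame(frame, silence_energy=1e-4, zcr_threshold=0.25):
    """Label a frame 'S', 'U' or 'V'. Both thresholds are hypothetical."""
    energy, zcr = frame_features(frame)
    if energy < silence_energy:
        return "S"  # almost no energy: silence
    # Noise-like frames change sign often; quasi-periodic frames do not.
    return "U" if zcr > zcr_threshold else "V"


fs = 8000  # assumed sampling rate in Hz
n = 240    # 30 ms frame at 8 kHz
rng = random.Random(0)

silence = [0.0] * n
unvoiced = [rng.uniform(-0.1, 0.1) for _ in range(n)]              # random noise
voiced = [0.5 * math.sin(2 * math.pi * 120 * t / fs) for t in range(n)]  # 120 Hz tone

labels = [classify_frame(f) for f in (silence, unvoiced, voiced)]
print(labels)
```

Synthetic frames are used here only so that the example is self-contained; real speech mixes these cases and usually needs smoothing across neighbouring frames.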

Representing Speech In The Time and Frequency Domains

• Spectral representation: An alternative way of characterizing the speech signal and representing the information associated with the sounds.

• Sound Spectrogram (most popular representation): a three-dimensional representation that portrays speech intensity in different frequency bands over time.
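As a rough illustration of how a spectrogram is built, the sketch below cuts a signal into overlapping frames, applies a Hann window, and takes the magnitude of a naive DFT for each frame. Everything here is an assumption made for the example (frame length, hop size, sampling rate, the two-tone test signal, the function names); a practical system would use an FFT library instead of the O(N²) transform shown.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2, computed with a naive O(N^2) transform."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def spectrogram(signal, frame_len=64, hop=32):
    """Return one list of bin magnitudes per frame (time runs along the outer list)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]  # Hann window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append(dft_magnitudes(chunk))
    return frames


fs = 8000  # assumed sampling rate in Hz
# Test signal: a 1000 Hz tone followed by a 2000 Hz tone.
signal = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(256)]
signal += [math.sin(2 * math.pi * 2000 * t / fs) for t in range(256)]

spec = spectrogram(signal)
# Strongest frequency bin in each frame, converted to Hz (bin width = fs / frame_len).
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
peak_hz = [b * fs / 64 for b in peak_bins]
print(peak_hz[0], peak_hz[-1])
```

Each inner list is one column of the spectrogram: intensity per frequency band at one instant. Plotting those columns side by side, with intensity shown as darkness, gives the familiar picture.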

Speech Sounds and Features

Front vowels: i (IY), I (IH), e (EH), æ (AE)

Phonetics

Phonetics (from the Greek word φωνή, phone = sound/voice) is the study of sounds (voice). It is concerned with the actual properties of speech sounds (phones) as well as those of non-speech sounds, and their production, audition and perception

Phonetics has three main branches:

• articulatory phonetics, concerned with the positions and movements of the lips, tongue, vocal tract and folds and other speech organs in producing speech

• acoustic phonetics, concerned with the properties of the sound waves and how they are received by the inner ear

• auditory phonetics, concerned with speech perception, principally how the brain forms perceptual representations of the input it receives.

• Phoneme: (linguistics) one of a small set of speech sounds that are distinguished by the speakers of a particular language

• Syllable : A unit of spoken language larger than a phoneme

Speech Sounds and Features

The vowels:

• The vowel sounds are an interesting class of sounds in English.

• Practical speech-recognition systems rely heavily on vowel recognition to achieve high performance.

• If we omit the vowel letters in a sentence, the resulting text is still easy to decode.

• But if we omit the consonant letters, the resulting text is not decodable.

Speech Sounds and Features

They noted significant improvements in the company’s image, supervision, their working conditions, benefits and opportunities for growth
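The claim about omitting vowel letters versus consonant letters can be tried directly on the sentence above. This is a small sketch with hypothetical helper names; it treats only a, e, i, o, u as vowel letters (so "y" counts as a consonant, which is a simplification) and replaces each omitted letter with an underscore.

```python
VOWELS = set("aeiouAEIOU")


def drop_vowels(text):
    """Replace every vowel letter with '_' and keep everything else."""
    return "".join("_" if ch in VOWELS else ch for ch in text)


def drop_consonants(text):
    """Replace every consonant letter with '_' and keep vowels and punctuation."""
    return "".join("_" if ch.isalpha() and ch not in VOWELS else ch for ch in text)


sentence = ("They noted significant improvements in the company's image, "
            "supervision, their working conditions, benefits and "
            "opportunities for growth")

print(drop_vowels(sentence))
print(drop_consonants(sentence))
```

Reading the two printed lines side by side shows the effect: the consonant skeleton remains guessable, while the vowel-only version gives little to work with.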

Speech Sounds and Features

• In speaking, vowels are produced by exciting an essentially fixed vocal tract shape with quasi-periodic pulses of air caused by the vibration of the vocal cords.

• The way in which the cross-sectional area varies along the vocal tract determines the resonance frequencies of the tract and thereby the sound that is produced.
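To give a feel for the resonance frequencies involved, the sketch below uses the simplest possible model, which is an assumption and not something the slides state: a lossless tube of uniform cross-section, closed at the glottis and open at the lips. Such a tube resonates at odd multiples of c / (4L). The 17 cm length comes from the earlier slide on the vocal tract; the speed of sound (340 m/s) and the function name are assumed for the example.

```python
def uniform_tube_resonances(length_m=0.17, speed_of_sound=340.0, count=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    F_n = (2n - 1) * c / (4 * L) for n = 1, 2, 3, ...
    """
    return [(2 * n - 1) * speed_of_sound / (4 * length_m)
            for n in range(1, count + 1)]


formants = uniform_tube_resonances()
print([round(f) for f in formants])
```

Under these assumptions the first three resonances fall near 500, 1500 and 2500 Hz. A real vocal tract is not uniform, and it is exactly the departure from uniform cross-section that moves these resonances and distinguishes one vowel from another.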

Speech Sounds and Features

• The vowel sound produced is determined by the position of the tongue.

• The positions of the jaw, lips, and, to a small extent, the velum also influence the resulting sound.

Speech Sounds and Features

• The vowels are long in duration compared with consonant sounds.

• They are spectrally well defined.

• Easily and reliably recognized, both by machine and by humans.

Speech Sounds and Features

A convenient and simplified way of classifying vowel articulatory configurations is in terms of tongue hump position (front, mid, back) and tongue hump height (high, mid, low).

Speech Sounds and Features

The concept of a “typical” vowel sound is unreasonable in light of the variability of vowel pronunciation among men, women and children with different regional accents and other variable characteristics.

Diphthong

• In phonetics, a diphthong (Greek δίφθογγος, "diphthongos", literally "with two sounds") is a vowel combination involving a quick but smooth movement from one vowel to another.

• It is often interpreted by listeners as a single vowel sound or phoneme.

• While "pure" vowels, or monophthongs, are said to have one target tongue position, diphthongs have two target tongue positions.

• Pure vowels are represented in the International Phonetic Alphabet by one symbol: English "sum" as [səm], for example.

• Diphthongs are represented by two symbols, for example English "same" as [seɪm], where the two vowel symbols are intended to represent approximately the beginning and ending tongue positions.

Diphthongs in British English:

• [əʊ] as in hope
• [aʊ] as in house
• [aɪ] as in kite
• [eɪ] as in same
• [juː] as in few
• [ɔɪ] as in join
• [ɪə] as in fear
• [ɛə] as in hair
• [ʊə] as in poor

Speech Sounds and Features

Semivowels: (A vowel-like sound that serves as a consonant)

• The group of sounds consisting of /w/, /l/, /r/, and /y/ is quite difficult to characterize. These sounds are called semivowels because of their vowel-like nature.

• Similar nature to the vowels and diphthongs

• Example: in ‘how’, the w sounds close to the ‘oo’ in ‘boot’.

The English word "well" would sound the same if it were spelled "uell" or "ooell". Things must be considered in the opposite manner: the fact that the spellings "uell" or "ooell" might also be pronounced as /w/ doesn’t mean that /w/ is a vowel as in boot. The graphical form has nothing to do with phonetics, so set it aside and you’ll find that /w/ is a consonant, and /u(:)/ a vowel.

Speech Sounds and Features

Nasal Consonants

• The nasal consonants /m/, /n/, /ŋ/ are produced with glottal excitation and the vocal tract totally constricted at some point along the oral passageway.

• The velum is lowered so that air flows through the nasal tract, with sound being radiated at the nostrils.

• The oral cavity, although constricted toward the front, is still acoustically coupled to the pharynx.

• The mouth serves as a resonant cavity that traps acoustic energy at certain natural frequencies.

Voiced consonant

• A voiced consonant is a sound made as the vocal cords vibrate, as opposed to a voiceless consonant, where the vocal cords are relaxed. See phonation for a continuum of degrees of tension in the vocal cords.

• Examples of voiced-voiceless pairs of consonants are:

• Voiced Voiceless

• [b] [p]

• [d] [t]

• [g] [k]

Voiced consonant

If you place your fingers on your voice box, you can feel a buzz when you pronounce zzzz, but not when you pronounce ssss. That buzz is the vibration of your vocal cords. Except for this, the sounds [s] and [z] are practically identical, with the same use of tongue and lips.

Unvoiced consonant

• In phonetics, a voiceless consonant is a consonant that doesn't have voicing. That is, it is produced without vibration of the vocal cords.

• Voiceless consonants are usually articulated more strongly than their voiced counterparts, because in voiced consonants, the energy used in pronunciation is split between the laryngeal vibration and the oral articulation.

• Ex. peculiar and particular.

Speech Sounds and Features

Voiced Fricatives:

• The voiced fricatives are /v/, /th/, /z/, and /zh/.

• Their unvoiced counterparts are /f/, /θ/, /s/, and /sh/, respectively.

• For voiced fricatives the vocal cords are vibrating.

• Since the vocal tract is constricted (narrow) at some point forward of the glottis, the air flow becomes turbulent in the neighborhood of the constriction.

Speech Sounds and Features

Voiced Stops

• The voiced stop consonants /b/, /d/, and /g/ are transient, noncontinuant sounds produced by building up pressure behind a total constriction somewhere in the oral tract and then suddenly releasing the pressure.

• For /b/ the constriction is at the lips; for /d/ it is at the back of the teeth; and for /g/ it is near the velum.

Unvoiced Stops

• The unvoiced stop consonants are /p/, /t/, and /k/.

Approaches to Automatic Speech Recognition by Machine

Three approaches:

– The acoustic-phonetic approach.

– The pattern recognition approach.

– The artificial intelligence approach.

Approaches to Automatic Speech Recognition by Machine

• The smallest linguistically distinct unit in speech is called a phoneme. It does not have any meaning alone, but it makes it possible to discriminate between different words.

• Speech signals are sequences of linguistically separable units (phonemes).

• Phoneme transitions are mostly relatively smooth.

• A phone signifies the physical sound that is produced when a phoneme is uttered.

• One phoneme can be pronounced in different ways; each of the similar phones that realize a single phoneme is called an allophone.

Approaches to Automatic Speech Recognition by Machine

The acoustic phonetic approach is based on the theory of acoustic phonetics

• First step: segmentation and labeling

• Second step: to determine a valid word

Approaches to Automatic Speech Recognition by Machine

The pattern-recognition approach to speech recognition is basically one in which the speech patterns are used directly without explicit feature determination and segmentation

• Step one: training of speech patterns

• Step two: recognition of pattern via pattern comparison

• Speech “knowledge” is brought into the system via the training procedure

Approaches to Automatic Speech Recognition by Machine

The pattern-recognition approach is the method of choice for speech recognition for three reasons:

1. Simplicity of use. The method is easy to understand, it is rich in mathematical and communication theory justification for individual procedures used in training and decoding, and it is widely used and understood.

2. Robustness and invariance to different speech vocabularies, users, feature sets, pattern comparison algorithms and decision rules.

3. Proven high performance.

Approaches to Automatic Speech Recognition by Machine

The AI approach is a hybrid of the acoustic-phonetic approach and the pattern recognition approach

• The AI approach attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing and finally making a decision on the measured acoustic features.

Acoustic-phonetic Approach to Speech Recognition

Provides spectral representation

Acoustic-phonetic Approach to Speech Recognition

1. Speech Analysis System: (common to all approaches to speech recognition) Provides an appropriate representation of the characteristics of the time-varying speech signal. The technique used is the linear predictive coding (LPC) method.

2. Feature detection stage: Convert the spectral measurements to a set of features that describe the acoustic properties of the different phonetic units.

– Features:

– Nasality: presence or absence of nasal resonance

– Friction: presence or absence of random excitation in the speech

– Formant locations: frequencies of the first three resonances

– Voiced and unvoiced classification: periodic and aperiodic excitation

Acoustic-phonetic Approach to Speech Recognition

3. The segmentation and labeling phase: The system tries to find stable regions and then to label the segmented region according to how well the features within that region match those of individual phonetic units.

4. This stage is the heart of the acoustic-phonetics recognizer and is the most difficult one to carry out reliably.

5. So various control strategies are used to limit the range of segmentation points and label possibilities.

6. The final output of the recognizer is the word or word sequence, in some well-defined sense.
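The speech analysis stage in step 1 names linear predictive coding but does not show it. As a minimal sketch, assuming the common autocorrelation formulation solved with the Levinson-Durbin recursion (the slides do not say which formulation is meant), LPC finds coefficients a_1..a_p so that each sample is approximated by a weighted sum of the p previous samples. The function names and the toy input are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, error): a[k-1] is the weight of x[n-k] in the predictor
    x_hat[n] = sum_k a[k-1] * x[n-k], and error is the residual energy.
    """
    a = []
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            a = a + [0.0]  # degenerate frame (for example, silence)
            continue
        # Reflection coefficient for this order.
        acc = r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / error
        # Update the lower-order coefficients, then append the new one.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        error *= (1.0 - k * k)
    return a, error


# Toy input: a decaying exponential, where each sample is 0.9 times the previous one.
frame = [0.9 ** t for t in range(200)]
coeffs, err = levinson_durbin(autocorrelation(frame, 2), 2)
print([round(c, 4) for c in coeffs])
```

For this toy input the first coefficient comes out close to 0.9 and the second close to 0, which matches how the signal was built. On real speech the frame would first be windowed, and a higher order (often around 10 at 8 kHz sampling) is typical; formant locations can then be estimated from the resulting spectrum.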

Speech Sound Classifier

Acoustic Phonetic Vowel Classifier

• Three features have been detected over the segment, first formant, F1, second formant, F2, and duration of the segment, D.

• The first test separates vowels with low F1 from vowels with high F1.

• Each of these subsets can be split further on the basis of the F2 measurement, into vowels with high F2 and vowels with low F2.

• The third test is based on segment duration, which separates tense vowels (large value of D) from lax vowels (small values of D).

• Finally, a finer test on formant values separates the remaining unresolved vowels, resolving the vowels into flat vowels and plain vowels
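The first three tests described above form a small binary decision tree, which can be sketched as follows. The thresholds are hypothetical placeholders chosen only to make the example run; the slides give no numeric values, and the final "finer test" on formant values (flat versus plain vowels) is left out because it is not specified.

```python
def classify_vowel(f1_hz, f2_hz, duration_ms,
                   f1_split=450.0, f2_split=1500.0, duration_split=100.0):
    """Apply the three binary tests in order. All split values are hypothetical."""
    # Test 1: low versus high first formant.
    f1_class = "low F1" if f1_hz < f1_split else "high F1"
    # Test 2: high versus low second formant.
    f2_class = "high F2" if f2_hz >= f2_split else "low F2"
    # Test 3: long segments are treated as tense vowels, short ones as lax.
    tenseness = "tense" if duration_ms >= duration_split else "lax"
    return f1_class, f2_class, tenseness


# Illustrative measurements for one segment (not taken from the slides).
print(classify_vowel(270, 2290, 160))
print(classify_vowel(660, 1100, 70))
```

Each call returns the leaf of the tree that the segment falls into; mapping a leaf to a specific vowel label would need the reference formant values that the original figure presumably contained.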

Problems In Acoustic Phonetic Approach

• The method requires extensive knowledge of the acoustic properties of phonetic units.

• For most systems the choice of features is based on intuition and is not optimal in a well defined and meaningful sense.

• The design of sound classifiers is also not optimal.

• No well-defined, automatic procedure exists for tuning the method on real, labeled speech.

Statistical Pattern-Recognition Approach to Speech Recognition

• Feature measurement: a sequence of measurements is made on the input signal to define the “test pattern”. The feature measurements are usually the output of a spectral analysis technique, such as a filter-bank analyzer, an LPC analysis, or a DFT analysis.

• Pattern training: creates a reference pattern (for each sound class), called a template.

• Pattern classification: unknown test pattern is compared with each (sound) class reference pattern and a measure of distance between the test pattern and each reference pattern is computed.

• Decision logic: the reference pattern similarity (distance) scores are used to decide which reference pattern best matches the unknown test pattern.
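The training, classification and decision steps above can be sketched with the simplest choices available. The assumptions here are mine, not the slides': each pattern is a single fixed-length feature vector, a template is the average of a class's training vectors, and similarity is Euclidean distance. Practical recognizers compare sequences of vectors and need time alignment, which this sketch omits. All names and numbers are hypothetical.

```python
import math


def train_templates(training_data):
    """Pattern training: average each class's feature vectors into one template."""
    templates = {}
    for label, vectors in training_data.items():
        dim = len(vectors[0])
        templates[label] = [sum(v[d] for v in vectors) / len(vectors)
                            for d in range(dim)]
    return templates


def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(test_pattern, templates):
    """Pattern classification plus decision logic: pick the nearest template."""
    scores = {label: distance(test_pattern, ref) for label, ref in templates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Made-up two-dimensional feature vectors for two sound classes.
training_data = {
    "class_a": [[1.0, 0.2], [0.8, 0.1]],
    "class_b": [[0.1, 0.9], [0.3, 1.1]],
}
templates = train_templates(training_data)
label, scores = classify([0.9, 0.3], templates)
print(label)
```

Note that no speech-specific knowledge appears anywhere in the code: everything the system "knows" about the sound classes enters through the training data.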

The strengths and weaknesses of the Pattern-Recognition model

• The performance of the system is sensitive to the amount of training data available for creating the sound class reference patterns (more training, higher performance).

• The reference patterns are sensitive to the speaking environment and to the transmission characteristics of the medium used to create the speech (because speech spectral characteristics are affected by transmission and background noise).

• The method is relatively insensitive to syntax and semantics.

The strengths and weaknesses of the Pattern-Recognition model

• The system is insensitive to the sound class, so the techniques can be applied to a wide range of speech sounds (phrases).

• It is easy to incorporate the syntactic and semantic constraints directly into the pattern-recognition structure, thereby improving recognition accuracy and reducing computation.

AI Approaches To Speech Recognition

The basic idea of the AI approach is to compile and incorporate knowledge from a variety of knowledge sources to solve the problem.

• Acoustic Knowledge: Knowledge related to sound or sense of hearing

• Lexical Knowledge: Knowledge of the words of the language. (decomposing words into sounds)

• Syntactic Knowledge: Knowledge of syntax

• Semantic Knowledge: Knowledge of the meaning of the language.

• Pragmatic Knowledge: (sense derived from meaning) inference ability necessary in resolving ambiguity of meaning based on ways in which words are generally used.

AI Approaches To Speech Recognition

1. Go to the refrigerator and get me a book. Syntactically correct but semantically inconsistent.

2. The bears killed the rams. Can be interpreted in two pragmatically different ways (jungle / football).

3. Power plants colorless happily old. Syntactically unacceptable and semantically meaningless.

4. Good ideas often run when least expected. Semantically inconsistent (corrected by replacing "run" with "come").

AI Approaches To Speech Recognition

Several ways to integrate knowledge sources within a speech recognizer.

1 Bottom-Up

2 Top-Down

3 Black Board

AI Approaches To Speech Recognition

• “Bottom-Up” Approach: The lowest level processes (feature detection, phonetic decoding) precede higher level processes (lexical coding) in a sequential manner

AI Approaches To Speech Recognition

• “Top-Down” Approach: the language model generates word hypotheses that are matched against the speech signal, and syntactically and semantically meaningful sentences are built up on the basis of the word match scores.

AI Approaches To Speech Recognition

• “Black Board” Approach: All knowledge sources (KS) are considered independent

