The Process of Speech Production and Perception in Human Beings
Speech Generation
• The production process (generation) begins when the talker formulates a message in his mind that he wants to transmit to the listener via speech.
• In the case of a machine:
– First step: formation of the message in terms of printed text.
– Next step: conversion of the message into a language code.
Speech Generation
• After the language code is chosen, the talker must execute a series of neuromuscular commands to cause the vocal cords to vibrate such that the proper sequence of speech sounds is created.
• The neuromuscular commands must simultaneously control the movement of the lips, jaw, tongue, and velum.
Speech Perception
• Once the speech signal is generated and propagated to the listener, the speech perception (recognition) process begins.
• First, the listener processes the acoustic signal along the basilar membrane in the inner ear, which provides a running spectral analysis of the incoming signal.
Sound Perception
• The audible frequency range for humans is approximately 20 Hz to 20 kHz.
• The three distinct parts of the ear are the outer ear, the middle ear, and the inner ear.
• Outer ear:
– Includes the pinna and the auditory canal.
– Helps direct an incident sound wave toward the middle ear.
– Filters and modifies the captured sound.
• Outer ear:
– The perceived sound is sensitive to the pinna's shape.
– Changing the pinna's shape alters the sound quality as well as the background noise.
– After passing through the ear canal, the sound wave strikes the eardrum, which is part of the middle ear.
• Middle ear:
– Eardrum: oscillates at the same frequency as the incoming sound wave.
– Movements of this membrane are transmitted through a system of small bones called the ossicular system.
– The ossicular system conducts the vibrations to the cochlea.
– This arrangement achieves an efficient form of impedance matching between the air and the cochlear fluid.
• Inner ear:
– It consists of two membranes, Reissner's membrane and the basilar membrane.
– When vibrations enter the cochlea, they stimulate 20,000 to 30,000 stiff hairs on the basilar membrane.
– These hairs in turn vibrate and generate electrical signals that travel to the brain, where they are perceived as sound.
[Figure: anatomy of the ear, showing the pinna, auditory canal, tympanic membrane (eardrum), ossicular system, cochlea, and basilar membrane]
Speech Perception
• A neural transduction process converts the spectral signal into activity signals on the auditory nerve.
• The neural activity is converted into a language code in the brain.
• Finally, message comprehension (understanding of meaning) is achieved.
In principle, any coding technique could be used to transmit the acoustic waveform from the talker to the listener.
The Speech-Production Process
[Figure: longitudinal cross-section of the vocal tract, from the opening of the vocal cords to the lips]
The Speech-Production Process
• The vocal tract consists of:
– The pharynx: the connection from the esophagus to the mouth.
– The mouth, or oral cavity.
• The total length of the vocal tract is about 17 cm (a worked resonance example follows this list).
• The cross-sectional area, determined by the positions of the tongue, lips, jaw, and velum, varies from zero (complete closure) to about 20 cm².
• The nasal tract begins at the velum and ends at the nostrils.
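As a quick illustration of why these dimensions matter (a calculation not given in the slides): approximating the vocal tract as a uniform lossless tube, closed at the glottis and open at the lips, predicts resonances at odd quarter-wavelength frequencies. A minimal Python sketch:

```python
# Quarter-wavelength resonances of a uniform lossless tube closed at the
# glottis and open at the lips -- a standard first approximation of the
# vocal tract, not a model taken from these slides.
SPEED_OF_SOUND = 34300.0  # cm/s in warm, moist air (approximate)
TRACT_LENGTH = 17.0       # cm, the vocal-tract length quoted above

def tube_formants(n_formants=3, length_cm=TRACT_LENGTH):
    """Return the first resonance frequencies F_n = (2n - 1) * c / (4L) in Hz."""
    return [(2 * n - 1) * SPEED_OF_SOUND / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]

print(tube_formants())  # -> roughly [504, 1513, 2522] Hz, close to a neutral vowel
```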
The Speech-Production Process
• Velum: a trap-door-like mechanism at the back of the mouth cavity.
• When the velum is lowered, the nasal tract is acoustically coupled to the vocal tract to produce the nasal sounds of speech.
The Speech-Production Process
• The lungs and the associated muscles act as the source of excitation for the vocal mechanism.
• The muscle force pushes air out of the lungs and through the bronchi and trachea.
• When the vocal cords are tensed, the airflow causes them to vibrate, producing the so-called voiced speech sounds.
• When the vocal cords are relaxed, the air must pass through a constriction in the vocal tract to produce sound, yielding unvoiced sounds.
• Speech is produced as a sequence of such sounds.
Representing Speech In The Time and Frequency Domains
The speech signal is a slowly time-varying signal: over a sufficiently short interval its characteristics are essentially stationary.
Representing Speech In The Time and Frequency Domains
Depending upon the state of the vocal cords, the events in speech are classified as (a heuristic classifier sketch follows the list):
– Silence (S): no speech is produced.
– Unvoiced (U): the vocal cords are not vibrating, so the resulting speech waveform is aperiodic or random in nature.
– Voiced (V): the vocal cords are tensed and therefore vibrate periodically as air flows from the lungs, so the resulting speech waveform is quasi-periodic (the pulses are not strictly periodic, but vary slightly from cycle to cycle).
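A common textbook heuristic for this three-way decision (an illustrative sketch, not an algorithm given in these slides) uses short-time energy and zero-crossing rate: voiced frames show high energy and few zero crossings, unvoiced frames show many zero crossings, and silence shows negligible energy. The thresholds below are hypothetical and signal-dependent:

```python
import numpy as np

def classify_frames(signal, frame_len=240, energy_thresh=0.01, zcr_thresh=0.25):
    """Label each frame 'S' (silence), 'U' (unvoiced), or 'V' (voiced)
    using short-time energy and zero-crossing rate. Both thresholds are
    illustrative placeholders that would need tuning on real speech."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)                          # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # crossings per sample
        if energy < energy_thresh:
            labels.append('S')   # negligible energy: silence
        elif zcr > zcr_thresh:
            labels.append('U')   # noise-like, many crossings: unvoiced
        else:
            labels.append('V')   # quasi-periodic, few crossings: voiced
    return labels
```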
Representing Speech In The Time and Frequency Domains
• Spectral representation: an alternative way of characterizing the speech signal and representing the information associated with the sounds.
• Sound spectrogram (the most popular representation): a three-dimensional representation in which speech intensity in different frequency bands is portrayed over time.
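Such a spectrogram can be computed with a short-time Fourier transform. A minimal sketch using SciPy; the sampling rate, 25 ms window, and 10 ms hop are common but assumed values, and the sine wave merely stands in for a real speech signal:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                  # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                      # one second of time samples
x = np.sin(2 * np.pi * 440 * t)             # stand-in for a speech waveform

# 25 ms windows (400 samples) with a 10 ms hop (overlap of 240 samples)
f, times, Sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=240)
intensity_db = 10 * np.log10(Sxx + 1e-12)   # log intensity per band per frame
print(intensity_db.shape)                   # (frequency bins, time frames)
```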
Speech Sounds and Features
[Figure: vowel chart showing the front vowels i (IY), I (IH), e (EH), æ (AE)]
Phonetics
Phonetics (from the Greek word φωνή, phone = sound/voice) is the study of sounds (voice). It is concerned with the actual properties of speech sounds (phones), as well as those of non-speech sounds, and with their production, audition, and perception.
Phonetics has three main branches:
• articulatory phonetics, concerned with the positions and movements of the lips, tongue, vocal tract and folds, and other speech organs in producing speech;
• acoustic phonetics, concerned with the properties of the sound waves and how they are received by the inner ear;
• auditory phonetics, concerned with speech perception, principally how the brain forms perceptual representations of the input it receives.
• Phoneme (linguistics): one of a small set of speech sounds that are distinguished by the speakers of a particular language.
• Syllable: a unit of spoken language larger than a phoneme.
Speech Sounds and Features
The vowels:
• The vowel sounds are an interesting class of sounds in English.
• Practical speech-recognition systems rely heavily on vowel recognition to achieve high performance.
• If we omit the vowel letters from a sentence, the resulting text is still easy to decode.
• But if we omit the consonant letters, the resulting text is not decodable.
Speech Sounds and Features
Example sentence: "They noted significant improvements in the company's image, supervision, their working conditions, benefits and opportunities for growth."
Speech Sounds and Features
• In speaking, vowels are produced by exciting an essentially fixed vocal-tract shape with quasi-periodic pulses of air caused by the vibration of the vocal cords.
• The way in which the cross-sectional area varies along the vocal tract determines the resonant frequencies of the tract (the formants) and thereby the sound that is produced.
Speech Sounds and Features
• The vowel sound produced is determined primarily by the position of the tongue.
• The positions of the jaw, lips, and, to a small extent, the velum also influence the resulting sound.
Speech Sounds and Features
• Vowels are long in duration compared to consonant sounds.
• They are spectrally well defined.
• They are easily and reliably recognized, both by machines and by humans.
Speech Sounds and Features
A convenient and simplified way of classifying vowel articulatory configuration is in terms of tongue hump position (front, mid, back) and tongue hump height (high, mid, low).
Speech Sounds and Features
The concept of a “typical” vowel sound is unreasonable in light of the variability of vowel pronunciation among men, women and children with different regional accents and other variable characteristics.
Diphthong
• In phonetics, a diphthong (Greek δίφθογγος, "diphthongos", literally "with two sounds") is a vowel combination involving a quick but smooth movement from one vowel to another.
• It is often interpreted by listeners as a single vowel sound or phoneme.
• While "pure" vowels, or monophthongs, are said to have one target tongue position, diphthongs have two target tongue positions.
• Pure vowels are represented in the International Phonetic Alphabet by one symbol: English "sum" as [səm], for example.
• Diphthongs are represented by two symbols, for example English "same" as [seɪm], where the two vowel symbols are intended to represent approximately the beginning and ending tongue positions.
Diphthongs in British English:
• [əʊ] as in hope
• [aʊ] as in house
• [aɪ] as in kite
• [eɪ] as in same
• [juː] as in few
• [ɔɪ] as in join
• [ɪə] as in fear
• [ɛə] as in hair
• [ʊə] as in poor
Speech Sounds and Features
Semivowels (vowel-like sounds that serve as consonants):
• The group of sounds consisting of /w/, /l/, /r/, and /y/ is quite difficult to characterize. These sounds are called semivowels because of their vowel-like nature.
• They are similar in nature to the vowels and diphthongs.
• Example: in "how", the /w/ sounds close to the "oo" in "boot".
The English word "well" would sound the same if it were spelled "uell" or "ooell". But this must be considered the other way around: the fact that the spellings "uell" or "ooell" might also be pronounced with /w/ does not mean that /w/ is a vowel like the one in "boot". The graphical form has nothing to do with phonetics; set it aside and you will find that /w/ is a consonant and /u(:)/ a vowel.
Speech Sounds and Features
Nasal Consonants
• The nasal consonants /m/, /n/, and /ŋ/ are produced with glottal excitation and the vocal tract totally constricted at some point along the oral passageway.
• The velum is lowered so that air flows through the nasal tract, with sound being radiated at the nostrils.
• The oral cavity, although constricted toward the front, is still acoustically coupled to the pharynx.
• The mouth thus serves as a resonant cavity that traps acoustic energy at certain natural frequencies.
Voiced consonant
• A voiced consonant is a sound made while the vocal cords vibrate, as opposed to a voiceless consonant, for which the vocal cords are relaxed. See phonation for a continuum of degrees of tension in the vocal cords.
• Examples of voiced/voiceless pairs of consonants:
– [b] / [p]
– [d] / [t]
– [g] / [k]
Voiced consonant
If you place your fingers on your voice box, you can feel a buzz when you pronounce "zzzz", but not when you pronounce "ssss". That buzz is the vibration of your vocal cords. Apart from this, the sounds [s] and [z] are practically identical, with the same use of tongue and lips.
Unvoiced consonant
• In phonetics, a voiceless consonant is a consonant that does not have voicing; that is, it is produced without vibration of the vocal cords.
• Voiceless consonants are usually articulated more strongly than their voiced counterparts, because in voiced consonants the energy used in pronunciation is split between the laryngeal vibration and the oral articulation.
• Examples: the voiceless stops in "peculiar" and "particular".
Speech Sounds and Features
Voiced Fricatives:
• The voiced fricatives are /v/, /th/, /z/, and /zh/.
• Their unvoiced counterparts are the fricatives /f/, /θ/, /s/, and /sh/, respectively.
• For voiced fricatives the vocal cords are vibrating.
• Since the vocal tract is constricted (narrowed) at some point forward of the glottis, the air flow becomes turbulent in the neighborhood of the constriction.
Speech Sounds and Features
Voiced Stops:
• The voiced stop consonants /b/, /d/, and /g/ are transient, noncontinuant sounds produced by building up pressure behind a total constriction somewhere in the oral tract and then suddenly releasing the pressure.
• For /b/ the constriction is at the lips; for /d/ it is at the back of the teeth; and for /g/ it is near the velum.
Unvoiced Stops:
• The unvoiced stop consonants are /p/, /t/, and /k/.
Approaches to Automatic Speech Recognition by Machine
Three approaches:
– The acoustic-phonetic approach
– The pattern-recognition approach
– The artificial intelligence approach
Approaches to Automatic Speech Recognition by Machine
• The smallest linguistically distinct unit in speech is called a phoneme; it carries no meaning by itself, but it makes it possible to discriminate between different words.
• Speech signals are sequences of linguistically separable units (phonemes).
• Transitions between phonemes are mostly relatively smooth.
• A phone signifies the physical sound that is produced when a phoneme is uttered.
• One phoneme can be pronounced in different ways; the similar variant phones of a single phoneme are called its allophones.
Approaches to Automatic Speech Recognition by Machine
The acoustic-phonetic approach is based on the theory of acoustic phonetics.
• First step: segmentation and labeling.
• Second step: determining a valid word.
Approaches to Automatic Speech Recognition by Machine
The pattern-recognition approach to speech recognition is basically one in which the speech patterns are used directly, without explicit feature determination and segmentation.
• Step one: training of speech patterns.
• Step two: recognition of patterns via pattern comparison.
• Speech "knowledge" is brought into the system via the training procedure.
Approaches to Automatic Speech Recognition by Machine
Pattern recognition is the method of choice for speech recognition for three reasons:
1. Simplicity of use. The method is easy to understand, it is rich in mathematical and communication-theory justification for the individual procedures used in training and decoding, and it is widely used and understood.
2. Robustness and invariance to different speech vocabularies, users, feature sets, pattern-comparison algorithms, and decision rules.
3. Proven high performance.
Approaches to Automatic Speech Recognition by Machine
The AI approach is a hybrid of the acoustic-phonetic approach and the pattern-recognition approach.
• The AI approach attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing, and finally making a decision on the measured acoustic features.
Acoustic-phonetic Approach to Speech Recognition
[Figure: block diagram of the acoustic-phonetic recognizer; the speech-analysis front end provides the spectral representation]
Acoustic-phonetic Approach to Speech Recognition
1. Speech analysis system (common to all approaches to speech recognition): provides an appropriate representation of the characteristics of the time-varying speech signal. A commonly used technique is the linear predictive coding (LPC) method (a sketch follows the feature list below).
2. Feature-detection stage: converts the spectral measurements to a set of features that describe the acoustic properties of the different phonetic units. Typical features:
– Nasality: presence or absence of nasal resonance.
– Frication: presence or absence of random excitation in the speech.
– Formant locations: frequencies of the first three resonances.
– Voiced/unvoiced classification: periodic versus aperiodic excitation.
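Since LPC is named above as the analysis technique, here is a minimal sketch of the autocorrelation method with the Levinson-Durbin recursion, applied to one windowed frame; the predictor order of 10 is an illustrative choice:

```python
import numpy as np

def lpc(frame, order=10):
    """LPC coefficients a[0..order] (with a[0] = 1) for one windowed frame,
    computed by the autocorrelation method with the Levinson-Durbin
    recursion. The frame must be longer than the predictor order."""
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n - 1:n + order]  # r[0..order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                       # prediction-error energy
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]                     # order update; sets a[i] = k
        err *= (1.0 - k * k)
    return a
```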
Acoustic-phonetic Approach to Speech Recognition
3. Segmentation and labeling phase: the system tries to find stable regions and then labels each segmented region according to how well the features within that region match those of individual phonetic units (see the sketch after this list).
– This stage is the heart of the acoustic-phonetic recognizer and is the most difficult one to carry out reliably.
– Therefore, various control strategies are used to limit the range of segmentation points and label possibilities.
4. The final output of the recognizer is the word or word sequence that best matches, in some well-defined sense, the sequence of phonetic labels.
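One simple way to look for the "stable regions" of step 3 (an illustrative heuristic, not the slides' method) is to place candidate segment boundaries wherever the frame-to-frame distance between feature vectors spikes:

```python
import numpy as np

def segment_boundaries(features, threshold=1.0):
    """features: array of shape (num_frames, dim), e.g. LPC or filter-bank
    vectors per frame. Returns the frame indices where spectral change
    exceeds a (hypothetical) threshold; the runs between boundaries are
    candidate stable regions to be labeled with phonetic units."""
    change = np.linalg.norm(np.diff(features, axis=0), axis=1)  # per-frame change
    return [i + 1 for i, d in enumerate(change) if d > threshold]
```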
Speech Sound Classifier
[Figure: speech-sound classification diagram]
Acoustic Phonetic Vowel Classifier
[Figure: vowel-classification decision tree based on F1, F2, and segment duration]
Acoustic Phonetic Vowel Classifier
• Three features are measured over the segment: the first formant F1, the second formant F2, and the duration of the segment, D.
• The first test separates vowels with low F1 from vowels with high F1.
• Each of these subsets can be split further into high-F2 and low-F2 groups on the basis of the F2 measurement.
• The third test is based on segment duration, which separates tense vowels (large values of D) from lax vowels (small values of D).
• Finally, a finer test on formant values separates the remaining unresolved vowels, resolving them into flat vowels and plain vowels. (A toy implementation is sketched below.)
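This decision sequence maps naturally onto a small decision tree. A toy sketch, where every threshold (F1 and F2 in Hz, duration in ms) is a hypothetical placeholder rather than a value from the slides:

```python
def classify_vowel(f1, f2, duration_ms):
    """Toy version of the F1 -> F2 -> duration decision sequence described
    above. All thresholds are illustrative placeholders, not measured values."""
    f1_group = 'high_F1' if f1 > 500.0 else 'low_F1'        # first test: F1
    f2_group = 'high_F2' if f2 > 1500.0 else 'low_F2'       # second test: F2
    tenseness = 'tense' if duration_ms > 120.0 else 'lax'   # third test: duration D
    # A full classifier would finish with a finer test on the formant values
    # to resolve the remaining vowels; here we just report the branch taken.
    return (f1_group, f2_group, tenseness)

print(classify_vowel(f1=300, f2=2300, duration_ms=160))  # an /iy/-like branch
```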
Problems in the Acoustic-Phonetic Approach
• The method requires extensive knowledge of the acoustic properties of phonetic units.
• For most systems the choice of features is based on intuition and is not optimal in a well-defined and meaningful sense.
• The design of the sound classifiers is also not optimal.
• No well-defined, automatic procedure exists for tuning the method on real, labeled speech.
Statistical Pattern-Recognition Approach to Speech Recognition
[Figure: block diagram of the pattern-recognition model: feature measurement, pattern training, pattern classification, decision logic]
Statistical Pattern-Recognition Approach to Speech Recognition
• Feature measurement: a sequence of measurements is made on the input signal to define the "test pattern". The feature measurements are usually the output of some spectral analysis technique, such as a filter-bank analyzer, LPC analysis, or a DFT analysis.
• Pattern training: creates a reference pattern for each sound class, called a template.
• Pattern classification: the unknown test pattern is compared with the reference pattern of each sound class, and a measure of distance between the test pattern and each reference pattern is computed (a sketch follows this list).
• Decision logic: the reference-pattern similarity (distance) scores are used to decide which reference pattern best matches the unknown test pattern.
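The classification and decision blocks reduce to computing a distance between the test pattern and every stored template and choosing the smallest. A minimal sketch using dynamic time warping, a standard distance measure for this framework (the slides do not prescribe a specific one):

```python
import numpy as np

def dtw_distance(test, template):
    """Dynamic-time-warping distance between two feature sequences, each an
    array of shape (num_frames, dim). Standard O(N*M) recurrence with a
    Euclidean frame-to-frame cost."""
    n, m = len(test), len(template)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(test[i - 1] - template[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(test, templates):
    """Decision logic: return the label of the best-matching reference pattern.
    `templates` maps class labels to reference feature sequences."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))
```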
The Strengths and Weaknesses of the Pattern-Recognition Model
• The performance of the system is sensitive to the amount of training data available for creating sound-class reference patterns (more training generally means higher performance).
• The reference patterns are sensitive to the speaking environment and to the transmission characteristics of the medium used to create the speech (because speech spectral characteristics are affected by transmission and background noise).
• The method is relatively insensitive to syntax and semantics.
The Strengths and Weaknesses of the Pattern-Recognition Model
• The system is insensitive to the sound class, so the same techniques can be applied to a wide range of speech sounds (up to whole phrases).
• It is easy to incorporate syntactic and semantic constraints directly into the pattern-recognition structure, thereby improving recognition accuracy and reducing computation.
AI Approaches To Speech Recognition
The basic idea of the AI approach is to compile and incorporate knowledge from a variety of knowledge sources to solve the problem:
• Acoustic knowledge: knowledge related to sound or the sense of hearing.
• Lexical knowledge: knowledge of the words of the language (decomposing words into sounds).
• Syntactic knowledge: knowledge of syntax.
• Semantic knowledge: knowledge of the meaning of the language.
• Pragmatic knowledge: (sense derived from meaning) the inference ability necessary to resolve ambiguity of meaning based on the ways in which words are generally used.
AI Approaches To Speech Recognition
1. "Go to the refrigerator and get me a book." (syntactically correct but semantically inconsistent)
2. "The bears killed the rams." (can be interpreted in two pragmatically different ways: jungle / football)
3. "Power plants colorless happily old." (syntactically unacceptable and semantically meaningless)
4. "Good ideas often run when least expected." (semantically inconsistent; corrected by replacing "run" with "come")
AI Approaches To Speech Recognition
There are several ways to integrate knowledge sources within a speech recognizer:
1. Bottom-up
2. Top-down
3. Blackboard
AI Approaches To Speech Recognition
• "Bottom-up" approach: the lowest-level processes (feature detection, phonetic decoding) precede higher-level processes (lexical coding) in a sequential manner.
AI Approaches To Speech Recognition
• "Top-down" approach: the language model generates word hypotheses that are matched against the speech signal, and syntactically and semantically meaningful sentences are built up on the basis of word-match scores.
AI Approaches To Speech Recognition
• "Blackboard" approach: all knowledge sources (KSs) are considered independent and communicate through a shared data structure, the blackboard.