A Temporal Network of Support Vector Machines for the Recognition of Visual Speech

Mihaela Gordan*, Constantine Kotropoulos**, Ioannis Pitas**

*Faculty of Electronics and Telecommunications, Technical University of Cluj-Napoca, 15 C. Daicoviciu, 3400 Cluj-Napoca, Romania

**Department of Informatics, Artificial Intelligence and Information Analysis Laboratory, Aristotle University of Thessaloniki, Box 451, GR-54006 Thessaloniki, Greece

This work was supported by the European Union Research Training Network "Multi-modal Human-Computer Interaction (HPRN-CT-2000-00111)".

Department of Informatics, Aristotle University of Thessaloniki

Brief Overview

• Visual speech recognition (lipreading): an important component of audiovisual speech recognition systems and an emerging research field.

• Support vector machines (SVMs): powerful classifiers for various visual classification tasks (face recognition, medical image processing, object tracking).

Goal of this work:
• to examine the suitability of SVMs for visual speech recognition,
• by developing an SVM-based visual speech recognition system.

In brief:
• we use SVMs for viseme recognition,
• and integrate them as nodes in a Viterbi decoding lattice.
• The good results (slightly higher WRR with very simple input features; easy generalization to larger-vocabulary tasks) encourage the continuation of this research.

Contents

1. State of the art & research trends
2. Principles of the proposed visual speech recognition approach
3. SVMs and their use for mouth shape recognition
4. Modeling the temporal dynamics of visual speech
5. Block diagram of the proposed visual speech recognition system
6. Experimental results
7. Conclusions

1. State of the art & research trends

• Visual speech recognition = recognizing the spoken words based on visual examination of the speaker's face only, mainly the mouth area.

• State of the art: many methods reported, differing widely in:
• the feature types (lip contour coordinates, GLDP, gray levels of the mouth image);
• the classifier used (TDNN, HMM);
• the class definition.

• Active research trends in the area:
• find the most suitable features and classification techniques for efficient, individual-independent discrimination between different mouth shapes;
• reduce the required processing of the mouth image to increase speed;
• find solutions that facilitate easy integration of the audio and visual recognizers.

• Use of SVMs in speech recognition: recently employed in audio speech recognition with very good results; no attempts so far in visual speech recognition.

[Figure: example mouth shapes for the visemes "o" and "f"]

2. Principles of the proposed visual speech recognition approach - I

• Visemes = the basic units of visual speech: the basic shapes of the mouth during speech production.

• Discrimination between visemes is a pattern recognition problem:
• Feature vector = a representation of the mouth image (e.g. at pixel level: the gray levels of the pixels in the mouth image, scanned row by row);
• Pattern classes = the different visemes (mouth shapes) occurring during the pronunciation of the words from the dictionary.

2. Principles of the proposed visual speech recognition approach - II

• The proposed strategy: given a visual speech recognition task (i.e. a given dictionary of words),

1. Find the phonetic description of each word;
2. Derive the viseme-to-phoneme mapping according to the application (it will be one-to-many, due to the involvement of non-visible parts of the vocal tract in speech production, and dependent on the nationality of the speaker; no universal viseme-to-phoneme mapping is currently available);
3. Use the phonetic word descriptions and the viseme-to-phoneme mapping to derive visemic word descriptions (visemic models = sequences of mouth shapes that could produce the phonetic word realization).

2. Principles of the proposed visual speech recognition approach - III

[Figure: viseme-to-phoneme mapping, and phonetic and visemic word description models]

3. SVMs and their use for mouth shape recognition - I

• SVMs = statistical learning classifiers based on the optimal hyperplane algorithm:
• minimize a bound on the empirical error and the complexity of the classifier;
• capable of learning in sparse, high-dimensional spaces with few training examples.

• Classical SVMs solve 2-class pattern recognition problems, given training examples $(\mathbf{x}_i, y_i)$, $i = 1, \dots, N$, where $\mathbf{x}_i \in \mathbb{R}^M$ is an $M$-dimensional pattern and $y_i \in \{-1, +1\}$ indicates whether example $i$ is a negative/positive example.

• Linear SVMs: the data to be classified are separable in their original domain.

3. SVMs and their use for mouth shape recognition - II

• Nonlinear SVMs: the data to be classified are not separable in their original domain.
• We project the data into a higher-dimensional Hilbert space $\mathcal{H}$, where the data are linearly separable, via a nonlinear mapping $\Phi$,
• and express the dot product of the data by a kernel function: $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$.

• The decision function of the SVM classifier is:

$f(\mathbf{x}) = \operatorname{sgn}\!\left( \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right)$

where $\alpha_i$ = the non-negative Lagrange multipliers associated with the QP that maximizes the distance between the classes and the separating hyperplane, and $\mathbf{w}$, $b$ = the hyperplane's parameters.

• The real-valued output of the SVM (the argument of the sign function) gives the degree of confidence in the class assignment.

3. SVMs and their use for mouth shape recognition - III

• An SVM is a binary classifier, so we need to train one SVM for each mouth shape (viseme).

• The features used: the gray levels of the pixels in the mouth image, scanned row by row.

• The set of training patterns is common to all SVMs; only the labels assigned to each training pattern differ. Only unambiguous positive and negative examples are used.

• Training patterns (mouth images) are preprocessed for normalization with respect to scale, translation and rotation.

4. Modeling the temporal dynamics of visual speech - I

• Symbolic visemic description of a word = a left-to-right sequence of visemes; it carries no information about the relative duration of each viseme in the word realization (which is strongly person-dependent).

• Given:
– the symbolic visemic description of a word,
– the total number of frames in the word pronunciation,
we build the word model in the temporal domain by assuming any non-zero possible duration for each viseme: a temporal network of models for each symbolic visemic description, represented as a Viterbi lattice.

[Figure: the temporal network built for the symbolic visemic description of "one"]
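The model-building step above amounts to enumerating every way of splitting the T frames into non-zero viseme durations. A sketch, using the visemic model of "one" (w-ah-n) that appears on the block-diagram slide:

```python
# Sketch of building the temporal word model: enumerate every assignment
# of non-zero durations to the visemes over T frames, i.e. all
# compositions of T into len(visemes) positive parts.
from itertools import combinations

def duration_assignments(visemes, T):
    """Yield one frame-by-frame viseme sequence per way of splitting
    T frames into non-zero viseme durations."""
    n = len(visemes)
    for cuts in combinations(range(1, T), n - 1):
        bounds = (0,) + cuts + (T,)
        durations = [bounds[i + 1] - bounds[i] for i in range(n)]
        yield [v for v, d in zip(visemes, durations) for _ in range(d)]

paths = list(duration_assignments(["w", "ah", "n"], 5))
print(len(paths))   # C(4, 2) = 6 segmentations of "one" over 5 frames
print(paths[0])     # ['w', 'ah', 'n', 'n', 'n']
```

Each enumerated sequence corresponds to one connected IN-to-OUT path in the Viterbi lattice of the next slide.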

4. Modeling the temporal dynamics of visual speech - II

[Figure: Viterbi lattice d for the visemic word model w_d, T = 5, showing the IN and OUT states, nodes k and k+1, and sub-path i]

4. Modeling the temporal dynamics of visual speech - III

• Node k holds $c_{o_k}$, the measure of confidence in the realization of the viseme $o_k$ = "ah" at the timeframe $t_k = 3$: the real-valued output of the SVM trained for the recognition of the viseme $o_k$.

• Sub-path i holds $a_{o_k o_{k+1}}$, the transition probability from the state which generates $o_k$ = "ah" at timeframe $t_k = 3$ to the state which generates $o_{k+1}$ = "n" at timeframe $t_{k+1} = 4$. We assume equal transition probabilities.

• Path l = any connected path between the states IN and OUT in the Viterbi lattice.

• Confidence in path l from the Viterbi lattice d: $c_l^d$, combining the node confidences $c_{o_k}$ and the transition probabilities $a_{o_k o_{k+1}}$ along the path.

• Plausibility of producing the word model $w_d$: $c_d = \max_l c_l^d$.
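The lattice scoring can be sketched as below. Since equal transition probabilities are assumed, they drop out of the argmax and are omitted; combining node confidences by summation along the path is an assumption of this sketch, and the per-frame scores are toy values rather than real SVM outputs.

```python
# Hedged sketch of scoring one word lattice: per-frame viseme confidences
# are combined (here: summed, an assumption) along every non-zero-duration
# path; the word's plausibility is the best path score c_d = max_l c_l^d.
# Equal transition probabilities are assumed, so they are omitted.
from itertools import combinations

def word_plausibility(visemes, frame_scores):
    """frame_scores[t][v] = real-valued SVM output for viseme v at frame t.
    Returns the maximum summed confidence over all valid paths."""
    T, n = len(frame_scores), len(visemes)
    best = float("-inf")
    for cuts in combinations(range(1, T), n - 1):
        bounds = (0,) + cuts + (T,)
        score = sum(frame_scores[t][visemes[i]]
                    for i in range(n)
                    for t in range(bounds[i], bounds[i + 1]))
        best = max(best, score)
    return best

# Toy scores for T = 4 frames of a hypothetical "w"-"ah"-"n" pronunciation.
scores = [{"w": 0.9, "ah": 0.1, "n": 0.0},
          {"w": 0.2, "ah": 0.8, "n": 0.1},
          {"w": 0.0, "ah": 0.7, "n": 0.2},
          {"w": 0.0, "ah": 0.1, "n": 0.9}]
print(round(word_plausibility(["w", "ah", "n"], scores), 2))   # best path: 3.3
```

Running this for every word lattice d and taking the argmax over the resulting scores c_1 … c_D is exactly the recognition step in the block diagram that follows.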

5. Block diagram of the proposed visual speech recognition system

[Figure: block diagram. The mouth image sequence is scored by the viseme SVMs (SVM_oa, SVM_w, SVM_ah, SVM_n, SVM_f, …) inside one Viterbi lattice per word model — "one" = (w|oa)-ah-n, "four" = f-ao-r, … — yielding the confidences c_1, c_2, …, c_D; the recognized word is i = argmax_d c_d, e.g. i = 1, word "one".]

• Task to be solved: visual speech recognition of the first four digits in English.
• Experimental data: the visual part of the Tulips1 audiovisual speech database.
• Implementation:
• in C++, using the publicly available SVMLight toolkit;
• writing the code for the Viterbi algorithm and the additional modules, and integrating them into the visual speech recognizer.
• Training strategy: 12 SVMs (one for each viseme class) with a polynomial kernel of degree 3, C = 1000.
• Test strategy: leave-one-out protocol — train the system 12 times on 11 subjects, each time leaving out one subject for testing; 24 test sequences/word × 4 words = 96 test sequences.
• Performance evaluation, in terms of:
• overall (average) WRR, compared to similar results from the literature;
• 95% confidence intervals for the WRR of the proposed approach and for the WRR of similar approaches from the literature.
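The confidence-interval computation can be sketched as below. The normal-approximation binomial interval is an assumption about how such intervals are typically computed, and the count 73/96 is a made-up illustrative number, not the paper's result.

```python
# Sketch of the performance evaluation: WRR over N test sequences with a
# 95% confidence interval via the normal-approximation binomial interval
# (an assumption). The 73/96 count is illustrative, not a reported result.
import math

def wrr_confidence_interval(correct, total, z=1.96):
    """95% CI for the word recognition rate (normal approximation)."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p - half, p + half

lo, hi = wrr_confidence_interval(73, 96)
print(round(lo, 3), round(hi, 3))
```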


6. Experimental results - I

• Comparison:
• slightly higher WRR and confidence intervals compared to the literature;
• exception: lower WRR than the best result reported without delta features (87.5%), due to a much better localization of the ROI around the lip contour in that case. However, our computational complexity is much lower (no need to redefine the ROI in each frame).

6. Experimental results - II

7. Conclusions

• We examined the suitability of SVM classifiers for visual speech recognition.
• The temporal character of speech was modeled by integrating SVMs with real-valued outputs as nodes in a Viterbi decoding lattice.
• Performance evaluation of the system on a small visual speech recognition task shows:
– better WRR than the ones reported in the literature,
– even with very simple features: the gray levels of the mouth image used directly;
so SVMs are a promising tool for visual speech recognition applications.
• Future research goals: increase the WRR by including delta features, examining other SVM kernels, and learning the state transition probabilities in the Viterbi decoding lattice.

