
Vassilis Pitsikalis, Stavros Theodorakis and Petros Maragos
School of ECE, National Technical University of Athens, 15773, Greece

Data-Driven Sub-Units and Modeling Structure for Continuous Sign Language Recognition with Multiple Cues

2. Outline-Contributions

Visual Front-End + Feature Extraction: hands' centroid movement-position measurements (2D): positions, derivatives, velocity, dynamics.

Data-Driven Sub-Unit Construction exploiting Multiple Cues

Model-based sign segmentation + Dynamic vs. Static Labels

Dynamic vs. Static Specific Modeling: Features, Architecture, Parameters.

HMM Recognition framework + Evaluation

1. Sign Language – Motivation

Visual patterns are formed by hand shapes, manual or general body motion, and facial expressions.

A word in spoken language corresponds to a sign.

Cues: position, movement, hand shape, orientation, facial expression.

Sub-Units: of a different nature compared to the speech phoneme unit; lack of annotated data.

Focus on automatic, data-driven modeling of sub-units without any linguistic/phonological information.

Goal: continuous Sign Language Recognition.

3. Visual Front-End

Hand and Head Detection

Probabilistic skin color model,
Color force in a Geodesic Active Regions model [2]; a skin-model sketch follows below.
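The poster does not spell out the skin model, so the following is a minimal sketch assuming a single Gaussian over Cb/Cr chrominance; the function names are hypothetical, and the resulting probability map would play the role of the color force driving the Geodesic Active Regions segmentation [2].

```python
import numpy as np

def fit_skin_model(cbcr_samples):
    """Fit a single Gaussian to (N, 2) [Cb, Cr] samples of labeled skin pixels."""
    mu = cbcr_samples.mean(axis=0)
    cov = np.cov(cbcr_samples, rowvar=False)
    return mu, np.linalg.inv(cov), 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

def skin_probability_map(cbcr_image, mu, cov_inv, norm):
    """Per-pixel skin likelihood for an (H, W, 2) Cb/Cr image."""
    d = cbcr_image - mu
    maha = np.einsum('hwi,ij,hwj->hw', d, cov_inv, d)  # squared Mahalanobis distance
    return norm * np.exp(-0.5 * maha)
```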

Occlusion Handling

Features: Movement-Position of the hands.

Effect of simple features on sub-unit modeling.

Movement-Position: Among the main cues that describe a sign [4, 9].

Fitting ellipses to hands after segmentation; a movement-position feature sketch follows below.
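A minimal sketch of the movement-position features listed above (2D centroid positions plus their derivatives), assuming simple finite differences on the tracked centroids; the function name is illustrative, and the 60 fps default just matches the BU400 frame rate quoted in Sec. 12.

```python
import numpy as np

def movement_position_features(centroids, fps=60.0):
    """Stack (T, 2) hand centroid positions with first and second derivatives.

    Returns a (T, 6) feature matrix [x, y, vx, vy, ax, ay].
    """
    dt = 1.0 / fps
    vel = np.gradient(centroids, dt, axis=0)  # velocity (first derivative)
    acc = np.gradient(vel, dt, axis=0)        # dynamics (second derivative)
    return np.hstack([centroids, vel, acc])
```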

4. Overview

7. Feature Normalizations

(Figure: X and Y feature trajectories under the normalizations.)

8. Subunit Construction

D/S Modeling Structure: appropriate feature/model architecture for each case.

Feature normalizations yield intuitive sub-units per cue.

(Figures: example sub-unit clusters SU-1 to SU-4 on X-Y plots for the direction, scale, and position cues; a normalize-and-cluster sketch follows below.)
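The exact normalizations and clustering are not given on this panel, so the following is only a stand-in sketch: segments are length-normalized by resampling, a direction-style normalization removes the start point and scale, and plain k-means groups segments into SU labels; an HMM-based sequence clustering (cf. [12]) would be the heavier alternative. All names and parameters are illustrative.

```python
import numpy as np

def resample(traj, n=10):
    """Linearly resample a (T, 2) trajectory to n points."""
    t, tn = np.linspace(0, 1, len(traj)), np.linspace(0, 1, n)
    return np.column_stack([np.interp(tn, t, traj[:, d]) for d in range(2)])

def normalize_direction(traj):
    """Remove start point and scale so only the movement direction remains."""
    traj = traj - traj[0]
    span = np.abs(traj).max()
    return traj / span if span > 0 else traj

def cluster_subunits(segments, k=4, iters=50, seed=0):
    """K-means over normalized, resampled segments; label j maps to SU-(j+1)."""
    X = np.array([normalize_direction(resample(s)).ravel() for s in segments])
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None]) ** 2).sum(-1)  # (N, k) distances
        labels = dist.argmin(1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels
```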

10. SU-Sequence to Sign Mapping

Apparent signs are mapped to effective signs via a similarity metric.

(Figure: similarity metric between apparent and effective signs; see the sketch below.)
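The similarity metric itself is not defined on this panel; since the reference list includes the probabilistic distance for HMMs [7], here is a minimal sketch of that symmetrized distance, demonstrated on toy single-Gaussian HMMs. The GaussianHMM class and all parameter values are illustrative, not the poster's sign models.

```python
import numpy as np

class GaussianHMM:
    """Minimal HMM with 1-D Gaussian emissions, for illustration only."""
    def __init__(self, A, means, stds, pi):
        self.A, self.means, self.stds, self.pi = map(np.asarray, (A, means, stds, pi))

    def sample(self, T, seed=0):
        rng = np.random.default_rng(seed)
        s, obs = rng.choice(len(self.pi), p=self.pi), []
        for _ in range(T):
            obs.append(rng.normal(self.means[s], self.stds[s]))
            s = rng.choice(len(self.pi), p=self.A[s])
        return np.array(obs)

    def loglik(self, obs):
        """Log-likelihood via the scaled forward algorithm [11]."""
        B = np.exp(-0.5 * ((obs[:, None] - self.means) / self.stds) ** 2) \
            / (self.stds * np.sqrt(2 * np.pi))  # (T, N) emission probabilities
        alpha, ll = self.pi * B[0], 0.0
        for t in range(len(obs)):
            if t > 0:
                alpha = (alpha @ self.A) * B[t]
            c = alpha.sum()
            ll += np.log(c)
            alpha /= c
        return ll

def hmm_distance(lam1, lam2, T=500):
    """Symmetrized probabilistic distance between two HMMs, after [7]."""
    o1, o2 = lam1.sample(T, seed=1), lam2.sample(T, seed=2)
    d12 = (lam2.loglik(o2) - lam1.loglik(o2)) / T  # how much worse lam1 fits lam2's data
    d21 = (lam1.loglik(o1) - lam2.loglik(o1)) / T
    return 0.5 * (d12 + d21)

lam_a = GaussianHMM([[0.9, 0.1], [0.1, 0.9]], [0.0, 3.0], [1.0, 1.0], [0.5, 0.5])
lam_b = GaussianHMM([[0.5, 0.5], [0.5, 0.5]], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(hmm_distance(lam_a, lam_b))  # larger value = less similar models
```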

References

[1] B. Bauer and K. F. Kraiss. 2001. Towards an automatic sign language recognition system using subunits. In Proc. of Int'l Gesture Workshop, 2298:64–75.

[2] O. Diamanti and P. Maragos. 2008. Geodesic active regions for segmentation and tracking of human gestures in sign language videos. In Proc. Int'l Conf. Image Processing (ICIP), 2008.

[3] P. Dreuw, C. Neidle, V. Athitsos, S. Sclaroff, and H. Ney. 2008. Benchmark databases for video-based automatic sign language recognition. In Proc. Int'l Conf. on Language Resources and Evaluation (LREC), May.

[4] K. Emmorey. 2002. Language, Cognition, and the Brain: Insights from Sign Language Research. Erlbaum.

[5] G. Fang, X. Gao, W. Gao, and Y. Chen. 2004. A novel approach to automatically extracting basic units from Chinese sign language. In Proc. Int'l Conf. Pattern Recognition, 2004.

[6] J. Han, G. Awad, and A. Sutherland. 2009. Modelling and segmenting subunits for sign language recognition based on hand motion analysis. Pattern Recognition Letters, 30(6):623–633.

[7] B. H. Juang and L. R. Rabiner. 1985. A probabilistic distance measure for hidden Markov models. AT&T Technical Journal.

[8] S. K. Liddell and R. E. Johnson. 1989. American Sign Language: The phonological base. Sign Language Studies, 64:195–277.

[9] S. Ong and S. Ranganath. 2005. Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(6):873–891.

[10] N. Paragios and R. Deriche. 2002. Geodesic active regions: A new framework to deal with frame partition problems in computer vision. Journal of Visual Communication and Image Representation, 13(1/2):249–268.

[11] L. R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286.

[12] P. Smyth. 1997. Clustering sequences with hidden Markov models. In Advances in Neural Information Processing Systems, 9:648–654.

[13] C. Vogler and D. Metaxas. 2003. Handshapes and movements: Multiple-channel American Sign Language recognition. In Proc. of Int'l Gesture Workshop, 247–258.

12. SL Recognition Experiments

Continuous American Sign Language [3]: 843 utterances, 406 words, 4 signers, uniform background. Sign-level transcriptions; English glosses; annotated start/end points.

BU400 HQ: 6 videos, 648×484 frames, 60 fps. Most frequent glosses (94); cross-validation; 60–40% train/test split. A sign accuracy sketch follows below.
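Assuming sign accuracy follows the usual (N − S − D − I)/N convention over a Levenshtein alignment of recognized against reference gloss sequences (the poster does not state the formula), a minimal sketch:

```python
def sign_accuracy(ref, hyp):
    """Sign accuracy (N - S - D - I) / N, in percent; can go negative."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # deletions
    for j in range(m + 1):
        d[0][j] = j  # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * (n - d[n][m]) / n  # edit distance = S + D + I

print(sign_accuracy(["BECAUSE", "BETTER"], ["BECAUSE", "BETTER"]))  # 100.0
```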

For further information

Please contact [sth, vpitsik, maragos]@cs.ntua.gr. More information can be found at http://cvsp.cs.ntua.gr and http://www.dictasign.eu.

Acknowledgments

This research work was supported by the EU under the research program Dictasign with grant FP7-ICT-3-231135. We also wish to thank Boston University and C. Neidle for providing the BU400 database.


6. Segmentation Sample

(Figure: a sample sign segmented into static and dynamic segments, with segmentation points marked.)

11. SU Sequence to Sign Map: Sample

(Figures: X-coordinate position and dynamics trajectories labeled with sub-units SU-1 to SU-4, with a zoomed view. Plots: sign accuracy % vs. number of clusters in static SUs (K=21); sign accuracy % for the position cue; SU accuracy % and discrimination factor for model configurations such as D2S2 and D1S4.)

Subunit instances (3) shown schematically on superimposed initial and final corresponding frames.

9. Lexicon Sample

Gloss : SU-Sequence (features: dynamic + static), for three feature configurations; a lexicon-lookup sketch follows the table.

D + P:
BECAUSE P1: HP10
BECAUSE P2: HP10 MD1
BECAUSE P3: HP10 MD1 HP10
BETTER P3: MD1 HP2
BETTER P4: MD1 HP8

S + P:
BECAUSE P1: HP10
BECAUSE P2: HP10 MS1
BECAUSE P3: HP10 MS1 HP10
BETTER P3: MS1 HP2
BETTER P4: MS1 HP8

MSPn + P:
BECAUSE P1: HP10
BECAUSE P2: HP10 MSPn1
BECAUSE P3: HP10 MSPn1 HP10
BETTER P3: MSPn1 HP2
BETTER P4: MSPn1 HP8
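A sketch encoding the D + P column of the sample above as a lookup table from gloss and pronunciation variant to SU sequence; the dictionary layout and helper name are illustrative, not the poster's data structure.

```python
# HP* are static position sub-units; MD1 is a dynamic movement sub-unit.
lexicon = {
    ("BECAUSE", "P1"): ["HP10"],
    ("BECAUSE", "P2"): ["HP10", "MD1"],
    ("BECAUSE", "P3"): ["HP10", "MD1", "HP10"],
    ("BETTER",  "P3"): ["MD1", "HP2"],
    ("BETTER",  "P4"): ["MD1", "HP8"],
}

def su_sequence(gloss, variant):
    """Return the SU sequence whose concatenated models represent this sign."""
    return lexicon[(gloss, variant)]

print(su_sequence("BECAUSE", "P3"))  # ['HP10', 'MD1', 'HP10']
```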

5. Segmentation + Dynamic/Static

Automatic Segmentation vs. Manual Segmentation (by LIMSI – CNRS).

Feature Level: Dynamic (D), Static (S). Model Level: Accelerating (A), Uniform (U).

(Figure: acceleration and velocity vs. frames at feature and model level, with segments labeled S, D, A, U; a D/S labeling sketch follows below.)
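A minimal sketch of the feature-level dynamic/static labeling, assuming it amounts to thresholding the hand centroid speed; the threshold value and function names are illustrative and not taken from the poster.

```python
import numpy as np

def dynamic_static_labels(centroids, fps=60.0, speed_thresh=50.0):
    """Label each frame 'D' (dynamic) or 'S' (static) by thresholding speed.

    centroids: (T, 2) hand centroid track; speed_thresh is in pixels/second
    and is an illustrative free parameter.
    """
    vel = np.gradient(centroids, 1.0 / fps, axis=0)
    speed = np.linalg.norm(vel, axis=1)
    return np.where(speed > speed_thresh, "D", "S")

def segmentation_points(labels):
    """Frame indices where the D/S label changes (segment boundaries)."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
```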
