Vassilis Pitsikalis, Stavros Theodorakis and Petros Maragos
School of ECE, National Technical University of Athens 15773, Greece
Data-Driven Sub-Units and Modeling Structure for
Continuous Sign Language Recognition
with Multiple Cues
2. Outline – Contributions
Visual Front-End + Feature Extraction: hands' centroid movement-position measurements (2D): positions, derivatives, velocity, dynamics.
Data-Driven Sub-Unit Construction exploiting Multiple Cues
Model-based sign segmentation + Dynamic vs. Static Labels
Dynamic vs. Static Specific Modeling: Features, Architecture, Parameters.
HMM Recognition framework + Evaluation
1. Sign Language – Motivation
Visual patterns are formed by hand shapes, manual or general body motion, and facial expressions.
A word in spoken language corresponds to a sign.
Position, Movement, Hand Shape, Orientation, Facial.
Sub-Units: different in nature from the phoneme unit of speech; annotated data are lacking.
Focus on automatic, data-driven modeling of sub-units without any linguistic or phonological information.
Goal: continuous Sign Language Recognition.
3. Visual Front-End
Hand and Head Detection
Probabilistic skin color model,
Color force in a Geodesic Active Regions model [2].
Occlusion Handling
Features: Movement-Position of the hands.
Effect of simple features on sub-unit modeling.
Movement-Position: Among the main cues that describe a sign [4, 9].
Fitting ellipses to the hands after segmentation.
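The movement-position measurements listed above (positions, derivatives, velocity) can be sketched in a few lines; the function name and feature layout here are illustrative assumptions, not the authors' implementation.

```python
def movement_position_features(centroids):
    """centroids: list of (x, y) hand-centroid positions, one per frame.
    Returns per-frame feature vectors [x, y, dx, dy, speed], where (dx, dy)
    is the first difference (a discrete velocity) and speed its magnitude."""
    feats = []
    for t, (x, y) in enumerate(centroids):
        if t == 0:
            dx, dy = 0.0, 0.0            # no previous frame: zero velocity
        else:
            px, py = centroids[t - 1]
            dx, dy = x - px, y - py      # first difference ~ velocity
        speed = (dx ** 2 + dy ** 2) ** 0.5
        feats.append([x, y, dx, dy, speed])
    return feats
```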
4. Overview
7. Feature Normalizations
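The specific normalizations are shown in the figure; as a generic illustration, translating by a reference point (e.g. the head centroid) and dividing by a scale factor removes signer position and size variation. Names and the choice of reference are assumptions.

```python
def normalize_positions(points, origin, scale):
    """Translate each (x, y) point by `origin` (e.g. the head centroid)
    and divide by `scale` (e.g. head width) so that features become
    invariant to where the signer stands and how large they appear."""
    ox, oy = origin
    return [((x - ox) / scale, (y - oy) / scale) for x, y in points]
```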
8. Subunit Construction
D/S Modeling Structure: appropriate feature/model architecture for each case.
Feature Normalizations
Intuitive subunits per cue. (Figure: Direction, Scale, and Position panels in X–Y coordinates, with example subunits SU-1 to SU-4.)
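The poster constructs subunits by clustering segments (via HMM-based clustering [12]). As a rough stand-in for that procedure, a plain k-means over fixed-length segment descriptors sketches the grouping step; all names here are illustrative, and k-means is a simplification of the actual method.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means over fixed-length descriptors: a simplified stand-in
    for the HMM-based clustering of segments into data-driven subunits."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            clusters[i].append(v)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster empties out
                centers[i] = tuple(sum(col) / len(cl) for col in zip(*cl))
    labels = [min(range(k),
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
              for v in vectors]
    return labels, centers
```

Each resulting cluster label plays the role of one data-driven subunit identity.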
10. SU-Sequence to Sign Mapping
Apparent signs are mapped to effective signs via a similarity metric.
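A minimal sketch of mapping a decoded ("apparent") SU sequence to the closest "effective" sign: here the similarity metric is taken to be Levenshtein distance over subunit labels, which is an assumption about the metric; the lexicon entries reuse the Section 9 sample.

```python
def edit_distance(a, b):
    """Levenshtein distance between two subunit-label sequences."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        cur = [i]
        for j, sb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (sa != sb)))  # substitution/match
        prev = cur
    return prev[-1]

def closest_sign(decoded, lexicon):
    """Return the gloss whose lexicon SU-sequence is nearest to the
    decoded sequence under the edit-distance similarity metric."""
    return min(lexicon, key=lambda g: edit_distance(decoded, lexicon[g]))
```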
References
[1] B. Bauer and K. F. Kraiss. 2001. Towards an automatic sign language recognition system using subunits. In Proc. of Int'l Gesture Workshop, 2298:64–75.
[2] O. Diamanti and P. Maragos. 2008. Geodesic active regions for segmentation and tracking of human gestures in sign language videos. In Proc. Int'l Conf. Image Processing (ICIP) 2008.
[3] P. Dreuw, C. Neidle, V. Athitsos, S. Sclaroff, and H. Ney. 2008. Benchmark databases for video-based automatic sign language recognition. In Proc. Int'l Conf. on Language Resources and Evaluation (LREC), May.
[4] K. Emmorey. 2002. Language, Cognition, and the Brain: Insights from Sign Language Research. Erlbaum.
[5] G. Fang, X. Gao, W. Gao, and Y. Chen. 2004. A novel approach to automatically extracting basic units from Chinese sign language. In Proc. Int'l Conf. Pattern Recognition 2004.
[6] J. Han, G. Awad, and A. Sutherland. 2009. Modelling and segmenting subunits for sign language recognition based on hand motion analysis. Pat. Rec. Lett., 30(6):623–633.
[7] B. H. Juang and L. R. Rabiner. 1985. A probabilistic distance for hidden Markov models. AT&T Technical Journal.
[8] S. K. Liddell and R. E. Johnson. 1989. American Sign Language: The phonological base. Sign Language Studies, 64:195–277.
[9] S. Ong and S. Ranganath. 2005. Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(6):873–891.
[10] N. Paragios and R. Deriche. 2002. Geodesic active regions: A new framework to deal with frame partition problems in computer vision. Journ. of Vis. Commun. and Image Repres., 13(1/2):249–268.
[11] L. R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286.
[12] P. Smyth. 1997. Clustering sequences with hidden Markov models. In Advances in Neural Information Processing Systems, 9:648–654.
[13] C. Vogler and D. Metaxas. 2003. Handshapes and movements: Multiple-channel American Sign Language recognition. In Proc. of Int'l Gesture Workshop, 247–258.
12. SL Recognition Experiments
Continuous American Sign Language [3]: 843 utterances, 406 words, 4 signers, uniform background. Sign-level transcriptions; English glosses; annotated start/end points.
BU400 HQ, 6 videos, 648×484 frames, 60 fps. Most frequent glosses (94); cross-validation with a 60%/40% train/test split.
For further information
Please contact [sth, vpitsik, maragos]@cs.ntua.gr. More information can be found at http://cvsp.cs.ntua.gr and http://www.dictasign.eu.
Acknowledgments
This research work was supported by the EU under the research program Dictasign with grant FP7-ICT-3-231135. We also wish to thank Boston University and C. Neidle for providing the BU400 database.
6. Segmentation Sample
(Figure: static and dynamic segments delimited by segmentation points.)
11. SU Sequence to Sign Map: Sample
(Plots: sign accuracy % vs. number of clusters in static SUs at K=21, for position and dynamics features; SU accuracy % and discrimination factor.)
Subunit instances (3) shown schematically on superimposed initial and final corresponding frames (D2S2, D1S4).
9. Lexicon Sample
Gloss : SU-Sequence, per feature set (features: Dynamic + Static):

D + P:
  BECAUSE P1: HP10
  BECAUSE P2: HP10 MD1
  BECAUSE P3: HP10 MD1 HP10
  BETTER  P3: MD1 HP2
  BETTER  P4: MD1 HP8

S + P:
  BECAUSE P1: HP10
  BECAUSE P2: HP10 MS1
  BECAUSE P3: HP10 MS1 HP10
  BETTER  P3: MS1 HP2
  BETTER  P4: MS1 HP8

MSPn + P:
  BECAUSE P1: HP10
  BECAUSE P2: HP10 MSPn1
  BECAUSE P3: HP10 MSPn1 HP10
  BETTER  P3: MSPn1 HP2
  BETTER  P4: MSPn1 HP8
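A lexicon like the sample above can be held as a simple mapping from gloss to its SU-sequence pronunciation variants. The structure below is illustrative only, populated with the D + P entries from this section.

```python
# Hypothetical container for a Section 9-style lexicon: each gloss maps to
# a list of SU-sequence pronunciation variants (P1, P2, ...).
lexicon = {
    "BECAUSE": [["HP10"], ["HP10", "MD1"], ["HP10", "MD1", "HP10"]],
    "BETTER":  [["MD1", "HP2"], ["MD1", "HP8"]],
}

def variants(gloss):
    """All SU-sequence pronunciations recorded for a gloss."""
    return lexicon.get(gloss, [])
```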
5. Segmentation + Dynamic/Static
Automatic Segmentation vs. Manual Segmentation (by LIMSI – CNRS).
Feature Level: Dynamic (D) vs. Static (S); Accelerating (A) vs. Uniform (U).
Model Level.
(Plots: velocity and acceleration vs. frames, annotated with S/D and A/U segments.)
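One plausible reading of the feature-level Dynamic/Static labeling is thresholding hand speed per frame and merging runs of equal labels into segments; the sketch below makes that assumption explicit (the threshold and function names are illustrative, not the poster's actual criterion).

```python
def segment_dynamic_static(speeds, threshold):
    """Label each frame Dynamic ('D') if hand speed exceeds `threshold`,
    else Static ('S'), then merge consecutive equal labels into segments.
    Returns a list of (label, start_frame, end_frame_exclusive) tuples."""
    labels = ['D' if s > threshold else 'S' for s in speeds]
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        # close the current run at a label change or at the end of the stream
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((labels[start], start, t))
            start = t
    return segments
```

An accelerating-vs-uniform split at the model level could reuse the same run-merging scheme over acceleration values.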