
BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

Samuel Albanie¹*, Gül Varol¹*, Liliane Momeni¹, Triantafyllos Afouras¹, Joon Son Chung¹,², Neil Fox³, and Andrew Zisserman¹

¹ Visual Geometry Group, University of Oxford, UK
² Naver Corporation, Seoul, South Korea

³ Deafness, Cognition and Language Research Centre, University College London, UK

{albanie,gul,liliane,afourast,joon,az}@robots.ox.ac.uk; [email protected]

Abstract. Recent progress in fine-grained gesture and action classification, and machine translation, points to the possibility of automated sign language recognition becoming a reality. A key stumbling block in making progress towards this goal is a lack of appropriate training data, stemming from the high complexity of sign annotation and a limited supply of qualified annotators. In this work, we introduce a new scalable approach to data collection for sign recognition in continuous videos. We make use of weakly-aligned subtitles for broadcast footage together with a keyword spotting method to automatically localise sign instances for a vocabulary of 1,000 signs in 1,000 hours of video. We make the following contributions: (1) We show how to use mouthing cues from signers to obtain high-quality annotations from video data—the result is the BSL-1K dataset, a collection of British Sign Language (BSL) signs of unprecedented scale; (2) We show that we can use BSL-1K to train strong sign recognition models for co-articulated signs in BSL and that these models additionally form excellent pretraining for other sign languages and benchmarks—we exceed the state of the art on both the MSASL and WLASL benchmarks. Finally, (3) we propose new large-scale evaluation sets for the tasks of sign recognition and sign spotting and provide baselines which we hope will serve to stimulate research in this area.

Keywords: Sign Language Recognition, Visual Keyword Spotting

1 Introduction

With the continual increase in the performance of human action recognition there has been a renewed interest in the challenge of recognising sign languages such as American Sign Language (ASL), British Sign Language (BSL), and Chinese Sign Language (CSL). Although in the past isolated sign recognition has seen some progress, recognition of continuous sign language remains extremely challenging [10].

* Equal contribution


Isolated signs, as in dictionary examples, do not suffer from the naturally occurring complication of co-articulation (i.e. transition motions) between preceding and subsequent signs, making them visually very different from continuous signing. If we are to recognise ASL and BSL performed naturally by signers, then we need to recognise co-articulated signs.

Similar problems were faced by Automatic Speech Recognition (ASR) and the solution, as always, was to learn from very large scale datasets, using a parallel corpus of speech and text. In the vision community, a related path was taken with the modern development of automatic lip reading: first isolated words were recognised [16], and later sentences were recognised [15]—in both cases tied to the release of large datasets. The objective of this paper is to design a scalable method to generate large-scale datasets of continuous signing, for training and testing sign language recognition, and we demonstrate this for BSL. We start from the perhaps counter-intuitive observation that signers often mouth the word they sign simultaneously, as an additional signal [5, 55, 56], performing similar lip movements as for the spoken word. This differs from mouth gestures which are not derived from the spoken language [21]. The mouthing helps disambiguate between different meanings of the same manual sign [62] or in some cases simply provides redundancy. In this way, a sign is not only defined by the hand movements and hand shapes, but also by facial expressions and mouth movements [20].

We harness word mouthings to provide a method of automatically annotating continuous signing. The key idea is to exploit the readily available and abundant supply of sign-language translated TV broadcasts that consist of an overlaid interpreter performing signs and subtitles that correspond to the audio content. The availability of subtitles means that the annotation task is in essence one of alignment between the words in the subtitle and the mouthings of the overlaid signer. Nevertheless, this is a very challenging task: a continuous sign may last for only a fraction (e.g. 0.5) of a second, whilst the subtitles may last for several seconds and are not synchronised with the signs produced by the signer; the word order of the English need not be the same as the word order of the sign language; the sign may not be mouthed; and furthermore, words may not be signed or may be signed in different ways depending on the context. For example, the word “fish” has a different visual sign depending on referring to the animal or the food, introducing additional challenges when associating subtitle words to signs.

To detect the mouthings we use visual keyword spotting—the task of determining whether and when a keyword of interest is uttered by a talking face using only visual information—to address the alignment problem described above. Two factors motivate its use: (1) direct lip reading of arbitrary isolated mouthings is a fundamentally difficult task, but searching for a particular known word within a short temporal window is considerably less challenging; (2) the recent availability of large scale video datasets with aligned audio transcriptions [1, 17] now allows for the training of powerful visual keyword spotting models [33, 53, 64] that, as we show in the experiments, work well for this application.



Table 1. Summary of previous public sign language datasets: The BSL-1K dataset contains, to the best of our knowledge, the largest source of annotated sign data in any dataset. It comprises co-articulated signs outside a lab setting.

Dataset                  Lang    Co-articulated   #signs   #annos (avg. per sign)   #signers   Source
ASLLVD [4]               ASL     ✗                2742     9K (3)                   6          lab
Devisign [14]            CSL     ✗                2000     24K (12)                 8          lab
MSASL [34]               ASL     ✗                1000     25K (25)                 222        lexicons, web
WLASL [40]               ASL     ✗                2000     21K (11)                 119        lexicons, web
S-pot [59]               FinSL   ✓                1211     4K (3)                   5          lab
Purdue RVL-SLLL [61]     ASL     ✓                104      2K (19)                  14         lab
Video-based CSL [32]     CSL     ✓                178      25K (140)                50         lab
SIGNUM [60]              DGS     ✓                455      3K (7)                   25         lab
RWTH-Phoenix [10, 35]    DGS     ✓                1081     65K (60)                 9          TV
BSL Corpus [52]          BSL     ✓                5K       50K (10)                 249        lab
BSL-1K                   BSL     ✓                1064     273K (257)               40         TV

We make the following contributions: (1) we show how to use visual keyword spotting to recognise the mouthing cues from signers to obtain high-quality annotations from video data—the result is the BSL-1K dataset, a large-scale collection of BSL (British Sign Language) signs with a 1K sign vocabulary; (2) We show the value of BSL-1K by using it to train strong sign recognition models for co-articulated signs in BSL and demonstrate that these models additionally form excellent pretraining for other sign languages and benchmarks—we exceed the state of the art on both the MSASL and WLASL benchmarks with this approach; (3) We propose new evaluation datasets for sign recognition and sign spotting and provide baselines for each of these tasks to provide a foundation for future research¹.

2 Related Work

Sign language datasets. We begin by briefly reviewing public benchmarks for studying automatic sign language recognition. Several benchmarks have been proposed for American [4, 34, 40, 61], German [35, 60], Chinese [14, 32], and Finnish [59] sign languages. BSL datasets, on the other hand, are scarce. One exception is the ongoing development of the linguistic corpus [51, 52] which provides fine-grained annotations for the atomic elements of sign production. Whilst its high annotation quality provides an excellent resource for sign linguists, the annotations span only a fraction of the source videos so it is less appropriate for training current state-of-the-art data-hungry computer vision pipelines.

Tab. 1 presents an overview of publicly available datasets, grouped according to their provision of isolated signs or co-articulated signs. Earlier datasets have been limited in the size of their video instances, vocabularies, and signers. Within the isolated sign datasets, Purdue RVL-SLLL [61] has a limited vocabulary of 104 signs (ASL comprises more than 3K signs in total [58]). ASLLVD [4] has only 6 signers.

¹ The project page is at: https://www.robots.ox.ac.uk/~vgg/research/bsl1k/


Recently, the MSASL [34] and WLASL [40] large-vocabulary isolated sign datasets have been released with 1K and 2K signs, respectively. The videos are collected from lexicon databases and other instructional videos on the web.

Due to the difficulty of annotating co-articulated signs in long videos, continuous datasets have been limited in their vocabulary, and most of them have been recorded in lab settings [32, 60, 61]. RWTH-Phoenix [35] is one of the few realistic datasets that supports training complex models based on deep neural networks. A recent extension also allows studying sign language translation [10]. However, the videos in [10, 35] are only from weather broadcasts, restricting the domain of discourse. In summary, the main constraints of the previous datasets are one or more of the following: (i) they are limited in size, (ii) they have a large total vocabulary but only of isolated signs, or (iii) they consist of natural co-articulated signs but cover a limited domain of discourse. The BSL-1K dataset provides a considerably greater number of annotations than all previous public sign language datasets, and it does so in the co-articulated setting for a large domain of discourse.

Sign language recognition. Early work on sign language recognition focused on hand-crafted features computed for hand shape and motion [24, 25, 54, 57]. Upper body and hand pose have then been widely used as part of the recognition pipelines [7, 9, 19, 48, 50]. Non-manual features such as face [24, 35, 47] and mouth [3, 36, 38] shapes are relatively less considered. For sequence modelling of signs, HMMs [2, 23, 27, 54], and more recently LSTMs [9, 32, 65, 67], have been utilised. Koller et al. [39] present a hybrid approach based on CNN-RNN-HMM to iteratively re-align sign language videos to the sequence of sign annotations. More recently 3D CNNs have been adopted due to their representation capacity for spatio-temporal data [6, 8, 31, 34, 40]. Two recent concurrent works [34, 40] showed that I3D models [13] significantly outperform their pose-based counterparts. In this paper, we confirm the success of I3D models, while also showing improvements using pose distillation as pretraining. There have been efforts to use sequence-to-sequence translation models for sign language translation [10], though this has been limited to the weather discourse of RWTH-Phoenix, and the method is limited by the size of the training set. The recent work of [41] localises signs in continuous news footage to improve an isolated sign classifier.

In this work, we utilise mouthings to localise signs in weakly-supervised videos. Previous work [7, 17, 18, 50] has used weakly aligned subtitles as a source of training data, and both one-shot [50] (from a visual dictionary) and zero-shot [6] (from a textual description) learning have also been used, though no previous work, to our knowledge, has put these ideas together. The sign spotting problem was formulated in [22, 59].

Using the mouth patterns. The mouth has several roles in sign language that can be grouped into spoken components (mouthings) and oral components (mouth gestures) [62]. Several works focus on recognising mouth shapes [3, 38] to recover mouth gestures. Few works [36, 37] attempt to recognise mouthings in sign language data by focusing on a few categories of visemes, i.e., visual correspondences of phonemes in the lip region [26]. Most closely related to our work, [49] similarly searches subtitles of broadcast footage and uses the mouth as a cue to improve alignment between the subtitles and the signing.



Fig. 1. Keyword-driven sign annotation: (Left, the annotation pipeline): Stage 1: for a given target sign (e.g. “happy”) each occurrence of the word in the subtitles provides a candidate temporal window when the sign may occur (this is further padded by several seconds on either side to account for misalignment of subtitles and signs); Stage 2: a keyword spotter uses the mouthing of the signer to perform precise localisation of the sign within this window. (Right): Examples from the BSL-1K dataset—produced by applying keyword spotting for a vocabulary of 1K words.

Two key differences between our work and theirs are: (1) we achieve precise localisation through keyword spotting, whereas they only use an open/closed mouth classifier to reduce the number of candidates for a given sign; (2) scale—we gather signs over 1,000 hours of signing (in contrast to the 30 hours considered in [49]).

3 Learning Sign Recognition with Automatic Labels

In this section, we describe the process used to collect BSL-1K, a large-scale dataset of BSL signs. An overview of the approach is provided in Fig. 1. In Sec. 3.1, we describe how large numbers of video clips that are likely to contain a given sign are sourced from public broadcast footage using subtitles; in Sec. 3.2, we show how automatic keyword spotting can be used to precisely localise specific signs to within a fraction of a second; in Sec. 3.3, we apply this technique to efficiently annotate a large-scale dataset with a vocabulary of 1K signs.

3.1 Finding probable signing windows in public broadcast footage

The source material for the dataset comprises 1,412 episodes of publicly broadcast TV programs produced by the BBC, which together contain 1,060 hours of continuous BSL signing. The episodes cover a wide range of topics: medical dramas, history and nature documentaries, cooking shows and programs covering gardening, business and travel. The signing represents a translation (rather than a transcription) of the content and is produced by a total of forty professional BSL interpreters.


The signer occupies a fixed region of the screen and is cropped directly from the footage. A full list of the TV shows that form BSL-1K can be found in Appendix C.2. In addition to videos, these episodes are accompanied by subtitles (numbering approximately 9.5 million words in total). To locate temporal windows in which instances of signs are likely to occur within the source footage, we first identify a candidate list of words that: (i) are present in the subtitles; (ii) have entries in both BSL SignBank² and Sign BSL³, two online dictionaries of isolated signs (to ensure that we query words that have valid mappings to signs). The result is an initial vocabulary of 1,350 words, which are used as queries for the keyword spotting model to perform sign localisation—this process is described next.
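To make the filtering step concrete, the following sketch (not the authors' code) intersects the set of subtitle words with the word lists of the two dictionaries; the input arguments (paths to subtitle files and the two dictionary word lists) are hypothetical stand-ins for however the subtitles and dictionaries are actually stored.

```python
# Sketch of the vocabulary selection step (not the authors' code). The inputs
# (paths to subtitle files and the two dictionary word lists) are hypothetical.
def build_initial_vocabulary(subtitle_files, signbank_words, signbsl_words):
    """Return words that appear in the subtitles and in both sign dictionaries."""
    subtitle_words = set()
    for path in subtitle_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                subtitle_words.update(w.strip(".,?!").lower() for w in line.split())
    # keep only words with a valid mapping to a sign in both online dictionaries
    return sorted(subtitle_words & set(signbank_words) & set(signbsl_words))
```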

3.2 Precise sign localisation through visual keyword spotting

By searching the content of the subtitle tracks for instances of words in the initial vocabulary, we obtain a set of candidate temporal windows in which instances of signs may occur. However, two factors render these temporal proposals extremely noisy: (1) the presence of a word in the subtitles does not guarantee its presence in the signing; (2) even for subtitled words that are signed, we find through inspection that their appearance in the subtitles can be misaligned with the sign itself by several seconds.

To address this challenge, we turn to visual keyword spotting. Our goal is to detect and precisely localise the presence of a sign by identifying its “spoken components” [56] within a temporal sequence of mouthing patterns. Two hypotheses underpin this approach: (a) that mouthing provides a strong localisation signal for signs as they are produced; (b) that this mouthing occurs with sufficient frequency to form a useful localisation cue. Our method is motivated by studies in the Sign Linguistics literature which find that spoken components frequently serve to identify signs—this occurs most prominently when the mouth pattern is used to distinguish between manual homonyms⁴ (see [56] for a detailed discussion). However, even if these hypotheses hold, the task remains extremely challenging—signers typically do not mouth continuously and the mouthings that are produced may only correspond to a portion of the word [56]. For this reason, existing lip reading approaches cannot be used directly (indeed, an initial exploratory experiment we conducted with the state-of-the-art lip reading model of [1] achieved zero recall on five hundred randomly sampled sentences of signer mouthings from the BBC source footage).

The key to the effectiveness of visual keyword spotting is that rather than solving the general problem of lip reading, it solves the much easier problem of identifying a single token from a small collection of candidates within a short temporal window. In this work, we use the subtitles to construct such windows.

² https://bslsignbank.ucl.ac.uk/
³ https://www.signbsl.com/
⁴ These are signs that use identical hand movements (e.g. “king” and “queen”) whose meanings are distinguished by mouthings.


Example signs with instance counts: Lovely (603), Meeting (230), Success (119), Barbecue (57), Culture (35), Strict (19), Compass (8).

Fig. 2. BSL-1K sign frequencies: Log-histogram of instance counts for the 1,064 words constituting the BSL-1K vocabulary, together with example signs. The long-tail distribution reflects the real setting in which some signs are more frequent than others.

The pipeline for automatic sign annotations therefore consists of two stages (Fig. 1, left): (1) For a given target sign, e.g. “happy”, determine the times of all occurrences of this sign in the subtitles accompanying the video footage. The subtitle time provides a short window during which the word was spoken, but not necessarily when its corresponding sign is produced in the translation. We therefore extend this candidate window by several seconds to increase the likelihood that the sign is present in the sequence. We include ablations to assess the influence of this padding process in Sec. 5 and determine empirically that padding by four seconds on each side of the subtitle represents a good choice. (2) The resulting temporal window is then provided, together with the target word, to a keyword spotting model (described in detail in Sec. 4.1) which estimates the probability that the sign was mouthed at each time step (we apply the keyword spotter with a stride of 0.04 seconds—this choice is motivated by the fact that the source footage has a frame rate of 25fps). When the keyword spotter asserts with high confidence that it has located a sign, we take the location of the peak posterior probability as an anchoring point for one endpoint of a 0.6 second window (this value was determined by visual inspection to be sufficient for capturing individual signs). The peak probability is then converted into a decision about whether a sign is present using a threshold parameter. To build the BSL-1K dataset, we select a value of 0.5 for this parameter after conducting experiments (reported in Tab. 3) to assess its influence on the downstream task of sign recognition performance.
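The sketch below illustrates Stage 2 under the assumptions stated above: the keyword spotter has already been applied over the padded search window with a stride of 0.04 seconds and its per-step posteriors are available as an array. It is a minimal illustration of the peak-and-threshold logic, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of Stage 2 (not the authors' implementation): convert the
# keyword spotter's per-step posteriors over the padded search window into a
# localised sign annotation. Assumes one posterior value every 0.04 seconds.
STRIDE = 0.04        # keyword spotter stride, matching the 25 fps source footage
SIGN_DURATION = 0.6  # window length judged sufficient to capture a single sign
THRESHOLD = 0.5      # mouthing confidence threshold used to build BSL-1K

def localise_sign(posteriors, window_start_time, threshold=THRESHOLD):
    """Return (start, end, confidence) for the detected sign, or None."""
    posteriors = np.asarray(posteriors)
    peak_idx = int(posteriors.argmax())
    confidence = float(posteriors[peak_idx])
    if confidence < threshold:
        return None  # the keyword is judged absent from this search window
    # anchor one endpoint of a 0.6 second window at the peak posterior
    anchor = window_start_time + peak_idx * STRIDE
    return anchor, anchor + SIGN_DURATION, confidence
```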

3.3 BSL-1K dataset construction and validation

Following the sign localisation process described above, we obtain approximately 280k localised signs from a set of 2.4 million candidate subtitles. To ensure that the dataset supports study of signer-independent sign recognition, we then compute face embeddings (using an SENet-50 [30] architecture trained for verification on the VGGFace2 dataset [11]) to group the episodes according to which of the forty signers they were translated by. We partition the data into three splits, assigning thirty-two signers for training, four signers for validation, and four signers for testing.
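The grouping step can be realised in several ways; the sketch below shows one plausible variant (not necessarily the authors' exact procedure) that averages per-frame face embeddings for each episode, clusters the episodes into forty signer identities, and assigns whole identities to disjoint splits.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# One plausible realisation of the signer grouping step (not necessarily the
# authors' exact procedure): average the per-frame face embeddings of each
# episode, cluster episodes into forty signer identities, and assign whole
# identities to disjoint train/val/test splits.
def signer_disjoint_splits(episode_embeddings, n_signers=40, n_val=4, n_test=4, seed=0):
    """episode_embeddings: dict mapping episode id -> (num_frames, dim) array."""
    episodes = sorted(episode_embeddings)
    feats = np.stack([episode_embeddings[e].mean(axis=0) for e in episodes])
    signer_ids = AgglomerativeClustering(n_clusters=n_signers).fit_predict(feats)
    order = np.random.default_rng(seed).permutation(n_signers)
    val_ids, test_ids = set(order[:n_val]), set(order[n_val:n_val + n_test])
    splits = {"train": [], "val": [], "test": []}
    for episode, signer in zip(episodes, signer_ids):
        split = "val" if signer in val_ids else "test" if signer in test_ids else "train"
        splits[split].append(episode)
    return splits
```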


Table 2. Statistics of the proposed BSL-1K dataset: The Test-(manually verified) split represents a sample from the Test-(automatic) split annotations that have been verified by human annotators (see Sec. 3.3 for details).

Set                        Sign vocabulary   Sign annotations   Number of signers
Train                      1,064             173K               32
Val                        1,049             36K                4
Test-(automatic)           1,059             63K                4
Test-(manually verified)   334               2,103              4

We further sought to include an equal number of hearing and non-hearing signers (the validation and test sets both contain an equal number of each; the training set is approximately balanced with 13 hearing, 17 non-hearing and 2 signers whose deafness is unknown). We then perform a further filtering step on the vocabulary to ensure that each word included in the dataset is represented with high confidence (at least one instance with confidence 0.8) in the training partition, which produces a final dataset vocabulary of 1,064 words (see Fig. 2 for the distribution and Appendix C.3 for the full word list).

Validating the automatic annotation pipeline. One of the key hypotheses underpinning this work is that keyword spotting is capable of correctly locating signs. We first verify this hypothesis by presenting a randomly sampled subset of the test partition to a native BSL signer, who was asked to assess whether the short temporal windows produced by the keyword spotting model with high confidence (each 0.6 seconds in duration) contained correct instances of the target sign. A screenshot of the annotation tool developed for this task is provided in Fig. A.4. A total of 1k signs were included in this initial assessment, of which 70% were marked as correct, 28% were marked as incorrect and 2% were marked as uncertain, validating the key idea behind the annotation pipeline. Possible reasons for incorrect marks include that BSL mouthing patterns are not always identical to spoken English and that mouthings often represent only part of the word (e.g., “fsh” for “finish”) [56].

Constructing a manually verified test set. To construct a high quality, human verified test set and to maximise yield from the annotators, we started from a collection of sign predictions where the keyword model was highly confident (assigning a peak probability of greater than 0.9), yielding 5,826 sign predictions. Then, in addition to the validated 980 signs (corrections were provided as labels for the signs marked as incorrect and uncertain signs were removed), we further expanded the verified test set with non-native (BSL level 2 or above) signers who annotated a further 2k signs. We found that signers with lower levels of fluency were able to confidently assert that a sign was correct for a portion of the signs (at a rate of around 60%), but also annotated a large number of signs as “unsure”, making it challenging to use these annotations as part of the validation of the effectiveness of the pipeline. Only signs marked as correct were included in the final verified test set, which ultimately comprised 2,103 annotations covering 334 signs from the 1,064 sign vocabulary. The statistics of each partition of the dataset are provided in Tab. 2.


All experimental test set results in this paper refer to performance on the verified test set (but we retain the full automatic test set, which we found to be useful for development).

In addition to the keyword spotting approach described above, we explore techniques for further dataset expansion based on other cues in Appendix A.7.

4 Models and Implementation Details

In this section, we first describe the visual keyword spotting model used to collect signs from mouthings (Sec. 4.1). Next, we provide details of the model architecture for sign recognition and spotting (Sec. 4.2). Lastly, we describe a method for obtaining a good initialisation for the sign recognition model (Sec. 4.3).

4.1 Visual keyword spotting model

We use the improved visual-only keyword spotting model of Stafylakis et al. [53] from [46] (referred to in their paper as the “P2G [53] baseline”), provided by the authors. The model of [53] combines visual features with a fixed-length keyword embedding to determine whether a user-defined keyword is present in an input video clip. The performance of [53] is improved in [46] by switching the keyword encoder-decoder from grapheme-to-phoneme (G2P) to phoneme-to-grapheme (P2G).

In more detail, the model consists of four stages: (i) visual features are first extracted from the sequence of face-cropped image frames from a clip (this is performed using a 512×512 SSD architecture [43] trained for face detection on WIDER faces [63]), (ii) a fixed-length keyword representation is built using a P2G encoder-decoder, (iii) the visual and keyword embeddings are concatenated and passed through BiLSTMs, (iv) finally, a sigmoid activation is applied on the output to approximate the posterior probability that the keyword occurs in the video clip for each input frame. If the maximum posterior over all frames is greater than a threshold, the clip is predicted to contain the keyword. The predicted location of the keyword is the position of the maximum posterior. Finally, non-maximum suppression is run with a temporal window of 0.6 seconds over the untrimmed source videos to remove duplicates.
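The final suppression step can be illustrated in a few lines; the sketch below is a generic greedy temporal NMS over (time, posterior) detections that keeps the highest-scoring detection and discards any other detection within 0.6 seconds of one already kept. It is an illustration rather than the authors' implementation.

```python
# Generic greedy temporal non-maximum suppression (a sketch, not the authors'
# code): keep the highest-scoring keyword detections and drop any other
# detection within 0.6 seconds of one that has already been kept.
def temporal_nms(detections, window=0.6):
    """detections: list of (time_in_seconds, posterior); returns the kept subset."""
    kept = []
    for time, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(abs(time - kept_time) >= window for kept_time, _ in kept):
            kept.append((time, score))
    return sorted(kept)
```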

4.2 Sign recognition model

We employ a spatio-temporal convolutional neural network architecture that takes a multiple-frame video as input and outputs class probabilities over sign categories. Specifically, we follow the I3D architecture [13] due to its success on action recognition benchmarks, as well as its recently observed success on sign recognition datasets [34, 40]. To retain computational efficiency, we only use an RGB stream. The model is trained on 16 consecutive frames (i.e., 0.64 sec for 25fps), as [7, 49, 59] observed that co-articulated signs last roughly for 13 frames. We resize our videos to have a spatial resolution of 224×224.


For training, we randomly subsample a fixed-size, temporally contiguous input from the spatio-temporal volume to have 16×224×224 resolution in terms of number of frames, width, and height, respectively. We minimise the cross-entropy loss using SGD with momentum (0.9) with mini-batches of size 4, and an initial learning rate of 10⁻² with a fixed schedule. The learning rate is decreased twice with a factor of 10⁻¹ at epochs 20 and 40. We train for 50 epochs. Colour, scale, and horizontal flip augmentations are applied on the input video. When pretraining is used (e.g. on Kinetics-400 [13] or on other data where specified), we replace the last linear layer with the dimensionality of our classes, and fine-tune all network parameters (we observed that freezing part of the model is suboptimal). Finally, we apply dropout on the classification layer with a probability of 0.5.
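A sketch of this training configuration in PyTorch is given below; the `logits` attribute used for the classification head is illustrative (I3D implementations name this layer differently), and the rest simply mirrors the hyper-parameters listed above.

```python
import torch

# Sketch of the training configuration described above. The `logits` attribute
# is illustrative: adapt it to the final-layer name of the I3D model in use.
def configure_finetuning(model, num_classes=1064, feature_dim=1024):
    # replace the final linear layer with one sized to the BSL-1K vocabulary and
    # apply dropout (p=0.5) on the classification layer; all parameters are
    # fine-tuned (no freezing)
    model.logits = torch.nn.Sequential(
        torch.nn.Dropout(p=0.5),
        torch.nn.Linear(feature_dim, num_classes),
    )
    optimiser = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    # decay the learning rate by a factor of 10 at epochs 20 and 40 (50 epochs total)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimiser, milestones=[20, 40], gamma=0.1)
    criterion = torch.nn.CrossEntropyLoss()
    return optimiser, scheduler, criterion
```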

At test time, we perform centre-cropping and apply a sliding window with a stride of 8 frames before averaging the classification scores to obtain a video-level prediction.
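A minimal sketch of this inference procedure, assuming `model` maps a batch of 16-frame clips to class scores and `video` is a centre-cropped (C, T, H, W) tensor:

```python
import torch

# Minimal sliding-window inference sketch: 16-frame windows, stride 8, scores
# averaged into one video-level prediction. `model` is assumed to map a batch
# of clips of shape (N, C, 16, H, W) to class scores of shape (N, num_classes).
@torch.no_grad()
def video_level_prediction(model, video, clip_len=16, stride=8):
    c, t, h, w = video.shape
    starts = range(0, max(t - clip_len, 0) + 1, stride)
    clips = torch.stack([video[:, s:s + clip_len] for s in starts])
    scores = model(clips).softmax(dim=-1)
    return scores.mean(dim=0)  # average the per-window class scores
```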

4.3 Video pose distillation

Given the significant focus on pose estimation in the sign language recognition literature, we investigate how explicit pose modelling can be used to improve the I3D model. To this end, we define a pose distillation network that takes in a sequence of 16 consecutive frames, but rather than predicting sign categories, the 1024-dimensional (following average pooling) embedding produced by the network is used to regress the poses of individuals appearing in each of the frames of its input. In more detail, we assume a single individual per frame (as is the case in cropped sign translation footage) and task the network with predicting 130 human pose keypoints (18 body, 21 per hand, and 70 facial) produced by an OpenPose [12] model (trained on COCO [42]) that is evaluated per frame. The key idea is that, in order to effectively predict pose across multiple frames from a single video embedding, the model is encouraged to encode information not only about pose, but also descriptions of relevant dynamic gestures. The model is trained on a portion of the BSL-1K training set (due to space constraints, further details of the model architecture and training procedure are provided in Appendix B).
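The exact head architecture and loss are described in Appendix B; the sketch below only illustrates the shapes involved: a single 1024-dimensional clip embedding is mapped to 130 keypoint coordinates for each of the 16 frames, supervised by per-frame OpenPose outputs, with a confidence-weighted regression loss as one plausible choice.

```python
import torch

# Shape-level sketch of the pose distillation idea (the exact architecture and
# loss are given in Appendix B): regress 130 keypoints for each of the 16 input
# frames from the single 1024-d clip embedding, supervised by per-frame
# OpenPose outputs; the confidence-weighted loss is one plausible choice.
class PoseDistillationHead(torch.nn.Module):
    def __init__(self, feature_dim=1024, num_frames=16, num_keypoints=130):
        super().__init__()
        self.num_frames, self.num_keypoints = num_frames, num_keypoints
        self.regressor = torch.nn.Linear(feature_dim, num_frames * num_keypoints * 2)

    def forward(self, clip_embedding):        # (B, 1024) pooled I3D embedding
        out = self.regressor(clip_embedding)  # (B, 16 * 130 * 2)
        return out.view(-1, self.num_frames, self.num_keypoints, 2)

def distillation_loss(predicted_xy, openpose_xy, openpose_conf):
    # weight each keypoint's squared error by the OpenPose confidence so that
    # unreliable pseudo-labels contribute less to the regression target
    err = ((predicted_xy - openpose_xy) ** 2).sum(dim=-1)  # (B, 16, 130)
    return (openpose_conf * err).mean()
```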

5 Experiments

We first provide several ablations on our sign recognition model to answer questions such as which cues are important, and how to best use human pose. Then, we present baseline results for sign recognition and sign spotting with our best model. Finally, we compare to the state of the art on ASL benchmarks to illustrate the benefits of pretraining on our data.

5.1 Ablations for the sign recognition model

In this section, we evaluate our sign language recognition approach and investigate (i) the effect of the mouthing score threshold, (ii) the comparison to pose-based approaches, (iii) the contribution of multi-modal cues, and (iv) the video pose distillation.


Table 3. Trade-off between training noise vs. size: Training (with Kinetics initialisation) on the full training set BSL-1K_m.5 versus the subset BSL-1K_m.8, which correspond to a mouthing score threshold of 0.5 and 0.8, respectively. Even when noisy, with the 0.5 threshold, mouthings provide automatic annotations that allow supervised training at scale, resulting in 70.61% accuracy on the manually validated test set.

Training data                  #videos   per-instance (top-1 / top-5)   per-class (top-1 / top-5)
BSL-1K_m.8 (mouthing ≥ 0.8)    39K       69.00 / 83.79                  45.86 / 64.42
BSL-1K_m.5 (mouthing ≥ 0.5)    173K      70.61 / 85.26                  47.47 / 68.13

Additional ablations on the influence of the search window size for the keyword spotting (Appendix A.2) and the temporal extent of the automatic annotations (Appendix A.3) can be found in the appendix.

Evaluation metrics. Following [34, 40], we report both top-1 and top-5 classification accuracy, mainly due to ambiguities in signs which can be resolved in context. Furthermore, we adopt both per-instance and per-class accuracy metrics. Per-instance accuracy is computed over all test instances. Per-class accuracy refers to the average over the sign categories present in the test set. We use this metric due to the unbalanced nature of the datasets.
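Both metrics can be computed from the matrix of class scores; the sketch below is a straightforward implementation of the definitions above (per-instance averages over all clips, per-class averages the accuracies of the individual sign categories).

```python
import numpy as np

# Straightforward implementation of the two metrics: per-instance accuracy
# averages over all test clips, per-class accuracy averages over the sign
# categories present in the test set (so rare signs count equally).
def topk_accuracies(scores, labels, k=1):
    """scores: (N, num_classes) array of class scores; labels: (N,) ground truth."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    per_instance = hits.mean()
    per_class = np.mean([hits[labels == c].mean() for c in np.unique(labels)])
    return per_instance, per_class
```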

The effect of the mouthing score threshold. The keyword spotting method, being a binary classification model, provides a confidence score, which we threshold to obtain our automatically annotated video clips. Reducing this threshold yields an increased number of sign instances at the cost of a potentially noisier set of annotations. We denote the training set defined by a mouthing threshold of 0.8 as BSL-1K_m.8. In Tab. 3, we show the effect of changing this hyper-parameter between a low- and high-confidence model with 0.5 and 0.8 mouthing thresholds, respectively. The larger set of training samples obtained with a threshold of 0.5 provides the best performance. For the remaining ablations, we use the smaller BSL-1K_m.8 training set for faster iterations, and return to the larger BSL-1K_m.5 set for training the final model.

Pose-based model versus I3D. We next verify that I3D is a suitable model for sign language recognition by comparing it to a pose-based approach. We implement Pose→Sign, which follows a 2D ResNet architecture [29] that operates on 3×16×P dimensional dynamic pose images, where P is the number of keypoints. In our experiments, we use OpenPose [12] (pretrained on COCO [42]) to extract 18 body, 21 per hand, and 70 facial keypoints. We use 16-frame inputs to make it comparable to the I3D counterpart. We concatenate the estimated normalised xy coordinates of a keypoint with its confidence score to obtain the 3 channels. In Tab. 4, we see that I3D significantly outperforms the explicit 2D pose-based method (65.57% vs 49.66% per-instance accuracy). This conclusion is in accordance with the recent findings of [34, 40].
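The input encoding can be summarised in a few lines; the sketch below builds the 3×16×P dynamic pose image from OpenPose outputs as described above (normalised x, y coordinates plus the detector confidence as the three channels), with the frame size as an assumed extra input used only for normalisation.

```python
import numpy as np

# Sketch of the Pose→Sign input encoding: a 3 x 16 x P "dynamic pose image"
# whose channels are the normalised x and y keypoint coordinates and the
# detector confidence. The frame size argument is an assumed extra input used
# only to normalise the coordinates.
def dynamic_pose_image(keypoints_xy, confidences, frame_size):
    """keypoints_xy: (16, P, 2) pixel coordinates; confidences: (16, P)."""
    xy = keypoints_xy / np.asarray(frame_size, dtype=np.float32)   # normalise to [0, 1]
    image = np.concatenate([xy, confidences[..., None]], axis=-1)  # (16, P, 3)
    return image.transpose(2, 0, 1)                                # (3, 16, P)
```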

Contribution of individual cues. We carry out two sets of experiments to determine how much our sign recognition model relies on signals from the mouth and face region versus the manual features from the body and hands: (i) using Pose→Sign, which takes as input the 2D keypoint locations over several frames, (ii) using I3D, which takes as input raw video frames.


Table 4. Contribution of individual cues: We compare I3D (pretrained on Kinetics) with a keypoint-based baseline, both trained and evaluated on a subset of BSL-1K_m.8 where we have the pose estimates. We also quantify the contribution of the body&hands and the face regions. We see that significant information can be attributed to both types of cues, and the combination performs the best.

Model                    body&hands   face   per-instance (top-1 / top-5)   per-class (top-1 / top-5)
Pose→Sign (70 points)    ✗            ✓      24.41 / 47.59                  9.74 / 25.99
Pose→Sign (60 points)    ✓            ✗      40.47 / 59.45                  20.24 / 39.27
Pose→Sign (130 points)   ✓            ✓      49.66 / 68.02                  29.91 / 49.21
I3D (face-crop)          ✗            ✓      42.23 / 69.70                  21.66 / 50.51
I3D (mouth-masked)       ✓            ✗      46.75 / 66.34                  25.85 / 48.02
I3D (full-frame)         ✓            ✓      65.57 / 81.33                  44.90 / 64.91

Table 5. Effect of pretraining the I3D model on various tasks before fine-tuning for sign recognition on BSL-1K_m.8. Our dynamic pose features learned on 16-frame videos provide body-motion-aware cues and outperform other pretraining strategies.

Pretraining task          Pretraining data   per-instance (top-1 / top-5)   per-class (top-1 / top-5)
Random init.              -                  39.80 / 61.01                  15.76 / 29.87
Gesture recognition       Jester [45]        46.93 / 65.95                  19.59 / 36.44
Sign recognition          WLASL [40]         69.90 / 83.45                  44.97 / 62.73
Action recognition        Kinetics [13]      69.00 / 83.79                  45.86 / 64.42
Video pose distillation   Signers            70.38 / 84.50                  46.24 / 65.31

For the pose-based model, we train with only 70 facial keypoints, 60 body&hand keypoints, or with the combination. For I3D, we use the pose estimations to mask the pixels outside of the face bounding box, to mask the mouth region, or use all the pixels from the videos. The results are summarised in Tab. 4. We observe that using only the face provides a strong baseline, suggesting that mouthing is a strong cue for recognising signs, e.g., 42.23% for I3D. However, using all the cues, including body and hands (65.57%), significantly outperforms using individual modalities.

Pretraining for sign recognition. Next we investigate different forms of pretraining for the I3D model. In Tab. 5, we compare the performance of a model trained with random initialisation (39.80%), fine-tuning from gesture recognition (46.93%), sign recognition (69.90%), and action recognition (69.00%). Video pose distillation provides a small boost over the other pretraining strategies (70.38%), suggesting that it is an effective way to force the I3D model to pay attention to the dynamics of the human keypoints, which is relevant for sign recognition.

5.2 Benchmarking sign recognition and sign spotting

Next, we combine the parameter choices suggested by each of our ablations to establish baseline performances on the BSL-1K dataset for two tasks: (i) sign recognition, (ii) sign spotting. Specifically, the model comprises an I3D architecture trained on BSL-1K_m.5 with pose distillation as initialisation and random temporal offsets of up to 4 frames around the sign during training (the ablation studies for this temporal augmentation parameter are included in Appendix A.3).


Figure 3 examples: correct predictions include “orange”, “fantastic”, “competition”, and “before” (each with confidence 1.00); failure cases include “sausage” predicted for “chips” (0.90), “happy” for “happen” (0.66), “west” for “wednesday” (0.84), and “three” for “tree” (0.65).

Fig. 3. Qualitative analysis: We present results of our sign recognition model on BSL-1K for success (top) and failure (bottom) cases, together with their confidence scores in parentheses. To the right of each example, we show a random training sample for the predicted sign (in small). We observe that failure modes are commonly due to high visual similarity in the gesture (bottom-left) and mouthing (bottom-right).

Table 6. Benchmarking: We benchmark our best sign recognition model (trained on BSL-1K_m.5, initialised with pose distillation, with 4-frame temporal offsets) on the sign recognition and sign spotting tasks to establish strong baselines on BSL-1K.

Task               Metric                         Result
Sign recognition   per-instance (top-1 / top-5)   75.51 / 88.83
Sign recognition   per-class (top-1 / top-5)      52.76 / 72.14
Sign spotting      mAP (334 sign classes)         0.159

The sign recognition evaluation protocol follows the experiments conducted in the ablations; the sign spotting protocol is described next.

Sign spotting. Differently from sign recognition, in which the objective is to classify a pre-defined temporal segment into a category from a given vocabulary, sign spotting aims to locate all instances of a particular sign within long sequences of untrimmed footage, enabling applications such as content-based search and efficient indexing of signing videos for which subtitles are not available. The evaluation protocol for assessing sign spotting on BSL-1K is defined as follows: for each sign category present amongst the human-verified test set annotations (334 in total), 0.6-second windows centred on each verified instance are marked as positive and all other times within the subset of episodes that contain at least one instance of the sign are marked as negative. To avoid false penalties at signs that were not discovered by the automatic annotation process, we exclude windows of 8 seconds of footage centred at each location in the original footage at which the target keyword appears in the subtitles but was not detected by the visual keyword spotting pipeline. In aggregate this corresponds to locating approximately one positive instance of a sign in every 1.5 hours of continuous signing negatives. A sign is considered to have been correctly spotted if its temporal overlap with the model prediction exceeds an IoU (intersection-over-union) of 0.5, and we report mean Average Precision (mAP) over the 334 sign classes as the metric for performance.
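The sketch below illustrates this evaluation for a single sign class under the stated protocol: greedy matching of ranked predictions to 0.6-second ground-truth windows at IoU 0.5, followed by an average-precision computation; mAP is then the mean over the 334 classes. It is a simplified illustration, not the exact evaluation code.

```python
import numpy as np

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) windows in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(predictions, gt_windows, iou_thr=0.5):
    """predictions: list of (score, (start, end)); gt_windows: list of (start, end)."""
    matched, true_positives = set(), []
    for score, window in sorted(predictions, key=lambda p: -p[0]):
        candidates = [(temporal_iou(window, gt), i)
                      for i, gt in enumerate(gt_windows) if i not in matched]
        best_iou, best_idx = max(candidates, default=(0.0, None))
        if best_iou >= iou_thr:
            matched.add(best_idx)
            true_positives.append(1)
        else:
            true_positives.append(0)
    if not gt_windows or not true_positives:
        return 0.0
    tp = np.array(true_positives)
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    # AP: mean of the precision values at each correctly recovered ground truth
    return float(np.sum(precision * tp) / len(gt_windows))
```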


Table 7. Transfer to ASL: Performance on American Sign Language (ASL) datasets with and without pretraining on our data. I3D results are reported from the original papers for MSASL [34] and WLASL [40]. I3D† denotes our implementation and training, adopting the hyper-parameters from [34]. We show that our features provide good initialisation, even though they are trained on BSL.

WLASL [40]:
Model      Pretraining   per-instance (top-1 / top-5)   per-class (top-1 / top-5)
I3D [40]   Kinetics      32.48 / 57.31                  - / -
I3D†       Kinetics      40.85 / 74.10                  39.06 / 73.33
I3D        BSL-1K        46.82 / 79.36                  44.72 / 78.47

MSASL [34]:
Model      Pretraining   per-instance (top-1 / top-5)   per-class (top-1 / top-5)
I3D [34]   Kinetics      - / -                          57.69 / 81.08
I3D†       Kinetics      60.45 / 82.05                  57.17 / 80.02
I3D        BSL-1K        64.71 / 85.59                  61.55 / 84.43

We report the performance of our strongest model for both the sign recognition and sign spotting benchmarks in Tab. 6. In Fig. 3, we provide some qualitative results from our sign recognition method and observe some modes of failure which are driven by strong visual similarity in sign production.

5.3 Comparison with the state of the art on ASL benchmarks

BSL-1K, being significantly larger than the recent WLASL [40] and MSASL [34] benchmarks, can be used for pretraining I3D models to provide strong initialisation for other datasets. Here, we transfer the features from BSL to ASL, which are two distinct sign languages.

As models from [34] were not publicly available, we first reproduce the I3D Kinetics pretraining baseline with our implementation to achieve fair comparisons. We use 64-frame inputs as isolated signs in these datasets are significantly slower than co-articulated signs. We then train I3D from BSL-1K pretrained features. Tab. 7 compares pretraining on Kinetics versus our BSL-1K data. BSL-1K provides a significant boost in performance, outperforming the state-of-the-art results (46.82% and 64.71% top-1 accuracy). Additional details, as well as similar experiments on co-articulated datasets, can be found in Appendix A.6.

6 Conclusion

We have demonstrated the advantages of using visual keyword spotting to automatically annotate continuous sign language videos with weakly-aligned subtitles. We have presented BSL-1K, a large-scale dataset of co-articulated signs that, coupled with 3D CNN training, allows high-performance recognition of signs from a large vocabulary. Our model has further proven beneficial as initialisation for ASL benchmarks. Finally, we have provided ablations and baselines for sign recognition and sign spotting tasks. A potential future direction is leveraging our automatic annotations and recognition model for sign language translation.

Acknowledgements. This work was supported by EPSRC grant ExTol. We also thank T. Stafylakis, A. Brown, A. Dutta, L. Dunbar, A. Thandavan, C. Camgoz, O. Koller, H. V. Joze, O. Kopuklu for their help.


Bibliography

[1] Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
[2] Agris, U., Zieren, J., Canzler, U., Bauer, B., Kraiss, K.F.: Recent developments in visual sign language recognition. Universal Access in the Information Society 6, 323–362 (2008)
[3] Antonakos, E., Roussos, A., Zafeiriou, S.: A survey on mouth modeling and analysis for sign language recognition. In: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (2015)
[4] Athitsos, V., Neidle, C., Sclaroff, S., Nash, J., Stefan, A., Quan Yuan, Thangali, A.: The American Sign Language lexicon video dataset. In: CVPRW (2008)
[5] Bank, R., Crasborn, O., Hout, R.: Variation in mouth actions with manual signs in Sign Language of the Netherlands (NGT). Sign Language & Linguistics 14, 248–270 (2011)
[6] Bilge, Y.C., Ikizler, N., Cinbis, R.: Zero-shot sign language recognition: Can textual data uncover sign languages? In: BMVC (2019)
[7] Buehler, P., Zisserman, A., Everingham, M.: Learning sign language by watching TV (using weakly aligned subtitles). In: CVPR (2009)
[8] Camgoz, N.C., Hadfield, S., Koller, O., Bowden, R.: Using convolutional 3D neural networks for user-independent continuous gesture recognition. In: IEEE International Conference on Pattern Recognition, ChaLearn Workshop (2016)
[9] Camgoz, N.C., Hadfield, S., Koller, O., Bowden, R.: SubUNets: End-to-end hand shape and continuous sign language recognition. In: ICCV (2017)
[10] Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language translation. In: CVPR (2018)
[11] Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: VGGFace2: A dataset for recognising faces across pose and age. In: International Conference on Automatic Face and Gesture Recognition (2018)
[12] Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: Real-time multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008 (2018)
[13] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR (2017)
[14] Chai, X., Wang, H., Chen, X.: The DEVISIGN large vocabulary of Chinese Sign Language database and baseline evaluations. Technical report VIPL-TR-14-SLR-001, Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS (2014)
[15] Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: CVPR (2017)


[16] Chung, J.S., Zisserman, A.: Lip reading in the wild. In: ACCV (2016)
[17] Chung, J.S., Zisserman, A.: Signs in time: Encoding human motion as a temporal image. In: Workshop on Brave New Ideas for Motion Representations, ECCV (2016)
[18] Cooper, H., Bowden, R.: Learning signs from subtitles: A weakly supervised approach to sign language recognition. In: CVPR (2009)
[19] Cooper, H., Pugeault, N., Bowden, R.: Reading the signs: A video based sign dictionary. In: ICCVW (2011)
[20] Cooper, H., Holt, B., Bowden, R.: Sign language recognition. In: Visual Analysis of Humans: Looking at People, chap. 27, pp. 539–562. Springer (2011)
[21] Crasborn, O.A., Van Der Kooij, E., Waters, D., Woll, B., Mesch, J.: Frequency distribution and spreading behavior of different types of mouth actions in three sign languages. Sign Language & Linguistics (2008)
[22] Eng-Jon Ong, Koller, O., Pugeault, N., Bowden, R.: Sign spotting using hierarchical sequential patterns with temporal intervals. In: CVPR (2014)
[23] Farhadi, A., Forsyth, D.: Aligning ASL for statistical translation using a discriminative word model. In: CVPR (2006)
[24] Farhadi, A., Forsyth, D.A., White, R.: Transfer learning in sign language. In: CVPR (2007)
[25] Fillbrandt, H., Akyol, S., Kraiss, K.: Extraction of 3D hand shape and posture from image sequences for sign language recognition. In: IEEE International SOI Conference (2003)
[26] Fisher, C.G.: Confusions among visually perceived consonants. Journal of Speech and Hearing Research 11(4), 796–804 (1968)
[27] Forster, J., Oberdorfer, C., Koller, O., Ney, H.: Modality combination techniques for continuous sign language recognition. In: Pattern Recognition and Image Analysis (2013)
[28] Graves, A., Fernandez, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: ICML (2006)
[29] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
[30] Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
[31] Huang, J., Zhou, W., Li, H., Li, W.: Sign language recognition using 3D convolutional neural networks. In: International Conference on Multimedia and Expo (ICME) (2015)
[32] Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recognition without temporal segmentation. In: AAAI (2018)
[33] Jha, A., Namboodiri, V.P., Jawahar, C.V.: Word spotting in silent lip videos. In: WACV (2018)
[34] Joze, H.R.V., Koller, O.: MS-ASL: A large-scale data set and benchmark for understanding American Sign Language. In: BMVC (2019)


[35] Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding 141, 108–125 (2015)
[36] Koller, O., Ney, H., Bowden, R.: Read my lips: Continuous signer independent weakly supervised viseme recognition. In: ECCV (2014)
[37] Koller, O., Ney, H., Bowden, R.: Weakly supervised automatic transcription of mouthings for gloss-based sign language corpora. In: LREC Workshop on the Representation and Processing of Sign Languages: Beyond the Manual Channel (2014)
[38] Koller, O., Ney, H., Bowden, R.: Deep learning of mouth shapes for sign language. In: Third Workshop on Assistive Computer Vision and Robotics, ICCV (2015)
[39] Koller, O., Zargaran, S., Ney, H.: Re-Sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In: CVPR (2017)
[40] Li, D., Opazo, C.R., Yu, X., Li, H.: Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In: WACV (2019)
[41] Li, D., Yu, X., Xu, C., Petersson, L., Li, H.: Transferring cross-domain knowledge for video sign language recognition. In: CVPR (2020)
[42] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
[43] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV (2016)
[44] Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object detection and beyond. In: ICCV (2011)
[45] Materzynska, J., Berger, G., Bax, I., Memisevic, R.: The Jester dataset: A large-scale video dataset of human gestures. In: ICCVW (2019)
[46] Momeni, L., Afouras, T., Stafylakis, T., Albanie, S., Zisserman, A.: Seeing wake words: Audio-visual keyword spotting. arXiv (2020)
[47] Nguyen, T.D., Ranganath, S.: Tracking facial features under occlusions and recognizing facial expressions in sign language. In: International Conference on Automatic Face and Gesture Recognition (2008)
[48] Ong, E., Cooper, H., Pugeault, N., Bowden, R.: Sign language recognition using sequential pattern trees. In: CVPR (2012)
[49] Pfister, T., Charles, J., Zisserman, A.: Large-scale learning of sign language by watching TV (using co-occurrences). In: BMVC (2013)
[50] Pfister, T., Charles, J., Zisserman, A.: Domain-adaptive discriminative one-shot learning of gestures. In: ECCV (2014)
[51] Schembri, A., Fenlon, J., Rentelis, R., Cormier, K.: British Sign Language Corpus Project: A corpus of digital video data and annotations of British Sign Language 2008–2017 (Third Edition) (2017), http://www.bslcorpusproject.org
[52] Schembri, A., Fenlon, J., Rentelis, R., Reynolds, S., Cormier, K.: Building the British Sign Language corpus. Language Documentation & Conservation 7, 136–154 (2013)


[53] Stafylakis, T., Tzimiropoulos, G.: Zero-shot keyword spotting for visual speech recognition in-the-wild. In: ECCV (2018)
[54] Starner, T.: Visual Recognition of American Sign Language Using Hidden Markov Models. Master's thesis, Massachusetts Institute of Technology (1995)
[55] Sutton-Spence, R.: Mouthings and simultaneity in British Sign Language. In: Simultaneity in Signed Languages: Form and Function, pp. 147–162. John Benjamins (2007)
[56] Sutton-Spence, R., Woll, B.: The Linguistics of British Sign Language: An Introduction. Cambridge University Press (1999)
[57] Tamura, S., Kawasaki, S.: Recognition of sign language motion images. Pattern Recognition 21(4), 343–353 (1988)
[58] Valli, C.: The Gallaudet Dictionary of American Sign Language. Gallaudet University Press (2005)
[59] Viitaniemi, V., Jantunen, T., Savolainen, L., Karppa, M., Laaksonen, J.: S-pot – a benchmark in spotting signs within continuous signing. In: LREC (2014)
[60] von Agris, U., Knorr, M., Kraiss, K.: The significance of facial features for automatic sign language recognition. In: 8th IEEE International Conference on Automatic Face and Gesture Recognition (2008)
[61] Wilbur, R.B., Kak, A.C.: Purdue RVL-SLLL American Sign Language database. School of Electrical and Computer Engineering Technical Report TR-06-12, Purdue University, W. Lafayette, IN 47906 (2006)
[62] Woll, B.: The sign that dares to speak its name: Echo phonology in British Sign Language (BSL). In: Boyes-Braem, P., Sutton-Spence, R. (eds.) The hands are the head of the mouth: The mouth as articulator in sign languages, pp. 87–98. Hamburg: Signum Press (2001)
[63] Yang, S., Luo, P., Loy, C.C., Tang, X.: WIDER FACE: A face detection benchmark. In: CVPR (2016)
[64] Yao, Y., Wang, T., Du, H., Zheng, L., Gedeon, T.: Spotting visual keywords from temporal sliding windows. In: Mandarin Audio-Visual Speech Recognition Challenge (2019)
[65] Ye, Y., Tian, Y., Huenerfauth, M., Liu, J.: Recognizing American Sign Language gestures from within continuous videos. In: CVPRW (2018)
[66] Zhang, M., Lucas, J., Ba, J., Hinton, G.E.: Lookahead optimizer: k steps forward, 1 step back. In: NeurIPS (2019)
[67] Zhou, H., Zhou, W., Zhou, Y., Li, H.: Spatial-temporal multi-cue network for continuous sign language recognition. CoRR abs/2002.03187 (2020)


APPENDIX

This document provides additional results (Section A), details of the video pose distillation model (Section B), and further details of the BSL-1K dataset (Section C).

A Additional Results

In this section, we present complementary results to the main paper. Section A.1 provides a qualitative analysis. Additional experiments investigate the search window size for mouthing (Section A.2), the number of frames for sign recognition (Section A.3), the effect of masking the mouth at test time (Section A.4), ensembling part-specific models (Section A.5), the transfer to co-articulated datasets (Section A.6), and the baselines using other cues (Section A.7).

A.1 Qualitative analysis

We provide a video on our project page1 to illustrate the automatically annotated training samples in our BSL-1K dataset, as well as the results of our sign recognition model on the manually verified test set. Figures A.1 and A.2 present some of these results. In Figure A.1, we provide training samples localised using mouthing cues. In Figure A.2, we provide test samples classified by our I3D model trained on the automatic annotations.

A.2 Size of the search window for visual keyword spotting

We investigate the influence of varying the extent of the temporal window around a given subtitle during the visual keyword spotting phase of dataset collection. For this experiment, we run the visual keyword spotting model with different search window sizes (centring these windows on the subtitle locations), and train sign recognition models (following the protocol described in the main paper, using Kinetics initialisation) on the resulting annotations. We find (Table A.1) that 8-second extended search windows yield the strongest performance on the test set (which is fixed across each run); we therefore use these for all experiments in the main paper.
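To make the windowing concrete, the sketch below shows one way to derive such an extended search window centred on a subtitle's time span; the Subtitle structure and helper names are illustrative assumptions, and only the 8-second default reflects the setting adopted above.

from dataclasses import dataclass

@dataclass
class Subtitle:
    """A weakly-aligned subtitle with start/end times in seconds."""
    text: str
    start: float
    end: float

def search_window(sub: Subtitle, window_secs: float = 8.0,
                  video_duration: float = float("inf")) -> tuple:
    """Return a (start, end) window of window_secs seconds centred on the subtitle.

    The keyword spotter is then run only inside this window: small windows risk
    missing the mouthed keyword, while very large ones make spotting harder.
    """
    centre = 0.5 * (sub.start + sub.end)
    start = max(0.0, centre - window_secs / 2)
    end = min(video_duration, centre + window_secs / 2)
    return start, end

# Example: an 8-second window around a subtitle spanning 12.3s-14.1s.
print(search_window(Subtitle("the weather is lovely", 12.3, 14.1)))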

A.3 Temporal extent of the automatic annotations

Keyword spotting provides a precise localisation in time, but does not determine the duration of the sign. We observe that the highest mouthing confidence is obtained at the end of mouthing. We therefore take a certain number of frames before this peak to include in our sign classification training.

1 https://www.robots.ox.ac.uk/~vgg/research/bsl1k/


Fig. A.1. Mouthing results: Qualitative samples for the visual keyword spotting method for the keywords “happy” and “important”. We visualise the top 24 videos with the most confident mouthing scores for each word. We note the visual similarity among manual features, which suggests that mouthing cues can be a good starting point to automatically annotate training samples.


Fig. A.2. Sign recognition results: Qualitative samples for our sign language recognition I3D model on the BSL-1K test set. We visualise the top 24 videos with the highest classification scores for the signs “orange” and “business”, all of which appear to be correctly classified.


Table A.1. The effect of the temporal window in which we apply the visual keyword spotting model. Networks are trained on BSL-1Km.8 with Kinetics initialisation. Decreasing the window size increases the chance of missing the word, resulting in less training data and lower performance. Increasing it too much makes the keyword spotting task difficult, reducing the annotation quality. We found 8 seconds to be a good compromise, which we used in all other experiments in this paper.

Keyword search window   #videos   per-instance          per-class
                                  top-1    top-5        top-1    top-5
1 sec                    25.0K    60.10    75.42        36.62    53.83
2 sec                    33.9K    64.91    80.98        40.29    59.63
4 sec                    37.6K    68.09    82.79        45.35    63.64
8 sec                    38.9K    69.00    83.79        45.86    64.42
16 sec                   39.0K    65.91    81.84        39.51    59.03

Table A.2. The effect of the number of frames before the mouthing peak used for training. Networks are trained on BSL-1Km.8 with Kinetics initialisation.

#frames   per-instance          per-class
          top-1    top-5        top-1    top-5
16        59.53    77.08        36.16    58.43
20        71.71    85.73        49.64    69.23
24        69.00    83.79        45.86    64.42

In Table A.2, we experiment with this hyper-parameter and see that 20 frames is a good compromise for creating variation in training, while not including too many irrelevant frames. In all of our experiments, we used 24 frames, except in Table 6, which combines the best parameters from each ablation, where we used 20 frames. Note that our I3D model takes 16 consecutive frames as input, which are sliced randomly during training.
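As an illustration of this clip construction, the following sketch cuts a fixed number of frames ending at the mouthing-confidence peak and then takes a random 16-frame slice for the I3D input. The array layout and helper names are assumptions, not the released pipeline.

import numpy as np

def clip_before_peak(frames: np.ndarray, peak_idx: int, num_frames: int = 20) -> np.ndarray:
    """Take num_frames frames ending at the mouthing-confidence peak.

    frames: array of shape (T, H, W, 3) for one episode segment.
    peak_idx: frame index with the highest keyword-spotting confidence.
    """
    start = max(0, peak_idx - num_frames)
    return frames[start:peak_idx]

def random_16_frame_slice(clip: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly slice 16 consecutive frames to form the I3D input during training."""
    if clip.shape[0] <= 16:
        return clip
    start = rng.integers(0, clip.shape[0] - 16 + 1)
    return clip[start:start + 16]

rng = np.random.default_rng(0)
dummy = np.zeros((100, 224, 224, 3), dtype=np.uint8)  # stand-in for decoded video frames
clip = clip_before_peak(dummy, peak_idx=60, num_frames=20)
print(random_16_frame_slice(clip, rng).shape)  # (16, 224, 224, 3)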

A.4 Masking the mouth region at test time

In Table A.3, we compare test modes for networks trained with (i) full frames including the mouth, versus (ii) the mouth region masked. If the mouth is masked only at test time, performance drops from 65.57% to 34.74%, suggesting that the model relies significantly on mouth cues. The model improves to 46.75% if it is forced to focus on other regions by training with the mouth masked.

A.5 Late fusion of part-specific models

We further experiment with ensembling two I3D networks, each specialising on different parts of the human body, by averaging their classification scores, i.e., late fusion. The results are summarised in Table A.4. We observe significant improvements when combining a mouth-specific model (face-crop) with a body-specific model (mouth-masked), which suggests that forcing the network to focus on separate, but complementary, signing cues (68.55%) can be more effective than presenting the full frames (65.57%).



Fig. A.3. Masking the mouth: Sample visualisations for the inputs described in Table 4 (for the signs “important” and “library”). We experiment with masking the mouth region or cropping only the face region using the detected pose keypoints.

Table A.3. We complement Table 4 by investigating different test modes for I3D when trained with or without the mouth pixels. The model trained with full frames relies significantly on the mouth: its performance drops from 65.57% to 34.74% when the mouth is masked. The models are trained on the subset of BSL-1Km.8 where pose estimates are available.

                      Test mouth-masked                     Test full-frame
                      per-instance      per-class           per-instance      per-class
                      top-1   top-5     top-1   top-5       top-1   top-5     top-1   top-5
Train mouth-masked    46.75   66.34     25.85   48.02       46.21   65.34     25.83   46.23
Train full-frame      34.74   51.42     13.62   29.80       65.57   81.33     44.90   64.91

Table A.4. Ensembling part-specific models from Table 4. We observe that combining the I3D model trained only on the face with another model trained without the mouth (last row) achieves superior performance to a single model that takes the full frame as input. This suggests that disentangling manual and non-manual features, which are complementary, is a promising direction for sign recognition. The models are trained on the subset of BSL-1Km.8 where pose estimates are available.

                             per-instance          per-class
                             top-1    top-5        top-1    top-5
face-crop                    42.23    69.70        21.66    50.51
mouth-masked                 46.75    66.34        25.85    48.02
full-frame                   65.57    81.33        44.90    64.91
full-frame & face-crop       64.50    83.01        42.30    65.58
full-frame & mouth-masked    68.09    81.33        46.29    65.41
mouth-masked & face-crop     68.55    83.63        45.29    67.47

This procedure, however, involves the additional complexity of computing human pose and training two separate models; it is therefore used only for experimental purposes.
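The fusion step itself amounts to simple score averaging; a minimal sketch follows, assuming the two part-specific networks output per-class scores (averaged here after a softmax, which is one reasonable choice but not necessarily the exact variant used).

import torch

def late_fusion(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Average the class posteriors of two part-specific models.

    logits_*: unnormalised scores of shape (batch, num_classes), e.g. from a
    face-crop model and a mouth-masked model.
    """
    probs_a = logits_a.softmax(dim=-1)
    probs_b = logits_b.softmax(dim=-1)
    return 0.5 * (probs_a + probs_b)

# Toy example with random scores over a 1,064-sign vocabulary.
a = torch.randn(4, 1064)
b = torch.randn(4, 1064)
fused = late_fusion(a, b)
print(fused.argmax(dim=-1))  # fused top-1 predictions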

Figure A.3 presents sample visualisations for the masking procedure. For mouth-masking, we replace the box covering the mouth region with the average pixel of the region. For face-cropping, we set pixels outside the face region to zero (we observed divergence of training if the mean value was used).
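A sketch of these two transforms is given below, assuming mouth and face boxes have already been derived from the detected pose keypoints; the box coordinates in the example are purely illustrative.

import numpy as np

def mask_mouth(frame: np.ndarray, mouth_box: tuple) -> np.ndarray:
    """Replace the mouth box with the average pixel value of that region."""
    x0, y0, x1, y1 = mouth_box
    out = frame.copy()
    region = out[y0:y1, x0:x1]
    out[y0:y1, x0:x1] = region.mean(axis=(0, 1), keepdims=True).astype(out.dtype)
    return out

def crop_face(frame: np.ndarray, face_box: tuple) -> np.ndarray:
    """Zero out everything outside the face box (zeros rather than the mean value)."""
    x0, y0, x1, y1 = face_box
    out = np.zeros_like(frame)
    out[y0:y1, x0:x1] = frame[y0:y1, x0:x1]
    return out

frame = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
masked = mask_mouth(frame, (90, 140, 150, 180))   # hypothetical mouth box (x0, y0, x1, y1)
face_only = crop_face(frame, (70, 40, 170, 190))  # hypothetical face box (x0, y0, x1, y1)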


A.6 Transferring BSL-1K pretrained model to other datasets

As explained in Section 5.3, we use our model pretrained on BSL-1K as initialisation for transferring to other datasets.

Additional details on fine-tuning for ASL. For the MSASL [34] and WLASL [40] isolated ASL datasets, we used the mouth-masked pretraining to force the model to attend entirely to manual features. We also observed that some signs are identical between ASL and BSL; therefore, instead of randomly initialising the last classification layer, we kept the weights corresponding to words shared between BSL-1K and the ASL dataset (a sketch of this partial transfer is given after the BSL-Corpus results below). We observed slight improvements from both of these choices.

Results on co-articulated datasets. Here, we report results of training sign language recognition on two co-articulated datasets: (i) RWTH-PHOENIX-Weather-2014-T [10, 35] and (ii) BSL-Corpus [51, 52], with and without pretraining on BSL-1K.

The Phoenix dataset is not directly applicable to our model due to the lack of sign-gloss alignment with which to train I3D on short clips of individual signs. We therefore implemented a simple CTC loss [28] to adapt I3D for Phoenix and obtained a 5.6 WER improvement with BSL-1K pretraining over Kinetics pretraining.
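A minimal sketch of this adaptation is shown below: the I3D trunk is slid over the continuous video to produce a sequence of gloss logits, which are trained against the unaligned gloss sequence with a CTC loss. The window/stride values, the stand-in backbone, and the placeholder gloss vocabulary size are assumptions rather than the exact configuration used.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingI3DForCTC(nn.Module):
    """Turn a clip-level sign classifier into a sequence model for CTC training."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_glosses: int):
        super().__init__()
        self.backbone = backbone  # maps (B, 3, 16, H, W) -> (B, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_glosses + 1)  # +1 for the CTC blank (index 0)

    def forward(self, video: torch.Tensor, window: int = 16, stride: int = 8) -> torch.Tensor:
        # video: (B, 3, T, H, W) -> (num_windows, B, num_glosses + 1) log-probabilities
        logits = [self.classifier(self.backbone(video[:, :, s:s + window]))
                  for s in range(0, video.shape[2] - window + 1, stride)]
        return F.log_softmax(torch.stack(logits, dim=0), dim=-1)

# Toy training step; the real backbone would be the BSL-1K pretrained I3D trunk.
backbone = nn.Sequential(nn.Flatten(start_dim=1), nn.LazyLinear(1024))
model = SlidingI3DForCTC(backbone, feat_dim=1024, num_glosses=1000)  # placeholder vocabulary size
video = torch.randn(2, 3, 64, 56, 56)
log_probs = model(video)                  # (T', B, C)
glosses = torch.randint(1, 1001, (2, 5))  # unaligned target gloss indices
loss = F.ctc_loss(log_probs, glosses,
                  input_lengths=torch.full((2,), log_probs.shape[0], dtype=torch.long),
                  target_lengths=torch.full((2,), 5, dtype=torch.long),
                  blank=0)
loss.backward()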

BSL-Corpus is a linguistic dataset and has not been used for computer vision research so far. We defined a train/val/test split (8:1:1 ratio) for a subset of 6k annotations of 966 signs and obtained 24.4% vs 12.8% accuracy with/without BSL-1K pretraining. In this case, we also kept the last-layer classification weights for the words in common between BSL-Corpus and the BSL-1K signs, as sketched below. We observed this to provide small gains over completely random initialisation of the classification weights.
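One way to implement this partial classifier initialisation is sketched here: rows of the pretrained classification layer whose BSL-1K gloss also appears in the target vocabulary are copied, and the remaining rows stay randomly initialised. All names, shapes, and the placeholder vocabularies are illustrative.

import torch
import torch.nn as nn

def init_classifier_from_pretrained(
    new_head: nn.Linear,
    pretrained_head: nn.Linear,
    target_vocab: list,
    source_vocab: list,
) -> None:
    """Copy classifier weights for words shared between the source and target vocabularies."""
    src_index = {word: i for i, word in enumerate(source_vocab)}
    with torch.no_grad():
        for tgt_i, word in enumerate(target_vocab):
            if word in src_index:
                src_i = src_index[word]
                new_head.weight[tgt_i] = pretrained_head.weight[src_i]
                new_head.bias[tgt_i] = pretrained_head.bias[src_i]

# Toy usage: a 1,064-way BSL-1K head transferred to a hypothetical 100-word target vocabulary.
bsl1k_head = nn.Linear(1024, 1064)
target_head = nn.Linear(1024, 100)
bsl1k_vocab = [f"sign_{i}" for i in range(1064)]        # placeholder glosses
target_vocab = [f"sign_{i}" for i in range(0, 200, 2)]  # placeholder glosses, partially overlapping
init_classifier_from_pretrained(target_head, bsl1k_head, target_vocab, bsl1k_vocab)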

We conclude that our large-scale BSL-1K dataset provides a strong initialisation for both co-articulated and isolated datasets, and for a variety of sign languages: ASL (American), BSL (British), and DGS (German).

A.7 Dataset expansion through other cues and additional baselines

In addition to the experiments reported in the paper, we also implemented the dataset labelling technique described in [49], which searches subtitles for signs and picks candidate temporal windows that maximise the area under the ROC curve for positively and negatively labelled bags (here, a positive bag refers to temporal windows that occur within an approximately 400-frame interval centred on the target word). However, we found that without the keyword spotting model for localisation, the annotations collected with this technique were extremely noisy, and the resulting model significantly under-performed all baselines reported in the main paper (which were instead trained on BSL-1K).

We also experimented with dataset expansion by training ensembles of exemplar SVMs [44] for each episode on signs that were predicted as confident positives (greater than 0.8) by our strongest pretrained model, using all temporal windows that did not include the keyword as negatives. In this case, we found it challenging to calibrate the SVM confidences (we explored both the parameters of the original paper, whose authors discuss the difficulties of this process [44], and a range of other parameters); we expanded the dataset by a factor of three, but did not achieve a boost in model performance when training on the expanded data.

B Video Pose Distillation

In this section, we give additional details of the video pose distillation model described in Section 4.3. The model architecture uses an I3D backbone [13] and takes as input a sequence of 16 frames at 224×224 pixels. We remove the classification head used in the original architecture and replace it with a linear layer that projects the 1024-dimensional embedding to 4160 dimensions; this corresponds to a set of 16 per-frame predictions of the xy coordinates of the 130 human pose keypoints produced by an OpenPose [12] model (trained on COCO [42]). The coordinates are normalised (with respect to the dimensions of the input image) to lie in the range [0, 1] and an L2 loss is used to penalise inaccurate predictions. The training data for the pose distillation model comprises one-minute segments from each episode used in the BSL-1K training set. The model is trained for 100 epochs using the Lookahead optimizer [66], with minibatches of 32 clips, a learning rate of 0.1 (reduced by a factor of 10 after 50 epochs), and a weight decay of 0.001.
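A sketch of the distillation head and training objective described above follows; the backbone below is a tiny stand-in for the I3D trunk (which outputs a 1024-dimensional embedding), and mean squared error is used here as the L2 penalty.

import torch
import torch.nn as nn

NUM_FRAMES, NUM_KEYPOINTS = 16, 130  # 16 frames x 130 OpenPose keypoints x (x, y) = 4160 outputs

class PoseDistillationHead(nn.Module):
    """Regress per-frame 2D keypoints from a 1024-dimensional video embedding."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 1024):
        super().__init__()
        self.backbone = backbone
        self.regressor = nn.Linear(embed_dim, NUM_FRAMES * NUM_KEYPOINTS * 2)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, 16, 224, 224) -> (B, 16, 130, 2) normalised keypoint coordinates
        embedding = self.backbone(clip)
        coords = self.regressor(embedding)
        return coords.view(-1, NUM_FRAMES, NUM_KEYPOINTS, 2)

# Toy forward/backward pass with a stand-in backbone and random "teacher" keypoints in [0, 1].
backbone = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 1024))
model = PoseDistillationHead(backbone)
clips = torch.randn(2, 3, NUM_FRAMES, 224, 224)
teacher = torch.rand(2, NUM_FRAMES, NUM_KEYPOINTS, 2)  # OpenPose targets, normalised to [0, 1]
loss = nn.functional.mse_loss(model(clips), teacher)    # the L2 regression objective
loss.backward()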

C BSL-1K Dataset Details

C.1 Sign verification tool

In Figure A.4, we show a screenshot of the verification tool used by annotators to verify or reject signs found by the proposed keyword spotting method in the test set. Annotators can view the sign at reduced speed and indicate whether the sign is correct, incorrect, or they are unsure.

C.2 Dataset source material

The BBC broadcast TV shows, together with their respective numbers of occurrences in the source material used to construct the dataset, are:

Countryfile: 266, Natural World: 70, Great British Railway Journeys: 122, Holby City: 261, Junior Masterchef: 24, Junior Bake Off: 22, Hairy Bikers Bakeation: 6, Masterchef The Professionals: 37, Doctor Who Sci Fi: 23, Great British Menu: 110, A To Z Of Cooking: 24, Raymond Blanc Kitchen Secrets: 9, The Apprentice: 88, Country Show Cook Off: 18, A Taste Of Britain: 20, Lorraine Pascale How To Be A: 6, Chefs Put Your Menu Where Your: 13, Simply Nigella: 7, The Restaurant Man: 5, Hairy Bikers Best Of British: 27, Rip Off Britain Food: 20, Our Food Uk 4: 3, Disaster Chefs: 8, Terry And Mason Great Food Trip: 19, Gardeners World: 70, Paul Hollywood Pies Puds: 20, James Martin Food Map Of Britain: 10, Baking Made Easy: 6, Hairy Bikers Northern: 7, Nigel Slater Eating Together: 6, Raymond Blanc How To Cook Well: 6, Great British Food Revival: 17, Great British Bake Off: 28, Two Greedy Italians: 4, Food Fighters: 10, Hairy Bikers Mums Know Best: 9, Hairy Bikers Meals On Wheels: 6, Paul Hollywood Bread 6: 5, Home Comfort At Christmas: 1

Fig. A.4. Manual annotation: A screenshot of the Whole-Sign Verification Tool.

C.3 Dataset vocabulary

The 1,064 words which form the vocabulary for BSL-1K are:

abortion, about, above, absorb, accept, access, act, activity, actually, add, address, advance, advertise, afford, afghanistan, africa, afternoon, again, against, agree, aids, alcohol, all, already, always, amazed, america, angel, angry, animal, answer, anything, anyway, apple, apprentice, approach, april, apron, arch, archery, architect, area, argue, arm, army, around, arrive, arrogant, art, asian, ask, assess, atmosphere, attack, attention, attitude, auction, australia, austria, automatic, autumn, average, award, awful, baby, back, bacon, bad, balance, ball, ballet, balloon, banana, bank, barbecue, base, basketball, bath, battery, beach, beat, because, bedroom, beef, been, before, behind, belfast, belgium, believe, belt, better, big, billion, bingo, bird, birmingham, birthday, biscuit, bite, bitter, black, blackpool, blame, blanket, blind, blonde, blood, blue, boat, body, bomb, bone, bonnet, book, booked, border, boring, born, borrow, boss, both, bottle, boundary, bowl, box, boxing, branch, brave, bread, break, breathe, brick, bridge, brief, brighton, bring, bristol, britain, brother, brown, budget, buffet, build, bulgaria, bull, bury, bus, bush, business, but, butcher, butter, butterfly, buy, by, cabbage, calculator, calendar, call, camel, camera, can, canada, cancel, candle, cap, captain, caravan, cardiff, careful, carpenter, castle, casual, cat, catch, catholic, ceiling, cellar, certificate, chair, chalk, challenge, chance, change, character, charge, chase, cheap, cheat, check, cheeky, cheese, chef, cherry, chicken, child, china, chips, chocolate, choose, christmas, church, city, clean, cleaner, clear, clever, climb, clock, closed, clothes, cloud, club, coffee, coffin, cold, collapse, collect, college, column, combine, come, comedy, comfortable, comment, communicate, communist, community, company, compass, competition, complicated, compound, computer, concentrate, confident, confidential, confirm, continue, control, cook, copy, corner, cornwall, cottage, council, country, course, court, cousin, cover, cracker, crash, crazy, create, cricket, crisps, cross, cruel, culture, cup, cupboard, curriculum, custard, daddy, damage, dance, danger, daughter, deaf, debate, december, decide, decline, deep, degree, deliver, denmark, dentist, depend, deposit, depressed, depth, derby, desire, desk, detail, detective, devil, different, dig, disabled, disagree, disappear, disappoint, discuss, disk, distract, divide, dizzy, doctor, dog, dolphin, donkey, double, downhill, drawer, dream, drink, drip, drive, drop, drunk, dry, dublin, dvd, each, early, east, easter, easy, edinburgh, egypt, eight, elastic, electricity, elephant, eleven, embarrassed, emotion, empty, encourage, end, engaged, england, enjoy, equal, escalator, escape, ethnic, europe, evening, everything, evidence, exact, exchange, excited, excuse, exeter, exhausted, expand, expect, expensive, experience, explain, express, extract, face, factory, fairy, fall, false, family, famous, fantastic, far, farm, fast, fat, father, fault, fax, february, feed, feel, fence, fifteen, fifty, fight, film, final, finance, find, fine, finish, finland, fire, fireman, first, fish, fishing, five, flag, flat, flock, flood, flower, fog, follow, football, for, foreign, forever, forget, fork, formal, forward, four, fourteen, fox, france, free, freeze, fresh, friday, friend, frog, from, front, fruit, frustrated, fry, full, furniture, future, game, garden, general, generous, geography, germany, ginger, girl, give, glasgow, glass, gold, golf, gorilla, gossip, government, grab, grandfather, grandmother, greedy, green, group, grow, guarantee, guess, guilty, gym, hair, half, hall, hamburger, hammer, hamster, handshake, hang, happen, happy, hard, hat, have, headache, hearing, heart, heavy, helicopter, hello, help, hide, history, holiday, holland, home, hope, hopeless, horrible, horse, hospital, hot, hotel, hour, house, how, hundred, hungry, hypocrite, idea, if, ignore, imagine, impact, important, impossible, improve, income, increase, independent, india, influence, inform, information, injection, insert, instant, insurance, international, internet, interrupt, interview, introduce, involve, ireland, iron, italy, jamaica, january, japan, jealous, jelly, jersey, join, joke, jumper, just, kangaroo, karate, keep, kitchen, label, language, laptop, last, late, later, laugh, leaf, leave, leeds, left, lemon, library, lighthouse, lightning, like, line, link, list, little, live, liverpool, london, lonely, lost, love, lovely, machine, madrid, magic, make, man, manage, manchester, many, march, mark, marry, match, maybe, meaning, meat, mechanic, medal, meet, meeting, melon, member, mention, message, metal, mexico, milk, million, mind, minute, mirror, miserable, miss, mistake, mix, monday, money, monkey, month, more, morning, most, mother, motorbike, mountain, mouse, move, mum, music, must, name, nasty, national, naughty, navy, necklace, negative, nervous, never, newcastle, newspaper, next, nice, nightclub, normal, north, norway, not, nothing, notice, november, now, number, nursery, object, october, offer, office, old, on, once, one, onion, only, open, operate, opposite, or, oral, orange, order, other, out, oven, over, overtake, owl, own, pack, page, pager, paint, pakistan, panic, paper, paris, park, partner, party, passport, past, pattern, pay, payment, pence, penguin, people, percent, perfect, perhaps, period, permanent, person, personal, persuade, petrol, photo, piano, picture, pig, pineapple, pink, pipe, place, plain, plan, plaster, plastic, plate, please, plenty, plumber, plus, point, poland, police, politics, pop, popular, porridge, portsmouth, portugal, posh, poster, potato, pound, practise, praise, prefer, pregnant, president, pretend, price, priest, print, prison, problem, professional, professor, profile, profit, project, promise, promote, protestant, proud, provide, pub, pull, pulse, punch, purple, push, pyramid, quality, quarter, question, quick, quiet, quit, rabbit, race, radio, rain, rather, read, reading, ready, really, receipt, receive, recommend, red, reduce, region, regular, relationship, relax, release, relief, remember, remind, remove, rent, repair, replace, research, resign, respect, retire, review, rhubarb, rich, ride, right, river, rocket, roll, roman, room, roots, rough, round, rub, rubbish, rugby, run, russia, sack, sad, safe, same, sand, sandwich, satellite, saturday, sauce, sausage, school, science, scissors, scotland, scratch, search, second, seed, seem, self, sell, send, sense, sensitive, sentence, separate, september, sequence, service, settle, seven, sex, shadow, shakespeare, shame, shark, sharp, sheep, sheet, sheffield, shine, shirt, shop, should, shoulder, shout, show, shower, sick, sight, sign, silver, similar, since, sister, sit, six, size, skeleton, skin, sleep, sleepy, small, smell, smile, snake, soft, some, sometimes, son, soon, sorry, south, spain, specific, speech, spend, spicy, spider, spirit, split, sport, spray, spread, squash, squirrel, staff, stand, star, start, station, still, story, straight, strange, stranger, strawberry, stress, stretch, strict, string, strong, structure, stubborn, stuck, student, study, stupid, success, sugar, summer, sun, sunday, sunset, support, suppose, surprise, swan, swap, sweden, sweep, swim, swing, switzerland, sympathy, take, talk, tap, taste, tax, taxi, teacher, team, technology, television, temperature, temporary, ten, tennis, tent, terminate, terrified, that, theory, therefore, thing, think, thirsty, thousand, three, through, thursday, ticket, tidy, tiger, time, tired, title, toast, together, toilet, tomato, tomorrow, toothbrush, toothpaste, top, torch, total, touch, tough, tournament, towel, town, train, training, tram, trap, travel, tree, trouble, trousers, true, try, tube, tuesday, turkey, twelve, twenty, twice, two, ugly, ultrasound, umbrella, uncle, under, understand, unemployed, union, unit, university, until, up, upset, valley, vegetarian, video, vinegar, visit, vodka, volcano, volunteer, vote, wait, wales, walk, wall, want, war, warm, wash, waste, watch, water, weak, weather, wednesday, week, weekend, well, west, wet, what, wheelchair, when, where, which, whistle, white, who, why, wicked, width, wild, will, win, wind, window, wine, with, without, witness, wolf, woman, wonder, wonderful, wood, wool, work, world, worry, worship, worth, wow, write, wrong, yes, yesterday.

