EU FP7-ICT-2011.2.1
ICT for Cognitive Systems and Robotics - 600796
Work Package 2: Human Behaviour Analysis and Mobility Assistance Models
Deliverable D2.2: Multimodal sensory corpora annotations
Release date: 15-09-2014
Status: public
Document name: MOBOT_WP2_D2.2
WP2: Work Package number; D2.2: Deliverable number (compare DoW)
EXECUTIVE SUMMARY
This deliverable describes the procedures followed during the post-processing of the multisensory data, as well as the annotation of the visual and audio data acquired during the recording/measurement sessions that took place at the Agaplesion Bethanien Hospital/Geriatric Centre of the University of Heidelberg in November 2013.
The data from the motion capture system were processed in order to gather precise
kinematic information from every action of the MOBOT patient related to the use of
and interaction with the MOBOT passive-rollator device. Such data contribute to the
understanding of elderly-specific motion sequences and general behaviour in MOBOT
related scenarios and also serve as an important input for other research areas involved
in the MOBOT project, such as gait pattern analysis and classification, motion
recognition, safety analysis, on-line control and optimization, as well as the mechanical
design of the device.
The post-processing of motion capture data generally includes two main steps: labelling
and cleaning the raw data in Qualisys software, and extracting human motion data in
Visual3D software. The use of the image-based motion capture system (Qualisys) was
preferred over the initially planned IMU-based system (XSens) since it proved to
provide much more precise results and could be adjusted in accordance with the other
sensors involved in the trials. However, the use of the image-based system in the post-
processing of the recorded data required much more effort than what had been
anticipated for the IMU-based recordings.
The post-processing of the audiovisual data serves two goals: first, the synchronization of all media; second, the in-depth annotation of the data, which provides timestamps for actions, gestures and speech as well as for audio and visual noise, supplying all technical partners with measurable information about the content of the acquired data. The analysis, testing and exploitation of these data will become an important source for the different modules of the MOBOT robotic platform.
The following sections of this deliverable first give a brief introduction regarding the multisensory data; subsequently, a section focuses on the post-processing of the acquired data, offering some quantitative information and describing the synchronization procedure for all the multisensory data and the creation of the PiP video files for the annotation procedure. The third and final section of this deliverable focuses on the conception and realisation of the audiovisual data annotation schemes.
Deliverable Identification Sheet
IST Project No.: FP7 – ICT for Cognitive Systems and Robotics - 600796
Acronym: MOBOT
Full title: Intelligent Active MObility Assistance RoBOT integrating Multimodal Sensory Processing, Proactive Autonomy and Adaptive Interaction
Project URL: http://www.mobot-project.eu
EU Project Officer: Michel Brochard
Deliverable: D2.2 Multimodal sensory corpora annotations
Work package: WP2 Human Behaviour Analysis and Mobility Assistance Models
Date of delivery: Contractual M18; Actual 15-09-2014
Status: Final
Nature: Other
Dissemination Level: Public
Authors (Partner): Evita Fotinea (ATHENA), Athanasia-Lida Dimou (ATHENA)
Responsible Author: Athanasia-Lida Dimou, Email <[email protected]>, Partner ATHENA, Phone +30-210-6875358
Keywords: Data synchronization, data post-processing, multimodal sensory corpora annotation
Version Log
Issue Date | Rev No. | Author | Change
15-07-2014 | 1.0 | Evita Fotinea | First draft
17-07-2014 | 1.1 | Athanasia-Lida Dimou | Added ILSP data
25-07-2014 | 1.2 | Angelika Peer, Milad Geravand | Added TUM data
25-07-2014 | 1.3 | Khai-Long Ho Hoang | Added UHEI data
26-07-2014 | 1.4 | Athanasia-Lida Dimou, Panagiotis Karioris, Theodore Goulas | Text & figures, finalization of collaborative text on annotation tiers from ICCS, INRIA
28-07-2014 | 2.0 | Evita Fotinea, Eleni Efthimiou | Finalization
01-08-2014 | 2.1 | Davide Dorradi | Internal reviewer's comments
17-08-2014 | 2.1 | Iassonas Kokkinos | Internal reviewer's comments
28-08-2014 | 2.2 | Athanasia-Lida Dimou, Evita Fotinea, Eleni Efthimiou | Internal reviewers' comments, processing & modification, finalization
TABLE OF CONTENTS
Executive Summary
1. Introduction
1.1. Creating multimodal sensory corpora
2. Data Post-Processing
2.1. Quantitative information of the obtained multisensory data
2.2. Synchronization of the acquired multimodal sensory corpus
2.3. Post-Processing for the motion capture system
2.3.1. Visual3D
2.4. Video post-processing for PiP file creation
3. Annotation schemes for the audiovisual data
3.1. Introduction: Aims, scope
3.2. Annotation Scheme: From generic to specific
3.2.1. Generic
3.2.2. Specific
4. Conclusion
References

LIST OF FIGURES
Figure 2-1: Motion capture markers on one of the patients of the MOBOT recordings
Figure 2-2: Motion capture data exported from Qualisys (left) and its corresponding biomechanical human model with segment definition (right)
Figure 2-3: Coordinate frames associated to the human body segments
Figure 3-1: Sample of an annotation PiP video (Scenario 3, Patient 6)

LIST OF TABLES
Table 2-1: Total number of files by type of visual sensor for annotation
Table 3-1: Duration and number of files to be annotated per scenario/variant
LIST OF ABBREVIATIONS
Abbreviation | Description
PR | Public Report
WP | Work Package

Partner Abb. | Description
TUM | TECHNISCHE UNIVERSITAET MUENCHEN
ICCS | INSTITUTE OF COMMUNICATION AND COMPUTER SYSTEMS
INRIA | INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
ECP | ÉCOLE CENTRALE DES ARTS ET MANUFACTURES
UHEI | RUPRECHT-KARLS-UNIVERSITAET HEIDELBERG
ILSP / ATHENA RC | INSTITUTE FOR LANGUAGE AND SPEECH PROCESSING / ATHENA RESEARCH AND INNOVATION CENTER IN INFORMATION COMMUNICATION & KNOWLEDGE TECHNOLOGIES
ACCREA | ACCREA BARTLOMIEJ MARCIN STAŃCZYK
BETHANIEN | BETHANIEN KRANKENHAUS - GERIATRISCHES ZENTRUM - GEMEINNUTZIGE GMBH
DIAPLASIS | DIAPLASIS REHABILITATION CENTER SA
1. INTRODUCTION
With the data acquisition of the recording/measurement procedure at the Agaplesion/Bethanien Hospital successfully completed in November 2013, a complete multisensory dataset was available for use. Post-processing included several operations that were indispensable for providing the data in different, functional formats. The synchronization of the acquired multimodal sensory corpus was one such operation, described in more detail in the next section.
With reference to the audiovisual data, the previously synchronized data drawn from the
multiple video streams of the MOBOT dataset (4 different visual sources) were
provided in a Picture in Picture (PiP) format suitable for the annotation procedure.
Lastly, the PiP video files along with a modular annotation template – adjustable to the
needs of each scenario/variant – were provided to the annotators so that they could
manually annotate the audiovisual data that would set the basis for the HRI
communication model of the project.
1.1. Creating multimodal sensory corpora
The proposal on the recording scenarios for the multimodal sensory corpora was
developed and finalized after discussions among all partners involved in data
acquisition. The planning of these recording scenarios was completed in collaboration
with all MOBOT partners by September 2013; the recording/measurement sessions took
place at BETHANIEN and were completed in November 2013.
The recording/measurement procedure as well as all the relevant details regarding the
data acquisition process i.e. patients, scenarios, performed tasks, etc. are extensively
presented in the publicly available deliverable D2.1- Data acquisition and multimodal
sensory corpora collection. To keep this document self-contained, we briefly
summarize below the three types of sensors that provided the data to be post-processed:
Motion capture system: For the purposes of motion capturing, a Qualisys system
with 8 cameras was used. The cameras were mounted on tripods and placed around
the recording area. Passive reflective markers were attached to the bodies of the patient and the carer to measure their limb movements. The marker set was
specially chosen after taking into account several limiting factors of the recording
population as well as potential supporting areas on the human body (which should
stay free of markers). In order to distinguish between carer and patient, two
additional markers were added to the head of the carer. Further visual markers were
added to objects placed in the acquisition space such as the door, the door frame,
and the obstacle.
Audiovisual data: these were captured with HD cameras, Kinect cameras, a GoPro camera and microphone arrays. For the recording sessions the following media were used.
o Microphone Arrays: an array of MEMS microphones was obtained for R&D
purposes, even though such devices are not yet a commercial product.
o 4 HD cameras:
Central: It was placed so as to record the patient when walking within the
recording area.
Global: It was placed so as to cover any optical gaps and provide further
information of the patient’s motion and posture as well as details of
manoeuvring.
GoPro: It was set on the passive rollator. The main criterion for the choice of this particular camera was its ability to record the patient's torso and arms (and, in some cases, the head) at close range and from a stable distance.
Side: This camera was always placed on one side to supplement the other cameras with information they missed. Its position was not predefined and changed according to the scenario, so as to achieve the optimum position and offer the best viewing angle.
o 2 Kinect Cameras: We decided on using two Kinect-for-Windows (KFW)
sensors that come equipped with the ‘near mode’ option.
Upper Kinect: The first sensor was facing horizontally towards the patient,
aiming at capturing the area of the torso, waist, hips and the upper part of the
limbs.
Lower Kinect: The second sensor was facing downwards, capturing lower
limb motion, so as to enable the estimation of 3D limb positions and,
eventually, the analysis of gait abnormalities.
o Other sensors: laser range finders and force/torque sensors were mounted on
the rollator.
Laser range finders: two laser range finder sensors are mounted on the
rollator. One is on the front, facing towards the direction of motion, to
provide a full scan of the walking area. The other one is on the back, facing
towards the user's legs, aiming to provide data on the gait of the patient.
Force/torque sensors: two 6 DoF force/torque sensors are placed on the
handles of the rollator.
Below we briefly report on the labour-intensive post-processing of all the acquired data.
2. DATA POST-PROCESSING
2.1. Quantitative information of the obtained multisensory data
In total, six different scenarios were recorded, each serving a distinct purpose in the data
acquisition process. Each of these six scenarios had three variants; these variants were
specially designed in order to provide the participants with enough flexibility to perform
the easiest variants in case they could not perform all of them. Each variant of each scenario was designed in a manner that would allow it to be repeated 1-5 times. The number of trials varied according to the difficulty of each scenario/variant and/or the performance of the individual informants.
Motion capture system: Qualisys motion tracking system with eight infrared
cameras, 48 reflective markers (10 additional markers for static trials), recording
volume: 2.7m x 3m x 2m (width, length, height)
Audiovisual data: HD cameras, Kinect Cameras, GoPro Camera, Microphone
Arrays
Sensor data: laser range sensors.
Table 2-1: Total number of files by type of visual sensor for annotation
2.2. Synchronization of the acquired multimodal sensory corpus
After the MOBOT multimodal multisensory corpus had been recorded at the premises of the Bethanien/Agaplesion hospital, the acquired video files (from the external HD cameras as well as from the Kinect and GoPro cameras mounted on the sensorized passive rollator) were collected, and the raw Kinect files were synchronized with the rest of the visual data, providing a synchronization scheme between the external HD data and the RosBag-related multimodal sensory data.
The 2 HD cameras and the GoPro camera were synchronized to the extracted RGB topic
of the upper Kinect camera. Through this synchronization (HD with RGB topic of the
Upper Kinect) it was possible to have access to all other sensory data within the RosBag
files at specific timestamps, as indicated by the visual sensors.
All visual media were synchronized via a manually triggered flashlight during the recording/measurement sessions. The order in which the sensors started recording created a time offset between the start of the RosBag capture and the flashlight: typically, in the recording sessions the RosBag was initialized first, followed by the initialization of all audiovisual media (HD cameras and microphone arrays), and afterwards the flashlight was manually triggered; the flash constitutes the first frame of the PiP video files used for the annotation of the audiovisual data (see below). So, in order to calculate the RosBag initial timestamp, the time offset needed to be added to the timestamp of the flashlight.
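To make the alignment concrete, the following minimal sketch (with hypothetical frame index, frame rate and offset values; the real offsets were measured per recording session) maps a frame of the PiP video to a time on the RosBag clock, using the flash as the common reference point:

```python
# Hypothetical values for illustration; the actual offsets were measured per session.
FLASH_FRAME = 137              # frame index of the flash in the PiP video
PIP_FPS = 25.0                 # assumed frame rate of the PiP video
ROSBAG_TO_FLASH_OFFSET = 4.2   # seconds between RosBag start and the flash


def pip_frame_to_rosbag_time(frame_idx: int) -> float:
    """Map a PiP video frame index to a time (in seconds) on the RosBag clock.

    The flash marks the common zero point of all audiovisual media, so a
    frame's RosBag time is its distance from the flash frame plus the
    measured RosBag-to-flash offset.
    """
    seconds_after_flash = (frame_idx - FLASH_FRAME) / PIP_FPS
    return ROSBAG_TO_FLASH_OFFSET + seconds_after_flash


# Example: RosBag time of an annotation starting at PiP frame 512
print(pip_frame_to_rosbag_time(512))
```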
2.3. Post-Processing for the motion capture system
The post-processing procedure of the motion capture data is split up into two parts:
Cleaning of raw data and labelling of marker trajectories involving the QTM-
manager software:
The image-based 3D-recordings of the trials are cleaned from gaps, phantom
markers, flickering and other inconsistencies which occur due to occlusions,
reflections, loose clothes of the patient, missing markers, and other unexpected incidents during the recordings. Marker trajectories that have been mismatched
by the automatic marker identification algorithms of the software have to be
identified and reassigned manually. The corrected marker trajectories are
identified and labelled according to a unique nomenclature specifically
developed for the MOBOT recordings. Overall, labelling the raw motion capture data is a complex task, as labelling the recording of a single trial can take up to several hours due to the inconsistencies of the recorded data mentioned above. Despite these difficulties, the labelling process is still ongoing. The following figure shows the placement of the markers on one of the patients.
Figure 2-1: Motion capture markers on one of the patients of the MOBOT
Recordings
Reconstruction of human model and motion from marker data involving the
Visual3D software:
A 15-segmented model of the human body is generated for each patient by
assigning the segments to marker sets. The consistency of the segment-marker
set assignment depends strongly on the quality of the results from the previous
step. In some cases, virtual markers can be used to replace missing markers in
the recordings. Each assigned segment is associated with biomechanical
parameters that represent the mass and inertia properties of the segment. The
model is applied to the static and dynamic recording files of the corresponding
patient so that motion data, such as position/orientation of body segments and
relative joint angles, is extracted.
2.3.1. Visual3D
Visual3D is research software for 3D motion capture data analysis and modelling and is used for the derivation of the patients' motion data. Model definition, application of the model to the motion capture data, and extraction of the desired model-based data are the three main steps to be carried out in Visual3D.
A biomechanical human model including 15 body segments was defined for each
patient (see Figure 2-1). Each body segment was defined by using “static trials” and
tracked during “movement trials”. To define each body segment, firstly calibration
markers in static trials were used to associate the orientation of the segment’s coordinate
systems relative to the tracking markers. Secondly, the segment's endpoints were defined considering the mid-point between the medial and lateral markers of each pair of calibration markers, as well as a proper association of the segment's radius. Thirdly, the frontal plane of each segment was defined by the proper connection of at least three markers constituting a plane. Finally, the segment coordinate system was defined as follows: the z-axis by connecting the segment end points, the y-axis perpendicular to the z-axis and the frontal plane, and the x-axis perpendicular to the z-axis and y-axis (right-hand rule). Figure 2-3 shows the currently assigned coordinate frames.
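To illustrate the convention described above, the sketch below (not part of the Visual3D pipeline; marker positions are hypothetical) constructs a right-handed segment coordinate frame from the two segment endpoints and a third marker spanning the frontal plane:

```python
import numpy as np


def segment_frame(prox_end, dist_end, frontal_marker):
    """Build a right-handed segment coordinate frame (illustrative sketch).

    prox_end, dist_end: 3D endpoints of the segment.
    frontal_marker: a third marker chosen so that it spans the frontal
    plane together with the two endpoints.
    Returns a 3x3 matrix whose columns are the x, y and z axes.
    """
    prox_end, dist_end, frontal_marker = (np.asarray(p, dtype=float)
                                          for p in (prox_end, dist_end, frontal_marker))

    # z-axis: along the line connecting the segment end points
    z = dist_end - prox_end
    z /= np.linalg.norm(z)

    # y-axis: perpendicular to the z-axis and to the frontal plane
    in_plane = frontal_marker - prox_end
    y = np.cross(z, in_plane)
    y /= np.linalg.norm(y)

    # x-axis: perpendicular to the z- and y-axes (right-hand rule)
    x = np.cross(y, z)

    return np.column_stack((x, y, z))


# Hypothetical thigh segment: hip centre, knee centre, lateral thigh marker
R = segment_frame([0.0, 0.10, 0.90], [0.0, 0.12, 0.50], [0.08, 0.10, 0.70])
print(R)
```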
After that, the human model for each patient was applied to his/her movement trials.
Biomechanical model-based calculations were then defined and performed to extract the
desired information. Currently, joint angles, joint velocities, and joint accelerations have been derived, while the computation of other values can be carried out on consortium request. Exploring the data from the movement trials and associating these data with the
model is done visually for each single trial. If errors or no correspondence between
obtained and expected results are found, both the model parameters and the marker sets
defining body segments are modified (as much as required) in Visual3D or markers are
amended in Qualisys. Approved results are saved for further analysis in WP1 and WP2.
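As a simple illustration of how joint velocities and accelerations can be obtained from the extracted joint angles, the sketch below applies central finite differences to a sampled angle trajectory; the sampling rate and the angle values are synthetic, and Visual3D's own pipeline may use different filtering and differentiation settings:

```python
import numpy as np

# Synthetic knee-angle trajectory (degrees) sampled at an assumed 100 Hz
fs = 100.0
t = np.arange(0.0, 2.0, 1.0 / fs)
knee_angle = 30.0 + 25.0 * np.sin(2.0 * np.pi * 0.5 * t)

# Central finite differences for velocity and acceleration
knee_velocity = np.gradient(knee_angle, 1.0 / fs)         # deg/s
knee_acceleration = np.gradient(knee_velocity, 1.0 / fs)  # deg/s^2

print(knee_velocity[:5], knee_acceleration[:5])
```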
Figure 2-2: Motion capture data exported from Qualisys (left),
and its corresponding biomechanical human model with segment definition (right).
Figure 2-3: Coordinate frames associated to the human body segments.
2.4. Video post-processing for PiP file creation
Along with the synchronization procedure, the "picture in picture" (PiP) video files were
created. The importance of this procedure is considerable as the resulting video files
provide the input which the annotators are called to manually annotate. The streams
from each medium were obtained independently, and were brought together in a PiP
stream in order to have all the information accumulated so as to facilitate the annotation
procedure of the video channel.
The PiP video files consist of 4 visual inputs:
– 1 Kinect camera (Upper Kinect): Upper Left
– 1 Go Pro camera: Upper Right
– HD Central: Lower Left
– HD Global: Lower Right
The files from the GoPro camera as well as the files from the RGB topic of the upper Kinect (Kinect/RosBag) were converted into MPEG-2 format; this was decided due to codec incompatibilities in handling noise (shadows, ghosting, etc.).
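For illustration only, the sketch below shows one way to compose such a 2x2 PiP mosaic from four already synchronized streams using OpenCV; the file names, tile size and frame rate are hypothetical, and the actual PiP files were produced with the project's own toolchain:

```python
import cv2
import numpy as np

# Hypothetical, already synchronized input streams (one per quadrant)
sources = ["upper_kinect.mp4", "gopro.mp4", "hd_central.mp4", "hd_global.mp4"]
caps = [cv2.VideoCapture(f) for f in sources]

tile_w, tile_h, fps = 640, 360, 25   # assumed tile size and frame rate
out = cv2.VideoWriter("pip.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                      fps, (2 * tile_w, 2 * tile_h))

while True:
    frames = []
    for cap in caps:
        ok, frame = cap.read()
        if not ok:                   # stop at the end of the shortest stream
            frames = None
            break
        frames.append(cv2.resize(frame, (tile_w, tile_h)))
    if frames is None:
        break
    top = np.hstack(frames[:2])      # Upper Kinect (left), GoPro (right)
    bottom = np.hstack(frames[2:])   # HD Central (left), HD Global (right)
    out.write(np.vstack([top, bottom]))

for cap in caps:
    cap.release()
out.release()
```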
3. ANNOTATION SCHEMES FOR THE AUDIOVISUAL DATA
The following table presents the duration of each scenario variant, the participants that
performed it, as well as the amount of files to be annotated.
Table 3-1: Duration and number of files to be annotated per scenario/variant.
3.1. Introduction: Aims, scope
For the annotation procedure, the primary recordings were retained and new ones at maximum resolution (HD) were created; in parallel, the recording files were also generated in a compressed format (mp4) in order to facilitate the annotation procedure. Furthermore, we favoured the creation of files in a lossy format and at a lower resolution (e.g. 426x240) for better management of the video files as well as faster access and exchange among partners.
The annotation of the visual data was performed in the ELAN environment (ELAN 4.6.2, see footnote 1), an annotation environment specifically designed for the processing of
multimodal resources [Brugman et al., 2004]. Annotation is time aligned; each channel
of information was annotated into a separate annotation tier, which may consist of
several sub tiers according to the level of the fine-grained information that is needed.
The output of the annotation procedure was exported into .xml files. Previous
consortium agreement on prioritization of annotating the different scenarios had dictated
the following scenario order for the actual annotation procedures: 1,2,3,6,4,5. However,
as the annotation scheme was adapted to the needs of the project, providing extra
annotation tiers for the more complex scenarios/variants and hence making annotation
more time-consuming, the prioritization order was altered in order to promote all b
variants (with the rollator in following mode): 1,2,3.b, 3.b.2, 3.a, 6.a, 5.b, 4.a, 3.c, 4.c,
5.a
1 http://tla.mpi.nl/tools/tla-tools/elan/ , Max Planck Institute for Psycholinguistics, The Language Archive,
Nijmegen, The Netherlands
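Since the annotation output is exported as XML, technical partners can read the time-aligned tiers programmatically. The sketch below assumes ELAN's default EAF-style layout (TIME_ORDER, TIER and ALIGNABLE_ANNOTATION elements); the file name and the tier identifier are hypothetical and only illustrate how tier segments and their timestamps could be extracted:

```python
import xml.etree.ElementTree as ET


def read_tier(eaf_path, tier_id):
    """Return (start_ms, end_ms, value) tuples for one tier of an EAF-style
    ELAN export (illustrative sketch; default EAF layout assumed)."""
    root = ET.parse(eaf_path).getroot()

    # Map time-slot ids to millisecond values
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT")}

    segments = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            value = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((start, end, value))
    return segments


# Hypothetical usage: action annotations for one PiP file
for start, end, value in read_tier("scenario3_patient6.eaf", "Action"):
    print(f"{start:>8} - {end:>8} ms  {value}")
```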
All the PiP video files along with all the annotation files were uploaded to two sites:
a. the TUM server of the MOBOT project
b. an ILSP server
3.2. Annotation Scheme: From generic to specific
A preliminary inspection of the audiovisual data dictated the creation of at least 5
different major annotation tiers describing the scenario, the predefined tasks in each
scenario, the actions that were eventually performed, information from the audio
channel (noise, oral commands), and information from the visual channel (noise, gestures, pauses, stumbling, etc.). In parallel, discussions among partners led to the finalization
of the annotation schemes followed in the different recording scenarios of the MOBOT
multimodal multisensory corpus.
3.2.1. Generic
This generic annotation scheme, which was later enriched, consisted of the following three annotation clusters, each containing multiple tiers; a brief description of each is given below, followed by a schematic outline of the template:
1. Information about the annotated data: Scenario, variant, task. This cluster
provides information, much like metadata, regarding the source of the annotated
data (which scenario and which variant) as well as a standard account of the
duration of the tasks that the patients were asked to perform.
2. Visual Input: Performed actions and gestures of the patient and visual noise
coming from the recording environment (mostly the carers). This annotation
cluster is the richest one as it provides more in-depth information regarding the
actions that were eventually performed and the gestures of the participants:
a. Within the same timestamps that were attributed for the duration of each task in
the task tier (see above) the annotator marks all actually performed actions. This,
in most cases, except for the few in scenario 1 in which the task and the actual
performed action coincide, signifies that the time boundaries set to describe a
task are divided into shorter segments and are aligned to the different sets of
actions each task includes, i.e. a simple sit-to-stand transfer with the rollator in
following mode is annotated as “Sit-to-stand” in both Task and Action tier while
the same sit-to-stand transfer with the rollator assisting the patient is annotated
as “Sit-to-stand” in the task tier and contains two segments in the Action tier:
“Grasp handles” and “Sit-to-stand”.
b. Two tiers were attributed to the annotation of the gestures: one marking the duration of the gesturing activity and attributing the equivalent command, and another marking the handshape (the hand formation with which the gesture was performed), adopting the HamNoSys notation system [Hanke, T. 2004] (see footnote 2).
c. In the generic annotation scheme, this cluster contained also annotations
regarding the carer, which were generally described as noise.
2 http://www.sign-lang.uni-hamburg.de/projects/hamnosys.html
3. Audio Input: Uttered speech vs. non-speech and audio commands. In its original state, this annotation cluster contained two tiers:
a. A speech/non-speech tier, which actually marked all parts of the audio sequence
that contained clearly uttered and fairly comprehensible speech versus all parts
that contained noise and noise-like audio interventions as well as non-
comprehensible speech parts, which in most cases consisted of sequences in
which many people talked at the same time.
b. An audio command tier in which timestamps for all audio commands, deriving
either from the patient or the carer(s), are marked.
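Purely as a schematic outline of the generic template referred to above (the cluster and tier names below paraphrase the description and are not the literal identifiers used in the ELAN files), the scheme can be pictured as a nested structure:

```python
# Illustrative outline of the generic annotation template; the names below
# paraphrase the text above and are not the literal ELAN tier identifiers.
GENERIC_TEMPLATE = {
    "Annotated data (metadata)": ["Scenario", "Variant", "Task"],
    "Visual input": [
        "Actions",               # actually performed actions within each task
        "Gesture (command)",     # duration of gesturing + equivalent command
        "Gesture (handshape)",   # HamNoSys handshape notation
        "Visual noise",          # mostly interference from the carers
    ],
    "Audio input": [
        "Speech / non-speech",
        "Audio commands",        # commands uttered by patient or carer(s)
    ],
}

for cluster, tiers in GENERIC_TEMPLATE.items():
    print(cluster, "->", ", ".join(tiers))
```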
3.2.2. Specific
Further along the annotation procedure, and following valuable discussions with all technical partners, it became clear that the finalization of the annotation scheme needed to take into consideration the particularities of each scenario variant in order to provide truly usable annotation data. Consequently, the above-described annotation scheme template was first enriched as a whole and then adapted according to the needs of each particular scenario variant.
Figure 3-1: Sample of an annotation PiP video (Scenario 3, Patient 6).
After this adaptation, the template output of the final version of the annotation scheme
is not a generic scheme but an in-depth annotation/analysis. The alterations and
enhancements that were made with respect to the previous annotation scheme are briefly
described below:
1. Actions/Actions_2: In several variants it was noted that the patient performed a task which comprised two actions that could not be linearly annotated in one tier; a very typical example is the one given above with the "Grasp handles" and the "Sit-to-stand" transfer. So another tier named "Actions_2" was added in order to provide the correct timestamps and durations of each patient's actions. Furthermore, the list of actions was enhanced in general, as the descriptions of the tasks needed to be divided into more specific actions.
2. Gestures_Type of movement: In several cases, when asked to perform specific
gestures, patients made mistakes which can be clustered into 3 types: a. they
repeated the movement of a previous gesture with the correct handshape, b. their
gesture had a small but significant deviation from the gesture that was shown to
them (i.e. only one hand or use of different finger), or c. they performed something
entirely different from what they had been shown. In order to provide sufficient data on the patients' performance in the first two cases, an additional tier was needed, to be activated only in the cases where this kind of mistake appears. This tier would contain annotations describing the type of movement actually performed.
3. Visual Noise: All involved partners that acquired the annotated data made it
specifically clear that the annotation of most, if not all, types of visual noise that are
present in the MOBOT multisensory data is indispensable; a multisensory recording
dataset such as this, in which the interference of the assistive personnel is
unavoidable, is bound to provide data that need to be treated for noise reduction.
After long but fruitful discussions, it was decided that noise would be annotated for
two out of the four cameras, the Upper Kinect and the GoPro camera.
a. As far as the GoPro camera is concerned, we annotate the presence or the
absence of noise (another person or part of another person) in the vicinity of
(next to) the patient. In case of presence we also mark location of the noise with
respect to the patient (right, left, both).
b. As far as the Kinect camera is concerned, three types of noise are marked:
i. The presence or absence of noise in the frame, which is narrower than the
respective one of the GoPro camera. In case of presence we also mark location
of the noise with respect to the patient (right, left, both).
ii. The cases in which there is occlusion of a carer or an object in the space
between the front of the patient and the Upper Kinect. As in both previous
cases, we also locate the position of the noise in the frame with respect to the
patient (right, left, both).
iii. The cases in which the carer touches the patient; this type of noise includes all
instances in which a carer or part of his/her body is situated right next to the
patient or behind the body of the patient. The location of the noise is also
annotated (right, left, both) with respect to the patient.
4. Visual Channel / Carer: After having discussed the importance of annotating the
presence or absence of noise, it was also made clear that the presence of the assistive
personnel within the visual data can also be treated usefully; therefore, three tiers
dedicated to the carers were added under the section that was intended for the
annotation of the patient actions; of course, the presence of values in these three
tiers implies that the respective noise tiers are active as the carer is present within
the Kinect frame.
i. Meaningful Gesture: This tier is dedicated to the cases in which the carer is
visible in the Kinect camera, showing the patient how to perform the specific
set of gestures. This tier is complementary to the gesture tier, as it provides
information about the visual input the patients had in order to perform the
gestures, a control over the patients’ performance and additional quantitative
data of people performing the specific gestures.
In case the tier for noise visibility in the Kinect camera is active, the Activity tier or the Stationary tier should also be marked.
ii. Activity: In this tier we mark that the carer is within the frame of the Upper
Kinect and is moving in a location near the patient.
iii. Stationary: In this tier we mark that the carer is within the frame of the Upper
Kinect and is standing still (possibly moving his/her arms and the upper part of
the body but remaining in the same position) in a location near the patient.
5. Audio Noise: A tier was added in which the verbal commands are translated into
English.
The finalization of the annotation scheme provided a template that is modular according
to the needs of each scenario variant. This procedure, although extremely interesting,
was proven far more time-consuming than expected, due to its elaboration and
adaptation to scenario-specific needs.
4. CONCLUSION
The D2.2 deliverable presents a methodological input on the post-processing of the
multimodal data for the MOBOT project.
Several levels of post-processing were presented: from the synchronization of all multisensory media, a major milestone in the MOBOT project, to the post-processing of the force/torque sensors and the motion capture system, to the PiP video creation for the annotation of the audiovisual data, as well as the actual annotation of the latter.
Significant effort was devoted to the creation of the annotation scheme finally adopted, as it was important that all information valuable to the technical partners be included, in a way that would enable a fast as well as effective annotation procedure.
The measurable outcome of this deliverable will be evaluated through a peer-reviewed
crosscheck of the annotated videos as well as from the feedback that will be received
from all involved partners.
It is foreseen that for a restricted part of the multimodal corpus a further, more fine-grained annotation will be performed. This, after being further discussed among
involved partners, will focus on resolving specific modelling/training issues that the
present annotation scheme might not be able to cover.
REFERENCES
[Brugman et al., 2004] H. Brugman, A. Russel, "Annotating Multimedia/Multi-modal Resources with ELAN", In: Proceedings of LREC 2004, 4th International Conference on Language Resources and Evaluation, 2004.
[Folstein et al., 1975] M. F. Folstein, S. E. Folstein, and P. R. McHugh, "Mini-mental state. A practical method for grading the cognitive state of patients for the clinician", Journal of Psychiatric Research, vol. 12(3), pp. 89-98, 1975.
[Hanke, T. 2004] T. Hanke, "HamNoSys – Representing sign language data in language resources and language processing contexts", In: O. Streiter, C. Vettori (eds.), LREC 2004, Workshop Proceedings: Representation and Processing of Sign Languages, Paris: ELRA, 2004.
[Matthes et al., 2012a] S. Matthes, T. Hanke, A. Regen, J. Storz, S. Worseck, E. Efthimiou, A.-L. Dimou, A. Braffort, J. Glauert, E. Safar, "Dicta-Sign – Building a Multilingual Sign Language Corpus", In: Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon (LREC 2012), Istanbul, Turkey, 2012.
[Matthes et al., 2012b] S. Matthes, T. Hanke, J. Storz, E. Efthimiou, A.-L. Dimou, P. Karioris, A. Braffort, A. Choisier, J. Pelhate, E. Safar, "Elicitation Tasks and Materials Designed for Dicta-Sign's Multi-lingual Corpus", In: Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon (LREC 2012), Istanbul, Turkey, 2012.
[Wallraven et al., 2011] C. Wallraven, M. Schultze, B. Mohler, A. Vatakis and K. Pastra, "The POETICON Enacted Scenario Corpus – A Tool for Human and Computational Experiments on Action Understanding", In: 9th IEEE Conference on Automatic Face and Gesture Recognition (FG'11), 2011.