Automatic Prediction for User Engagement Intention in Multi-party Human-Robot Interaction
Zhang Zhijie
Supervisor: Assoc. Prof. Zheng Jianmin
Co-Supervisor: Prof. Nadia Magnenat Thalmann
School of Computer Science and Engineering
Jan. 31, 2019
• Problem Statement
• State of the Art
• Proposed Ideas
• Implementation
• Results
• Summary
Outline
2
• Social robots are expected to engage in real-world human-robot interaction, serving as receptionists, service providers or companions.
• To interact with human users, robots have to be attuned to the characteristics of the environment and understand user intentions.
1. Problem Statement (Background)
3
• Greet whom? Speakers, addressees, side-participants or passers-by.
• How confident? The person's level of willingness.
• How to generate behaviours? Say hi, wave a hand, make eye contact?
1. Problem Statement (Background)
4
By predicting people's engagement intention, a social robot is able to
• avoid the social cost of interacting with an unwilling participant
• handle dynamic environments
• support a natural interaction process by planning behaviour in advance
1. Problem Statement (Objectives)
5
• It is difficult to obtain credible data.
• The time between the intention arising and the actual interaction is short.
• Human intention is often difficult to recognize from user behaviours, a task that is challenging even for human beings.
1. Problem Statement (Challenges)
6
• Heuristic spatial model [1]
• Data-driven approach [2]
• Schegloff body torque model [3]
• Comparison between different approaches [4]
• Action-state transition model [5]
2. State of the Art
7
[5] Ozaki, Y., Ishihara, T., Matsumura, N., Nunobiki, T. and Yamada, T., 2018, August. Decision-Making Prediction for Human-Robot
Engagement between Pedestrian and Robot Receptionist. In 2018 27th IEEE International Symposium on Robot and Human Interactive
Communication (RO-MAN), pp. 208-215.
• Inputs:
(a) spatial distance from a laser tracker
(b) head pose from a camera
• Output labels:
(a) engaged
(b) not-engaged
2. State of the Art (spatial model)
8
[1] Michalowski, M.P., Sabanovic, S. and Simmons, R., 2006. A spatial model of
engagement for a social robot. In 9th IEEE International Workshop on Advanced
Motion Control, pp. 762-767.
• Inputs:
(a) face location, width and height
(b) face frontal features (frontal or not, with a confidence score)
(c) trajectories of the previous two features
• Output labels: (a) positive engagement intention
(b) negative engagement intention
2. State of the Art (data-driven model 1)
9
[2] Bohus, D. and Horvitz, E., 2009. Learning to predict
engagement with a spoken dialog system in open-world settings. In
Proceedings of the SIGDIAL 2009 Conference: The 10th Annual
Meeting of the Special Interest Group on Discourse and Dialogue,
pp. 244-252.
• Inputs:
(a) stance for feet, hips, torso and shoulders
(b) torque angle for hips, torso and shoulders
(c) skeleton distance
(d) face position and size
(e) speech activity and source localization
• Output labels: (a) will interact
(b) no one
(c) someone around
2. State of the Art (data-driven model 2)
10
[3] Vaufreydaz, D., Johal, W. and Combe, C., 2016. Starting engagement detection
towards a companion robot using multimodal features. Robotics and Autonomous
Systems, 75, pp.4-16.
• Inputs:
(a) 3D coordinates of the head and hands
(b) the angle of the torso
(c) speaking status (speaking or not)
• Output labels: (a) Not Seeking Engagement
(b) Seeking Engagement
• Hand-coded rule:
(a) the distance between the robot and the head is less than 30 cm
(b) the absolute torso angle is under 10 degrees
2. State of the Art (comparison)
11
[4] Foster, M.E., Gaschler, A. and Giuliani, M., 2017. Automatically Classifying User Engagement for Dynamic Multi-party
Human–Robot Interaction. International Journal of Social Robotics, 9(5), pp.659-674.
[Table: prediction accuracy of the compared classifiers on cross-validation and on the test set [4]]
• Heuristic methods that do not consider the real social cues of human-human interaction
• Lack of systematic ideas and psychological grounding
• Unsatisfactory experimental results and unstable predictions
2. State of the Art (limitations)
12
Yellow indicates No Engagement Intention; Blue indicates Has Engagement Intention
• six stages of general greeting situations [6]
• body movements, gestures, facial expressions and salutations are useful signals for predicting human engagement intention [7][8]
• people's body behaviours change as they approach to initiate an interaction [7][8]
• people use different cues to predict others' intentions depending on distance [9]
3. Proposed Idea (psychology)
13
[6] Kendon, A. and Ferber, A., 1973. A description of some human greetings. Comparative ecology and behaviour of
primates, 591(668), p.12.
[7] Mead, R., Atrash, A. and Matarić, M.J., 2011, November. Proxemic feature recognition for interactive robots: automating
metrics from the social sciences. In International conference on social robotics (pp. 52-61). Springer, Berlin, Heidelberg.
[8] Schegloff, E.A., 1998. Body torque. Social Research, pp.535-596.
[9] Langton, S.R., Watt, R.J. and Bruce, V., 2000. Do the eyes have it? Cues to the direction of social attention. Trends in
cognitive sciences, 4(2), pp.50-59.
The six stages of general greeting situations [6]:
• Orientation and initiation of the approach.
• Distant salutation: movements and gestures that officially acknowledge that a greeting sequence has been initiated.
• Head dip, which signals a transition between acts and a shift of psychological orientation.
• The approach, during which the greeting process continues; participants may move toward each other, exchange gaze and extend their arms.
• Final approach, with smiling and mutual gaze.
• The close salutation, including ritualistic speech such as "How are you?" and body contact such as a handshake.
3. Proposed Idea (psychology)
14
[6] Kendon, A. and Ferber, A., 1973. A description of some human greetings. Comparative ecology and behaviour of
primates, 591(668), p.12.
• body movements, gestures, facial expressions and salutations are useful signals for predicting human engagement intention [7][8]
• people's body behaviours change as they approach to initiate an interaction [7][8]
• people use different cues to predict others' intentions depending on distance [6][9]
3. Proposed Idea (psychology)
15
[7] Mead, R., Atrash, A. and Matarić, M.J., 2011, November. Proxemic feature recognition for interactive robots: automating
metrics from the social sciences. In International conference on social robotics (pp. 52-61). Springer, Berlin, Heidelberg.
[8] Schegloff, E.A., 1998. Body torque. Social Research, pp.535-596.
[9] Langton, S.R., Watt, R.J. and Bruce, V., 2000. Do the eyes have it? Cues to the direction of social attention. Trends in
cognitive sciences, 4(2), pp.50-59.
Features, used as inputs for predicting a participant's engagement intention, can be categorized into four subgroups (a container sketch follows the list below):
* Pose denotes 3D position and orientation
3. Proposed Idea (included features)
16
• Spatial Information: Location & Distance, Moving Speed
• Body Postures: Head Pose*, Torso Pose, Stance Pose, Hand Pose
• Facial Expression: Gaze Direction, Smile Detection
• Audio Signals: Speaking State
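To make the grouping concrete, a possible feature container is sketched below; the field names and types are illustrative assumptions, not the actual schema used in this work.

```python
# Hypothetical container for the four feature subgroups (illustrative only).
from dataclasses import dataclass
import numpy as np

@dataclass
class EngagementFeatures:
    # Spatial information
    location: np.ndarray      # 3D position relative to the robot
    distance: float           # metres from the robot
    moving_speed: float       # m/s
    # Body postures (*pose = 3D position and orientation)
    head_pose: np.ndarray
    torso_pose: np.ndarray
    stance_pose: np.ndarray
    hand_pose: np.ndarray
    # Facial expression
    gaze_direction: np.ndarray
    smile_detected: bool
    # Audio signals
    speaking: bool
```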
Utilise different features as a user approaches the robot (a selection sketch follows this slide):
• in the nearest space, where the user's gaze is detectable, we use gaze direction;
• in the middle-distance space, we use head orientation;
• in the periphery space, where only the user's body can be detected confidently, we use body orientation.
3. Proposed Idea
17
[Figure: interaction zones at increasing distance from the robot]
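A minimal sketch of the distance-based cue selection described above, assuming the EngagementFeatures container from the previous slide; the zone thresholds (1.2 m and 3.6 m, roughly the personal/social proxemic boundaries) are illustrative assumptions, not values from this work.

```python
# Pick the most reliable orientation cue for the current distance zone.
# Thresholds are hypothetical stand-ins for the zones sketched above.
def select_orientation_cue(features: "EngagementFeatures", distance_m: float,
                           near_m: float = 1.2, periphery_m: float = 3.6):
    if distance_m <= near_m:       # nearest space: gaze is detectable
        return features.gaze_direction
    if distance_m <= periphery_m:  # middle-distance space: head orientation
        return features.head_pose
    return features.torso_pose     # periphery: only the body is reliable
```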
For the prediction, we considered two data-driven approaches:
• a hidden Markov model (HMM), and
• a gated recurrent unit (GRU) network.
3. Proposed Idea
18
Start probability P(y0): given by the frame-level logistic regression prediction.
Emission probability P(xi | yi): estimated from the classifier's confusion matrix (HEI: Has Engagement Intention; NEI: No Engagement Intention):
P(x = HEI | y = HEI) = TP / (TP + FN)
P(x = NEI | y = HEI) = FN / (TP + FN)
P(x = HEI | y = NEI) = FP / (FP + TN)
P(x = NEI | y = NEI) = TN / (FP + TN)
Transition probability P(yi | yi−1): set by hand. Assume that when a human is performing an intentional action, s/he keeps the intention in mind, so transitions from one state to another do not happen frequently:
P(yi = HEI | yi−1 = HEI) = 0.8
P(yi = NEI | yi−1 = HEI) = 0.2
P(yi = HEI | yi−1 = NEI) = 0.2
P(yi = NEI | yi−1 = NEI) = 0.8
Optimization: find Ŷ = argmax_Y P(y0, y1, …, yT | x0, x1, …, xT), e.g. by Viterbi decoding (a sketch follows below).
3. Proposed Idea
19
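A minimal decoding sketch of the HMM above: frame-level classifier outputs serve as observations, the emission matrix is built from hypothetical confusion counts, and Viterbi decoding recovers the most likely intention sequence. This is an illustration of the technique, not the authors' implementation.

```python
import numpy as np

STATES = ("HEI", "NEI")  # Has / No Engagement Intention

# Transition matrix from the slide's assumption: intentions persist.
TRANS = np.array([[0.8, 0.2],
                  [0.2, 0.8]])

def emission_from_confusion(tp, fn, fp, tn):
    """Emission matrix P(x | y) from the classifier's confusion counts."""
    return np.array([[tp / (tp + fn), fn / (tp + fn)],   # y = HEI
                     [fp / (fp + tn), tn / (fp + tn)]])  # y = NEI

def viterbi(obs, start, trans, emit):
    """Most likely state sequence for observed labels (0 = HEI, 1 = NEI)."""
    T = len(obs)
    logp = np.zeros((T, 2))
    back = np.zeros((T, 2), dtype=int)
    logp[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        for s in range(2):
            scores = logp[t - 1] + np.log(trans[:, s])
            back[t, s] = int(np.argmax(scores))
            logp[t, s] = scores[back[t, s]] + np.log(emit[s, obs[t]])
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [STATES[s] for s in reversed(path)]

# Hypothetical confusion counts; a single noisy NEI frame gets smoothed out.
emit = emission_from_confusion(tp=80, fn=20, fp=10, tn=90)
print(viterbi([0, 0, 1, 0, 0], start=np.array([0.5, 0.5]),
              trans=TRANS, emit=emit))
```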
Update gate: z_t = σ(W_z x_t + U_z h_{t−1})
Reset gate: r_t = σ(W_r x_t + U_r h_{t−1})
Current memory content: h̃_t = tanh(W x_t + U (r_t ⊙ h_{t−1}))
Final memory at the current time step: h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t
Maximize the log-likelihood of the model given the training sequences: max_θ (1/N) Σ_n log p_θ(y_n | x_n)
(a numerical sketch follows the references below)
3. Proposed Idea
20
[10] Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua
(2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". arXiv:1406.1078.
[11] Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural
Networks on Sequence Modeling". arXiv:1412.3555.
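A numerical sketch of a single GRU step implementing the gate equations above; the weights and dimensions are random stand-ins, not trained parameters from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    z = sigmoid(W_z @ x + U_z @ h_prev)              # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev)              # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h_prev))  # current memory content
    return z * h_prev + (1.0 - z) * h_tilde          # final memory at this step

d_in, d_h = 8, 16  # illustrative feature and hidden sizes
params = [rng.standard_normal(shape) * 0.1
          for shape in [(d_h, d_in), (d_h, d_h)] * 3]  # W_z,U_z,W_r,U_r,W_h,U_h
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # a 5-frame feature sequence
    h = gru_step(x, h, *params)
print(h.shape)  # (16,)
```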
Models are trained offline on a data set published by Foster et al. [4], with
• training data: 5090 instances (hand-labelled)
• testing data: 361 instances
(a schematic sketch follows this slide)
4. Implementation
21
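A schematic of the offline train/test protocol, using a frame-level logistic regression as in the HMM pipeline; load_foster_dataset is a hypothetical placeholder that generates random stand-ins sized like the published split [4].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def load_foster_dataset():
    """Hypothetical loader: random features/labels sized like the real split."""
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5090 + 361, 12))
    y = rng.integers(0, 2, size=5090 + 361)
    return X[:5090], y[:5090], X[5090:], y[5090:]

X_train, y_train, X_test, y_test = load_foster_dataset()
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("cross-validation accuracy:",
      cross_val_score(clf, X_train, y_train, cv=5).mean())
print("test-set accuracy:", clf.score(X_test, y_test))
```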
The models are planned to be tested on our virtual human (a rough sketch of the sensing loop follows this slide).
• Users will interact in a free-topic, face-to-face conversation.
• The visual sensor data will be captured by an ORBBEC Astra camera at around 30 fps.
• There is also a microphone for tracking audio signals.
4. Implementation
22
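A rough sketch of the planned real-time loop; OpenCV's capture API stands in for the ORBBEC Astra SDK, and both helpers are hypothetical placeholders for the real feature extraction and prediction steps.

```python
import cv2

def extract_features(frame):
    """Hypothetical stand-in for the multimodal feature extraction."""
    return float(frame.mean())

def predict_intention(features):
    """Hypothetical stand-in for the HMM/GRU prediction."""
    return "HEI" if features > 127 else "NEI"

cap = cv2.VideoCapture(0)  # stand-in for the Astra RGB stream (~30 fps)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    print(predict_intention(extract_features(frame)))
cap.release()
```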
5. Results
23
Our predictions compare favourably with Foster's [4].
• Prediction accuracies on both cross-validation and the test set are higher.
• Stability also improves considerably.
24
5. Results
[Figure: per-frame predictions for Sessions 1-6; row 1: Annotation (Ground Truth), row 2: HMM, row 3: GRU]
25
5. Results
Yellow indicates No Engagement Intention
Blue indicates Has Engagement Intention
• To interact with human users, robots have to understand user intentions.
• The distance-based HMM & GRU provide better results in terms of accuracy and stability.
• The work still needs to be tested in real-time experiments.
• Behaviour generation and responses could be studied in the future.
Summary
26