Automatic Prediction for User Engagement Intention in Multi-party Human-Robot Interaction
Zhang Zhijie
Supervisor: Assoc. Prof. Zheng Jianmin
Co-Supervisor: Prof. Nadia Magnenat Thalmann
School of Computer Science and Engineering
Jan. 31, 2019
• Problem Statement
• State of the Art
• Proposed Ideas
• Implementation
• Results
• Summary
Outline
2
• Social robots are expected to engage in real-world human-robot interaction, serving as receptionists, service providers or companions.
• To interact with human users, robots have to be attuned to the characteristics of the environment and understand user intentions.
1. Problem Statement (Background)
3
• Greet whom? Speakers, addressees, side-participants or passers-by.
• How confident? The person's level of willingness.
• How to generate behaviours? Say hi, wave a hand, make eye contact?
1. Problem Statement (Background)
4
By predicting people's engagement intention, a social robot is able to
• avoid the social cost of interacting with an unwilling participant
• handle dynamic environments
• support a natural interaction process by planning behaviour in advance
1. Problem Statement (Objectives)
5
• It is difficult to obtain credible data.
• The time between the intention arising and the actual interaction is short.
• Human intention is often difficult to recognize from user behaviours, a task that is challenging even for human beings.
1. Problem Statement (Challenges)
6
• Heuristic spatial model [1]
• Data-driven approach [2]
• Schegloff body torque model [3]
• Comparison between different approaches [4]
• Action-state transition model [5]
2. State of the Art
7
[5] Ozaki, Y., Ishihara, T., Matsumura, N., Nunobiki, T. and Yamada, T., 2018, August. Decision-Making Prediction for Human-Robot
Engagement between Pedestrian and Robot Receptionist. In 2018 27th IEEE International Symposium on Robot and Human Interactive
Communication (RO-MAN), pp. 208-215.
• Inputs:
(a) spatial distance from a laser tracker
(b) head pose from a camera
• Output labels:
(a) engaged
(b) not-engaged
2. State of the Art (spatial model)
8
[1] Michalowski, M.P., Sabanovic, S. and Simmons, R., 2006. A spatial model of
engagement for a social robot. In 9th IEEE International Workshop on Advanced
Motion Control, pp. 762-767.
• Inputs:
(a) face location, width and height
(b) face frontal features (frontal or not, with a confidence score)
(c) trajectories of the previous two features
• Output labels: (a) positive engagement intention
(b) negative engagement intention
2. State of the Art (data-driven model 1)
9
[2] Bohus, D. and Horvitz, E., 2009. Learning to predict
engagement with a spoken dialog system in open-world settings. In
Proceedings of the SIGDIAL 2009 Conference: The 10th Annual
Meeting of the Special Interest Group on Discourse and Dialogue,
pp. 244-252.
• Inputs:
(a) stance for feet, hips, torso and shoulders
(b) torque angle for hips, torso and shoulders
(c) skeleton distance
(d) face position and size
(e) speech activity and source localization
• Output labels: (a) will interact
(b) no one
(c) someone around
2. State of the Art (data-driven model 2)
10
[3] Vaufreydaz, D., Johal, W. and Combe, C., 2016. Starting engagement detection
towards a companion robot using multimodal features. Robotics and Autonomous
Systems, 75, pp.4-16.
• Inputs:
(a) 3D coordinates of the head and hands
(b) the angle of the torso
(c) speaking status (speaking or not)
• Output labels: (a) Not Seeking Engagement
(b) Seeking Engagement
• Hand-coded rule:
(a) the distance between the robot and the head is less than 30 cm
(b) the absolute torso angle is under 10 degrees
2. State of the Art (comparison)
11
[4] Foster, M.E., Gaschler, A. and Giuliani, M., 2017. Automatically Classifying User Engagement for Dynamic Multi-party
Human–Robot Interaction. International Journal of Social Robotics, 9(5), pp.659-674.
[Table: prediction accuracy of the compared classifiers on cross-validation and on the test set [4]]
• Heuristic methods that do not consider the real social cues of human-human interaction
• Lack of systematic ideas and psychological grounding
• Unsatisfactory experimental results and unstable predictions
2. State of the Art (limitations)
12
Yellow indicates No Engagement Intention; Blue indicates Has Engagement Intention
• six stages of general greeting situations [6]
• body movements, gestures, facial expressions and salutations are useful signals for predicting human engagement intention [7][8]
• people's body behaviours change as they approach to initiate an interaction [7][8]
• people use different cues to predict others' intentions depending on distance [9]
3. Proposed Idea (psychology)
13
[6] Kendon, A. and Ferber, A., 1973. A description of some human greetings. Comparative ecology and behaviour of
primates, 591(668), p.12.
[7] Mead, R., Atrash, A. and Matarić, M.J., 2011, November. Proxemic feature recognition for interactive robots: automating
metrics from the social sciences. In International conference on social robotics (pp. 52-61). Springer, Berlin, Heidelberg.
[8] Schegloff, E.A., 1998. Body torque. Social Research, pp.535-596.
[9] Langton, S.R., Watt, R.J. and Bruce, V., 2000. Do the eyes have it? Cues to the direction of social attention. Trends in
cognitive sciences, 4(2), pp.50-59.
The six stages of general greeting situations [6]:
• Orientation and initiation of the approach.
• Distant salutation: movements and gestures that officially acknowledge that a greeting sequence has been initiated.
• Head dip, which signals a transition between acts and a shift of psychological orientation.
• The approach, during which the greeting process continues; participants may move toward each other, exchange gaze and extend their arms.
• Final approach, with smiling and mutual gaze.
• The close salutation, including ritualistic speech such as "How are you?" and body contact such as a handshake.
3. Proposed Idea (psychology)
14
[6] Kendon, A. and Ferber, A., 1973. A description of some human greetings. Comparative ecology and behaviour of
primates, 591(668), p.12.
• body movements, gestures, facial expressions and salutations are useful signals for predicting human engagement intention [7][8]
• people's body behaviours change as they approach to initiate an interaction [7][8]
• people use different cues to predict others' intentions depending on distance [6][9]
3. Proposed Idea (psychology)
15
[7] Mead, R., Atrash, A. and Matarić, M.J., 2011, November. Proxemic feature recognition for interactive robots: automating
metrics from the social sciences. In International conference on social robotics (pp. 52-61). Springer, Berlin, Heidelberg.
[8] Schegloff, E.A., 1998. Body torque. Social Research, pp.535-596.
[9] Langton, S.R., Watt, R.J. and Bruce, V., 2000. Do the eyes have it? Cues to the direction of social attention. Trends in
cognitive sciences, 4(2), pp.50-59.
Features, used as inputs for predicting a participant's engagement intention, can be categorized into four subgroups (a container sketch follows the list below):
* Pose denotes 3D position and orientation
3. Proposed Idea (included features)
16
• Spatial Information: Location & Distance, Moving Speed
• Body Postures: Head Pose*, Torso Pose, Stance Pose, Hand Pose
• Facial Expression: Gaze Direction, Smile Detection
• Audio Signals: Speaking State
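To make the grouping concrete, a possible feature container is sketched below; the field names and types are illustrative assumptions, not the actual schema used in this work.

```python
# Hypothetical container for the four feature subgroups (illustrative only).
from dataclasses import dataclass
import numpy as np

@dataclass
class EngagementFeatures:
    # Spatial information
    location: np.ndarray      # 3D position relative to the robot
    distance: float           # metres from the robot
    moving_speed: float       # m/s
    # Body postures (*pose = 3D position and orientation)
    head_pose: np.ndarray
    torso_pose: np.ndarray
    stance_pose: np.ndarray
    hand_pose: np.ndarray
    # Facial expression
    gaze_direction: np.ndarray
    smile_detected: bool
    # Audio signals
    speaking: bool
```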
Utilise different features as a user approaches the robot (a selection sketch follows this slide):
• in the nearest space, where the user's gaze is detectable, we use gaze direction;
• in the middle-distance space, we use head orientation;
• in the periphery space, where only the user's body can be detected confidently, we use body orientation.
3. Proposed Idea
17
[Figure: interaction zones at increasing distance from the robot]
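A minimal sketch of the distance-based cue selection described above, assuming the EngagementFeatures container from the previous slide; the zone thresholds (1.2 m and 3.6 m, roughly the personal/social proxemic boundaries) are illustrative assumptions, not values from this work.

```python
# Pick the most reliable orientation cue for the current distance zone.
# Thresholds are hypothetical stand-ins for the zones sketched above.
def select_orientation_cue(features: "EngagementFeatures", distance_m: float,
                           near_m: float = 1.2, periphery_m: float = 3.6):
    if distance_m <= near_m:       # nearest space: gaze is detectable
        return features.gaze_direction
    if distance_m <= periphery_m:  # middle-distance space: head orientation
        return features.head_pose
    return features.torso_pose     # periphery: only the body is reliable
```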
For the prediction, we considered two data-driven approaches:
• a hidden Markov model (HMM), and
• a gated recurrent unit (GRU) network.
3. Proposed Idea
18
Start probability P(y0): given by the frame-level logistic regression prediction.
Emission probability P(xi | yi): estimated from the classifier's confusion matrix (HEI: Has Engagement Intention; NEI: No Engagement Intention):
P(x = HEI | y = HEI) = TP / (TP + FN)
P(x = NEI | y = HEI) = FN / (TP + FN)
P(x = HEI | y = NEI) = FP / (FP + TN)
P(x = NEI | y = NEI) = TN / (FP + TN)
Transition probability P(yi | yi−1): set by hand. Assume that when a human is performing an intentional action, s/he keeps the intention in mind, so transitions from one state to another do not happen frequently:
P(yi = HEI | yi−1 = HEI) = 0.8
P(yi = NEI | yi−1 = HEI) = 0.2
P(yi = HEI | yi−1 = NEI) = 0.2
P(yi = NEI | yi−1 = NEI) = 0.8
Optimization: find Ŷ = argmax_Y P(y0, y1, …, yT | x0, x1, …, xT), e.g. by Viterbi decoding (a sketch follows below).
3. Proposed Idea
19
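A minimal decoding sketch of the HMM above: frame-level classifier outputs serve as observations, the emission matrix is built from hypothetical confusion counts, and Viterbi decoding recovers the most likely intention sequence. This is an illustration of the technique, not the authors' implementation.

```python
import numpy as np

STATES = ("HEI", "NEI")  # Has / No Engagement Intention

# Transition matrix from the slide's assumption: intentions persist.
TRANS = np.array([[0.8, 0.2],
                  [0.2, 0.8]])

def emission_from_confusion(tp, fn, fp, tn):
    """Emission matrix P(x | y) from the classifier's confusion counts."""
    return np.array([[tp / (tp + fn), fn / (tp + fn)],   # y = HEI
                     [fp / (fp + tn), tn / (fp + tn)]])  # y = NEI

def viterbi(obs, start, trans, emit):
    """Most likely state sequence for observed labels (0 = HEI, 1 = NEI)."""
    T = len(obs)
    logp = np.zeros((T, 2))
    back = np.zeros((T, 2), dtype=int)
    logp[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        for s in range(2):
            scores = logp[t - 1] + np.log(trans[:, s])
            back[t, s] = int(np.argmax(scores))
            logp[t, s] = scores[back[t, s]] + np.log(emit[s, obs[t]])
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [STATES[s] for s in reversed(path)]

# Hypothetical confusion counts; a single noisy NEI frame gets smoothed out.
emit = emission_from_confusion(tp=80, fn=20, fp=10, tn=90)
print(viterbi([0, 0, 1, 0, 0], start=np.array([0.5, 0.5]),
              trans=TRANS, emit=emit))
```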
Update gate: z_t = σ(W_z x_t + U_z h_{t−1})
Reset gate: r_t = σ(W_r x_t + U_r h_{t−1})
Current memory content: h̃_t = tanh(W x_t + U (r_t ⊙ h_{t−1}))
Final memory at the current time step: h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t
Maximize the log-likelihood of the model given the training sequences: max_θ (1/N) Σ_n log p_θ(y_n | x_n)
(a numerical sketch follows the references below)
3. Proposed Idea
20
[10] Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua
(2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". arXiv:1406.1078.
[11] Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural
Networks on Sequence Modeling". arXiv:1412.3555.
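A numerical sketch of a single GRU step implementing the gate equations above; the weights and dimensions are random stand-ins, not trained parameters from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    z = sigmoid(W_z @ x + U_z @ h_prev)              # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev)              # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h_prev))  # current memory content
    return z * h_prev + (1.0 - z) * h_tilde          # final memory at this step

d_in, d_h = 8, 16  # illustrative feature and hidden sizes
params = [rng.standard_normal(shape) * 0.1
          for shape in [(d_h, d_in), (d_h, d_h)] * 3]  # W_z,U_z,W_r,U_r,W_h,U_h
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # a 5-frame feature sequence
    h = gru_step(x, h, *params)
print(h.shape)  # (16,)
```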
Models are trained offline on a data set published by Foster et al. [4], with
• training data: 5090 instances (hand-labelled)
• testing data: 361 instances
(a schematic sketch follows this slide)
4. Implementation
21
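A schematic of the offline train/test protocol, using a frame-level logistic regression as in the HMM pipeline; load_foster_dataset is a hypothetical placeholder that generates random stand-ins sized like the published split [4].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def load_foster_dataset():
    """Hypothetical loader: random features/labels sized like the real split."""
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5090 + 361, 12))
    y = rng.integers(0, 2, size=5090 + 361)
    return X[:5090], y[:5090], X[5090:], y[5090:]

X_train, y_train, X_test, y_test = load_foster_dataset()
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("cross-validation accuracy:",
      cross_val_score(clf, X_train, y_train, cv=5).mean())
print("test-set accuracy:", clf.score(X_test, y_test))
```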
The models are planned to be tested on our virtual human (a rough sketch of the sensing loop follows this slide).
• Users will interact in a free-topic, face-to-face conversation.
• The visual sensor data will be captured by an ORBBEC Astra camera at around 30 fps.
• There is also a microphone for tracking audio signals.
4. Implementation
22
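A rough sketch of the planned real-time loop; OpenCV's capture API stands in for the ORBBEC Astra SDK, and both helpers are hypothetical placeholders for the real feature extraction and prediction steps.

```python
import cv2

def extract_features(frame):
    """Hypothetical stand-in for the multimodal feature extraction."""
    return float(frame.mean())

def predict_intention(features):
    """Hypothetical stand-in for the HMM/GRU prediction."""
    return "HEI" if features > 127 else "NEI"

cap = cv2.VideoCapture(0)  # stand-in for the Astra RGB stream (~30 fps)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    print(predict_intention(extract_features(frame)))
cap.release()
```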
5. Results
23
Our predictions compare favourably with Foster's [4].
• Prediction accuracies on both cross-validation and the test set are higher.
• Stability also improves considerably.
24
5. Results
[Figure: per-frame predictions for Sessions 1-6; row 1: Annotation (Ground Truth), row 2: HMM, row 3: GRU]
25
5. Results
Yellow indicates No Engagement Intention
Blue indicates Has Engagement Intention
• To interact with human users, robots have to understand user intentions.
• The distance-based HMM & GRU provide better results in terms of accuracy and stability.
• The work still needs to be tested in real-time experiments.
• Behaviour generation and responses could be studied in the future.
Summary
26