3rd SINO-FRANCO WORKSHOP ON MULTIMEDIA AND WEB TECHNOLOGIES


REPRESENTATION, RECOGNITION AND VISUALISATION OF HUMAN BEHAVIOURS FOR VIDEO INTERPRETATION

- 1 - 04/20/23

François BREMOND, Monique THONNAT and Thinh VU Van

ORION lab, INRIA Sophia-Antipolis, FRANCE

Plan of presentation

Part I: Video interpretation
– Global framework
– Scenario recognition

Part II: Visualisation of the interpretation
– Scene context (3D geometry)
– Human body
– Human behaviour
– Results

Conclusion


Part I: Video interpretation: Global framework

Our goal: to model the interpretation process of video sequences from pixel up to behaviour.

Main issue: current video interpretation systems are based on specific (ad hoc) routines:

– depend on sensors (camera orientation).
– dedicated to specific scenarios (detection of fighting people) and sites (metro stations).

[Diagram: Video stream -> Mobile object detection & tracking -> Scenario recognition -> Recognised scenario (e.g. “Car accident?”, “Two strangers exchanging objects?”)]

Video interpretation: Global framework

We define several entities:
– Context object: predefined static object of the scene environment (entrance zone, bench, walls, equipment, ...).
– Moving region: any intensity change between a reference image and the current image.
– Mobile object: any moving region which has been tracked and classified (person, group of persons, vehicle, noise, etc.).
– Basic action: spatio-temporal property, instantaneous, numerical, generic state and event.
– Scenario: long term, symbolic, application dependent, behaviour and activity.


Video interpretation: Global framework


[Diagram: Video stream -> Moving region detection -> Mobile object tracking -> Recognition of actions -> Recognition of scenario 1, scenario 2, ..., scenario n -> Recognised scenario. A priori knowledge available to the modules: sensors information, mobile object classes, tracked object types, context objects, descriptions of action recognition routines, scenario library.]

Definition: a priori knowledge describing:
– the sensors (cameras, optical cells and contact sensors): 3D position of the sensor, camera type (colour, resolution), field of view and calibration matrix.
– context objects: equipment (bench, trash, door), walls, interesting zones (entrance zone), areas of interest.
  • 3D geometry: 3D location of the object and its volume.
  • Semantic information: type of the object (equipment), its characteristics (yellow, fragile) and its function (seat).

Role:
– to keep the interpretation independent from the sensors and the sites.
– to provide additional knowledge to interpret up to the scenario level.
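A minimal Python sketch of how such a priori knowledge could be stored, assuming plain record types. The class and field names (Sensor, ContextObject, SceneContext) are illustrative assumptions and are not taken from the ORION system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class Sensor:
    """A sensor described independently of any recognition routine."""
    name: str
    kind: str                      # e.g. "camera", "optical cell", "contact sensor"
    position: Point3D              # 3D position of the sensor in the scene
    field_of_view_deg: float = 0.0


@dataclass
class ContextObject:
    """A static object of the scene: 3D geometry plus semantic information."""
    name: str
    location: Point3D              # 3D location of the object
    size: Point3D                  # bounding volume (width, depth, height)
    object_type: str = "equipment"
    characteristics: List[str] = field(default_factory=list)
    function: str = ""


@dataclass
class SceneContext:
    sensors: List[Sensor] = field(default_factory=list)
    objects: List[ContextObject] = field(default_factory=list)

    def objects_with_function(self, function: str) -> List[ContextObject]:
        """Return the context objects offering a given function (e.g. "seat")."""
        return [o for o in self.objects if o.function == function]


# Hypothetical scene: one camera, a bench and a door.
context = SceneContext(
    sensors=[Sensor("cam1", "camera", (0.0, 0.0, 3.0), field_of_view_deg=60.0)],
    objects=[
        ContextObject("bench", (2.0, 1.0, 0.0), (1.5, 0.5, 0.5),
                      characteristics=["yellow"], function="seat"),
        ContextObject("door", (5.0, 0.0, 0.0), (1.0, 0.1, 2.0)),
    ],
)
seats = [o.name for o in context.objects_with_function("seat")]
```

Keeping the sensors and the context objects in a separate structure is what lets a recognition routine ask "is the person near a seat?" without knowing which camera or which site it runs on.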

Video interpretation: scene context


Issues: large variety of actions and scenarios
– more or less abstract (running/fighting).
– general (standing) or sensor and application dependent (sit down).
– spatial granularity: the view observed by one camera/the whole site.
– temporal granularity: instantaneous/long term.
– 3 levels of complexity depending on the complexity of temporal relations and on the number of actors:
  • non-temporal constraint relative to one actor (being seated).
  • temporal sequence of sub-scenarios relative to one actor (open the door, go toward the chair then sit down).
  • complex temporal constraints relative to several actors (A meets B at the coffee machine then C gets up and leaves).

Video interpretation: basic actions and scenarios


We use several formalisms.

Action and scenario representation:
– n-ary tree.
– finite state automaton.
– graph.
– set of constraints.

Action and scenario recognition:
– specific routines.
– classification.
– Bayes.
– HMM.
– propagation of temporal constraints.
– constraint resolution.
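As an illustration of the finite state automaton formalism, here is a small Python sketch for a scenario that is a temporal sequence of sub-scenarios relative to one actor, such as "open the door, go toward the chair then sit down". The class name, the action labels and the choice to ignore unrelated actions are assumptions made for this sketch, not a description of the authors' implementation.

```python
from typing import Iterable, List


class SequenceAutomaton:
    """Finite state automaton accepting actions that occur in a fixed order.

    State i means "the first i expected actions have been observed".
    Actions other than the next expected one are ignored (no reset),
    which is a simplifying assumption of this sketch.
    """

    def __init__(self, expected: List[str]) -> None:
        self.expected = expected
        self.state = 0

    @property
    def recognised(self) -> bool:
        return self.state == len(self.expected)

    def feed(self, action: str) -> bool:
        """Consume one recognised action; return True once the scenario is complete."""
        if not self.recognised and action == self.expected[self.state]:
            self.state += 1
        return self.recognised


def recognise(expected: List[str], actions: Iterable[str]) -> bool:
    """Run the automaton over a stream of recognised actions."""
    automaton = SequenceAutomaton(expected)
    for action in actions:
        if automaton.feed(action):
            return True
    return automaton.recognised


# Hypothetical action labels.
scenario = ["open_door", "go_toward_chair", "sit_down"]
observed = ["open_door", "standing", "go_toward_chair", "sit_down"]
result = recognise(scenario, observed)
```

An automaton of this kind handles ordered sequences for a single actor; scenarios with several actors and richer temporal relations are the cases where constraint-based representations become more convenient.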


Example: a scenario is represented by a set of constraints.

Scenario(vandalism_against_ticket_machine,
  Actors( (p : Person), (eq : Equipment, Name = "Ticket_Machine") )
  Constraints(
    (exist ( (action s1: p move_close_to eq)
             (action s2: p stay_at eq)
             (action s3: p move_away_from eq)
             (action s4: p move_close_to eq)
             (action s5: p stay_at eq) )
           ( (s1 != s4) (s2 != s5)
             (s1 before s2) (s2 before s3)
             (s3 before s4) (s4 before s5) ) ) )
  Production( (sc : Scenario)
    ( (Name of sc := "vandalism_against_ticket_machine")
      (StartTime of sc := StartTime of s1)
      (EndTime of sc := EndTime of s5) ) ) )
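The constraint set above can be read as a search problem: find five recognised actions s1..s5 of the right kinds that are distinct and ordered in time. The Python sketch below shows one possible way to check this by backtracking. The Action record, the interpretation of "before" as "ends strictly earlier than the other starts", and the function names are assumptions of this sketch; the slides do not specify the resolution algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Action:
    name: str      # e.g. "move_close_to"
    actor: str     # identifier of the person
    target: str    # identifier of the equipment
    start: float   # start time
    end: float     # end time


# Expected kinds of s1..s5, taken from the scenario description.
PATTERN = ["move_close_to", "stay_at", "move_away_from", "move_close_to", "stay_at"]


def before(a: Action, b: Action) -> bool:
    """Assumed meaning of 'before': a ends strictly earlier than b starts."""
    return a.end < b.start


def find_vandalism(actions: Sequence[Action], person: str,
                   equipment: str) -> Optional[List[Action]]:
    """Backtracking search for s1..s5 satisfying the constraints."""
    candidates = [a for a in actions if a.actor == person and a.target == equipment]

    def extend(chosen: List[Action]) -> Optional[List[Action]]:
        if len(chosen) == len(PATTERN):
            return chosen
        wanted = PATTERN[len(chosen)]
        for a in candidates:
            if a.name != wanted or a in chosen:   # kind check, and s1 != s4, s2 != s5
                continue
            if chosen and not before(chosen[-1], a):
                continue
            found = extend(chosen + [a])
            if found is not None:
                return found
        return None

    return extend([])


def produce(instances: List[Action]) -> Dict[str, object]:
    """Production part: build the recognised scenario from s1 and s5."""
    return {"name": "vandalism_against_ticket_machine",
            "start": instances[0].start, "end": instances[-1].end}


# Hypothetical recognised actions for one person.
actions = [
    Action("move_close_to", "p1", "Ticket_Machine", 0, 2),
    Action("stay_at", "p1", "Ticket_Machine", 3, 10),
    Action("move_away_from", "p1", "Ticket_Machine", 11, 13),
    Action("move_close_to", "p1", "Ticket_Machine", 20, 22),
    Action("stay_at", "p1", "Ticket_Machine", 23, 30),
]
match = find_vandalism(actions, "p1", "Ticket_Machine")
scenario = produce(match) if match else None
```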


Video interpretation: basic actions and scenarios


Video interpretation: Part I: conclusion

Approach: a framework combining several formalisms:
– to structure the knowledge to obtain a general model.
– to have a declarative description of the knowledge.
– to make the knowledge explicit.
– to mix bottom-up and top-down processing.
– to use evaluation and learning techniques.


Part II: Visualisation of the interpretation

Development of a test platform for an AVIS (Automatic Video Interpretation System):
(a) visualisation of the scenarios recognised by an AVIS.
(b) simulation of the input of an AVIS.
(c) verification that the test platform is coherent with the AVIS.
(d) validation of the AVIS.


Visualisation of the interpretation

3 tasks of the test platform:
(1) generation of realistic 3D animations corresponding to the scenarios recognised by an interpretation system.
(2) generation of videos from 3D animations using a model of a virtual camera.
(3) generation of realistic 3D animations corresponding to the scenarios described by an expert.


[Diagram, task (1). Labels: image sequence acquired by a camera; state, event and scenario models for the recognition; scene context model; recognised scenario.]


[Diagram: scenario visualisation, showing the AVIS and the test platform. Labels, with the tasks they are marked with: image sequence acquired by a camera (1, 2); generated image sequence (2, 3); scenario described by experts (3); recognised scenario (1, 2, 3); 3D animation corresponding to the scenario. Models shown: state, event and scenario models for the recognition; human body, action, scenario and animation models for the visualisation; scene context model.]


Visualisation of the interpretation: approach

Design of the test platform based on six generic models: animation, scene context, scenarios, actions, human body and camera.

Visualisation by using GEOMVIEW.


[Figures: (1) visualisation of a scene context for a metro station; (2) example of a context object: a bench]


Visualisation of the interpretation: Scene context

Visualisation of the interpretation: Human body

Model: hierarchical and articulated.

The human body parts are built from three primitives:
(1) sphere.
(2) truncated cone.
(3) parallelepiped.


Generic model of the human body parts:
(1) the relative position of the body part in the referential of the super body part. For example, the hand is defined relatively to the arm.
(2) the angular co-ordinates of the body part in its referential.
(3) the size of the body part along its referential axis.
(4) the sub-parts and/or geometric primitives that constitute the body part.
(5) the colour of the body part.
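The five items of the generic model map naturally onto a recursive record. The Python sketch below is one possible encoding, assuming each body part stores its own parameters and a list of sub-parts; the class name, field names and example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class BodyPart:
    name: str
    relative_position: Vec3 = (0.0, 0.0, 0.0)   # (1) position in the parent's referential
    angles: Vec3 = (0.0, 0.0, 0.0)              # (2) angular co-ordinates
    size: Vec3 = (1.0, 1.0, 1.0)                # (3) size along each axis
    primitives: List[str] = field(default_factory=list)        # (4) geometric primitives
    sub_parts: List["BodyPart"] = field(default_factory=list)  # (4) sub-parts
    colour: str = "grey"                        # (5) colour

    def count_primitives(self) -> int:
        """Total number of primitives in this part and all its sub-parts."""
        return len(self.primitives) + sum(p.count_primitives() for p in self.sub_parts)

    def find(self, name: str) -> Optional["BodyPart"]:
        """Depth-first search for a part by name in the hierarchy."""
        if self.name == name:
            return self
        for part in self.sub_parts:
            found = part.find(name)
            if found is not None:
                return found
        return None


# Hypothetical fragment of a body: the hand is defined relatively to the arm.
hand = BodyPart("hand", relative_position=(0.0, 0.0, -0.3), primitives=["sphere"])
arm = BodyPart("arm", relative_position=(0.2, 0.0, 0.5),
               primitives=["truncated cone"], sub_parts=[hand])
body = BodyPart("human body", primitives=["parallelepiped"], sub_parts=[arm])
total = body.count_primitives()
```

Because each part is positioned in the referential of its parent, rotating the arm implicitly moves the hand, which is what makes the model articulated.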


Definition of 14 classes of human body parts: human body, head, arm, leg, shoulder, …

[Figures (1) to (5): different views of the human body model]


Human behaviours for interpretation systems:
– basic action:
  • state: characterises an individual at a given time.
  • event: change of states at two successive times.
– scenario: combination of actions.

Human behaviours for the test platform:
– posture: corresponds to all body parameters of an individual at a given time.
– action: change of body parameters of an individual.
– scenario: combination of actions.
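The definition of an event as a change of state between two successive times can be illustrated with a few lines of Python. The state labels and the tuple layout are assumptions made for this sketch.

```python
from typing import List, Tuple

TimedState = Tuple[float, str]         # (time, state label)
Event = Tuple[float, str, str]         # (time of change, previous state, new state)


def detect_events(states: List[TimedState]) -> List[Event]:
    """Emit one event for each pair of successive times whose states differ."""
    events: List[Event] = []
    for (_, previous), (time, current) in zip(states, states[1:]):
        if current != previous:
            events.append((time, previous, current))
    return events


# Hypothetical sequence of states for one individual.
states = [(0.0, "standing"), (1.0, "standing"), (2.0, "seated"), (3.0, "seated")]
events = detect_events(states)
```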


Visualisation of the interpretation: Human behaviour

Generic model of action:
• concerned human body part.
• initial/final positions.
• variation of rotation angles around its referential.
• global period of the action.
• list of sub actions with:
  – the concerned sub part of human body.
  – the variation of rotation angles around the sub part referential.
  – their relative period.
• visualisation speed.
• fixed part of human body on the ground.


Visualisation of the interpretation: Human behaviour: action

21 classes of actions: «walking», «running», …

[Figures: actions «walking», «running» and «hand up», shown at t = t1 and t = t2]


Visualisation of the interpretation: Human behaviour: action

• calculation of the current posture from the previous instant.
• calculation of the global position of the individual:
  – automatic recognition: use the position of the detected individual.
  – expert description: based on a fixed point on the ground.
• visualisation of geometric primitives through GEOMVIEW.
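Computing the current posture from the previous instant can be sketched as an incremental update of joint angles. The Python example below assumes a linear variation of the rotation angle over the period of the action, which is a simplification chosen for illustration; the slides do not state the interpolation law, and the names ActionModel and next_posture are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ActionModel:
    body_part: str           # concerned human body part
    angle_variation: float   # total rotation over the action, in degrees
    period: float            # global period of the action, in time units


def next_posture(posture: Dict[str, float], action: ActionModel,
                 dt: float) -> Dict[str, float]:
    """Posture at t + dt computed from the posture at t (linear variation assumed)."""
    updated = dict(posture)
    step = action.angle_variation * dt / action.period
    updated[action.body_part] = posture.get(action.body_part, 0.0) + step
    return updated


# Hypothetical «hand up» action: the arm rotates by 90 degrees in 10 time units.
hand_up = ActionModel("arm", angle_variation=90.0, period=10.0)
posture = {"arm": 0.0, "leg": 5.0}
for _ in range(10):
    posture = next_posture(posture, hand_up, dt=1.0)
```

After the full period the arm has rotated by the total variation, while body parts not concerned by the action keep their angle.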


[Figures: postures at t = 100 and t = 150, with the fixed point on the ground]

Visualisation of the interpretation: Human behaviour: visualisation of action

A scenario is a set of actions combining the individuals of the scene and the context objects which are relevant to the same activity.

Sequence of sub scenarios ordered by their period. Elementary scenario: action.


[Figures: frames at t = 80 and t = 240]

Visualisation of the interpretation: Human behaviour: scenario

[Figures: «Walking on the platform»; «Person A and person B meet at the coffee machine M»]


Visualisation of the interpretation: Human behaviour: animation

[Figures: «Pushing someone on the tracks»; «Following another person»]


Construction of models:
• human body with 25 primitives.
• 21 types of individual actions.
• 4 types of scenarios.
• 4 types of animations.

Generation of 7 types of 3D animations from descriptions.

Generation of 3D animations visualising individuals tracked by AVIS.

Checking the coherence by taking animations as input for AVIS.


Visualisation of the interpretation: Results


[Figure panels: raw video; tracked individuals; animation of tracked individuals; animation from a synthesised video]

Visualisation of the interpretation: Results: comparison of 2 animations

Six generic models:
• scene context.
• virtual camera.
• human body.
• individual actions.
• scenarios.
• animations.

A description language for modelling the knowledge of the scene.

Validation of these models on metro scenes.


Visualisation of the interpretation: Part II: Conclusion and contributions

Help the developer:
• visualisation of the results of the interpretation (multi-camera case).
• generation of test sequences (with added noise phenomena) for validating and establishing the limits of an AVIS.

Help the expert to describe new scenarios.

Define a unified platform using the same models for the interpretation and the test platform.


Part I&II: conclusion and perspectives

Visualisation of the interpretation: Scene context

A scene context is composed of 4 elements:
• zones (e.g. zone of bench) with semantic information (e.g. expected mobile objects): represented by polygons.
• walls: represented by vertical polygons.
• context objects (e.g. bench) with semantic information (e.g. function of the object, time and distance of utilization): represented by 3D geometric primitives (sphere, truncated cone, parallelepiped).
• camera information: calibration matrix containing the parameters of the virtual camera (e.g. position, direction, FOV).
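To show how a calibration matrix links the 3D scene to the generated images, the Python sketch below projects a 3D point with a 3x4 matrix in homogeneous co-ordinates. The pinhole camera model and the numeric values (focal length, principal point) are assumptions for illustration; the slides do not give the actual camera model used by the test platform.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def project(calibration: Matrix, point: Sequence[float]) -> Tuple[float, float]:
    """Project a 3D point to image co-ordinates with a 3x4 calibration matrix."""
    x, y, z = point
    homogeneous = (x, y, z, 1.0)
    u, v, w = (sum(row[i] * homogeneous[i] for i in range(4)) for row in calibration)
    if w == 0:
        raise ValueError("point cannot be projected (zero homogeneous scale)")
    return (u / w, v / w)


# Hypothetical virtual camera: focal length of 500 pixels, principal point
# at (320, 240), located at the origin and looking along the z axis.
CALIBRATION = [
    [500.0, 0.0, 320.0, 0.0],
    [0.0, 500.0, 240.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
pixel = project(CALIBRATION, (1.0, 0.5, 5.0))
```

Changing the position, direction or field of view of the virtual camera amounts to changing the entries of this matrix, so the same 3D animation can be rendered from several viewpoints.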


Issue: detection errors, non-rigid objects, occlusions, merging and splitting of trajectories.

Approach: combining different types of tracking:
- frame to frame tracker: to compute correspondences between successive mobile objects.
- individual tracker: tracking of specific individuals using time delay.
- group tracker: global tracking of groups of persons.

For example: a group of persons is defined as a set of individuals which has four characteristics:
- spatial coherence: the mobile objects are close to each other.
- size coherence: the mobile objects are bigger than a person.
- temporal coherence: the motion of mobile objects corresponds to the motion of a person.
- structure coherence: the number and the size of the mobile objects are stable.

This makes it possible to compute a reliable history of all mobile objects.
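As an illustration of what a frame to frame tracker has to compute, the Python sketch below matches mobile objects between two successive frames by greedy nearest-neighbour association of their positions. This matching rule is a stand-in chosen for simplicity and is not the method described by the authors; the function name, the distance threshold and the object identifiers are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]


def match_frames(previous: Dict[str, Point], current: List[Point],
                 max_distance: float) -> Dict[str, int]:
    """Greedy nearest-neighbour correspondences between two successive frames.

    Returns a mapping from the identifier of a previously tracked object
    to the index of the matching detection in the current frame.
    """
    pairs = sorted(
        (math.dist(position, detection), obj_id, index)
        for obj_id, position in previous.items()
        for index, detection in enumerate(current)
    )
    matches: Dict[str, int] = {}
    used: Set[int] = set()
    for distance, obj_id, index in pairs:
        if distance > max_distance:
            break                      # remaining pairs are even further apart
        if obj_id in matches or index in used:
            continue                   # each object and detection is matched once
        matches[obj_id] = index
        used.add(index)
    return matches


# Hypothetical positions in two successive frames.
previous = {"person_1": (1.0, 1.0), "person_2": (5.0, 5.0)}
current = [(5.2, 5.1), (1.1, 0.9), (9.0, 9.0)]
matches = match_frames(previous, current, max_distance=1.0)
```

Detections left unmatched, such as the third one here, would be candidates for new mobile objects. A purely local rule like this breaks down under occlusions and merged trajectories, which is the motivation for combining it with the individual and group trackers.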

Video interpretation: tracking of mobile objects


An animation combines and instantiates all previously defined scenarios:
– scene context: with the set up values (e.g. colour of the context objects).
– actors with their set up values (e.g. position).
– scenarios with the involved actors and their period of occurrence.
– virtual camera used to visualise the scene.
– visualisation speed.
