Active Perception & Mental Models
Nikolaos Mavridis
Cognitive Machines, MIT Media Lab
Today’s Menu
I. VISION
II. ACTIVE PERCEPTION
III. MENTAL MODELS
IV. FUTURE STEPS
V. CONCLUSION
I. Our Vision
To build intelligent devices that can cooperate with humans in a natural manner
And also: learn about humans!
• Key prerequisites:
– Language
– Mental Models of the world
– Multimodal Active Sensing
• Early examples:
– Ripley the robot, Elvis the lighting system, Intelligent car
General Setting: Internal model of world
[Diagram. Labels: EXTERNAL REALITY (fixed spacetime); SENSORY WORLD (imperfect, changing); ACTIVE PERCEPTION (sensory data & ACTIONS!!); MENTAL MODELS (partial descriptions); DATA; ACTIONS; example utterance: "A grey house!"]
II. Active Perception
[Block diagram; block labels as extracted:]
• CAPTURE
• SEGMENTATION (color-based)
• FACE DETECTION
• SALIENT POINT DETECTION
• OBJECT RECOGNITION
• STEREO DEPTH CALCULATION
• CAPTURE (same as left channel)
• 2D REGION PERMANENCE
• 2D FACE PERMANENCE
• VISOR: proposals for object instantiation / update / deletion
• PROPRIOCEPTOR: proposals for object instantiation / update / deletion
• IMAGINATOR: proposals for object instantiation / update / deletion (under construction)
• SPEECH RECOGNITION (under construction)
• MENTAL MODEL
Ripley’s Perceptual System
Cameras
• ELMO
• Panasonic KX-HCM280 (Pan/Tilt/Zoom)
Segmentation
• Probabilistic color-based
• Requires uniform background & objects :-(
• Replacement: Yair Ghitza’s method
Face Detection
• Paul Viola’s algorithm: cascade of classifiers, simple features
Salient Point Detection
• Koch/Itti algorithm: multiscale color/intensity/edge maps
• Bottom-up model of human attention, from neuroscience
Object Recognition
• Andre Ribeiro’s algorithm
• Robust to rotations, background…
• Andre will tell you more!
Stereo Depth Calculation
• SRI Small Vision System: stereo engine using area correlation
• Calibration & filtering!
Region/Face Permanence
• “Objecter”: 2D permanence across frames
• Hysteresis before creation/deletion
• Finds optimal across-frame correspondence,
based on color/position/size metric
• Keeps indices across frames
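The permanence idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Objecter's actual implementation: the class name `PermanenceTracker`, the region fields (`pos`, `size`, `color`), the metric weights, the gating threshold and the brute-force matching are all hypothetical choices.

```python
import math
from itertools import permutations

def region_distance(a, b, w_pos=1.0, w_size=0.5, w_color=0.5):
    """Weighted color/position/size distance (weights are illustrative)."""
    return (w_pos * math.dist(a["pos"], b["pos"])
            + w_size * abs(a["size"] - b["size"])
            + w_color * math.dist(a["color"], b["color"]))

def best_correspondence(tracked, detected, max_cost=50.0):
    """Brute-force minimum-total-distance matching: {tracked_idx: detected_idx}."""
    n, m = len(tracked), len(detected)
    if n == 0 or m == 0:
        return {}
    if n <= m:
        candidates = (list(zip(range(n), p)) for p in permutations(range(m), n))
    else:
        candidates = (list(zip(p, range(m))) for p in permutations(range(n), m))

    def total(pairs):
        return sum(region_distance(tracked[i], detected[j]) for i, j in pairs)

    best = min(candidates, key=total)
    # Gate implausible pairs so a distant detection starts a new track instead.
    return {i: j for i, j in best
            if region_distance(tracked[i], detected[j]) <= max_cost}

class PermanenceTracker:
    """Keeps stable indices across frames, with hysteresis on creation/deletion."""

    def __init__(self, create_after=3, delete_after=3):
        self.create_after = create_after  # frames seen before a region is reported
        self.delete_after = delete_after  # consecutive misses before it is dropped
        self.tracks = {}
        self.next_id = 0

    def update(self, detections):
        ids = list(self.tracks)
        matches = best_correspondence(
            [self.tracks[t]["region"] for t in ids], detections)
        for k, tid in enumerate(ids):
            track = self.tracks[tid]
            if k in matches:
                track["region"] = detections[matches[k]]
                track["hits"] += 1
                track["misses"] = 0
            else:
                track["misses"] += 1
                if track["misses"] >= self.delete_after:
                    del self.tracks[tid]
        # Unmatched detections become tentative tracks.
        for j in sorted(set(range(len(detections))) - set(matches.values())):
            self.tracks[self.next_id] = {"region": detections[j], "hits": 1, "misses": 0}
            self.next_id += 1
        return {tid: t["region"] for tid, t in self.tracks.items()
                if t["hits"] >= self.create_after}
```

Brute-force matching is only practical for a handful of regions per frame; a larger scene would need a proper assignment solver.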
Visor: Proposals for 3D object instantiation/update/deletion
• Gets state of the world from mental model
• Compares with evidence, proposes changes
• Stochastic / voxel descriptions, too…
…includes Voxeliser!
Voxeliser
• Shape estimation system using “sculpture” by multiple views (app: spatial domains)
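One common way to "sculpt" a shape from several views is silhouette carving: start from a solid block of voxels and remove every voxel that projects outside the object's silhouette in some view. The sketch below assumes three axis-aligned orthographic views on a small cubic grid; the function names and the view geometry are illustrative assumptions, not a description of how the Voxeliser itself works.

```python
# Coordinates kept when projecting along each axis (orthographic views assumed).
AXES = {"x": (1, 2), "y": (0, 2), "z": (0, 1)}

def silhouette(voxels, axis, size):
    """Project a voxel set along one axis into a size x size boolean image."""
    u, v = AXES[axis]
    image = [[False] * size for _ in range(size)]
    for voxel in voxels:
        image[voxel[u]][voxel[v]] = True
    return image

def carve(size, silhouettes):
    """Keep only the voxels that fall inside every given silhouette."""
    solid = {(x, y, z)
             for x in range(size) for y in range(size) for z in range(size)}
    for axis, image in silhouettes.items():
        u, v = AXES[axis]
        solid = {voxel for voxel in solid if image[voxel[u]][voxel[v]]}
    return solid

# Usage: recover a 2x2x2 cube centred in a 4x4x4 grid from its three silhouettes.
cube = {(x, y, z) for x in (1, 2) for y in (1, 2) for z in (1, 2)}
views = {axis: silhouette(cube, axis, 4) for axis in AXES}
estimate = carve(4, views)
```

With a single view the estimate is an extruded column; each extra view can only remove voxels, so the estimate shrinks towards the true shape (though concavities are never recovered from silhouettes alone).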
Active Perception
Bottom-up feed-through vs. on-demand active!
(also integrating bottom-up with top-down)
• Theory: visual routines, next best view etc.
• Next Action: current cost & goal-based utility…
• Two models: Resolver, Spectator
Resolver: To ask or to sense?
Planning to integrate Speech and Sensorimotor Acts (ICMI ’04)
Early motivation: Disambiguating referents
“Hand me the ball!”
Resolver
• Selects the next action: question or sensory measurement
• Probabilistic model with one-step planning: utility (goal-oriented information gain) vs. cost
• Human-like performance, double matching, 25% cost gain!
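The utility-versus-cost trade-off can be made concrete with a toy one-step planner. The sketch assumes a probability distribution over candidate referents, noise-free answers, utility measured as expected entropy reduction in bits, and costs expressed in the same units; the object table, action names and cost values are invented for illustration and are not taken from the Resolver.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(belief, outcome_of):
    """belief: {candidate: prob}; outcome_of(candidate) is the answer the action yields."""
    groups = {}
    for candidate, p in belief.items():
        groups.setdefault(outcome_of(candidate), []).append(p)
    expected_posterior = 0.0
    for ps in groups.values():
        mass = sum(ps)
        expected_posterior += mass * entropy([p / mass for p in ps])
    return entropy(belief.values()) - expected_posterior

def choose_action(belief, actions):
    """actions: {name: (outcome_of, cost)}; pick the best gain-minus-cost action."""
    def net_utility(name):
        outcome_of, cost = actions[name]
        return expected_information_gain(belief, outcome_of) - cost
    return max(actions, key=net_utility)

# Hypothetical scene: four objects, the referent is equally likely to be any of them.
objects = {
    "obj1": {"color": "red", "size": "small", "weight": 1},
    "obj2": {"color": "red", "size": "small", "weight": 2},
    "obj3": {"color": "red", "size": "large", "weight": 3},
    "obj4": {"color": "green", "size": "large", "weight": 4},
}
belief = {name: 0.25 for name in objects}
actions = {
    "ask_color": (lambda c: objects[c]["color"], 0.1),        # cheap question
    "ask_size": (lambda c: objects[c]["size"], 0.1),          # cheap question
    "measure_weight": (lambda c: objects[c]["weight"], 1.5),  # costly sensing act
}
next_action = choose_action(belief, actions)
```

Here weighing would identify the referent outright, but its assumed cost outweighs the gain, so the planner asks about size first.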
Resolver: A screenshot
• After: “The heavy one” - “Is it small? No” - measure size 1-3 - “Is it medium?”
Spectator
• Bottom-up attention guiding camera movement
(Alexander Patrikalakis (UROP) & Nikolaos Mavridis)
• Finds & tracks interesting points: zooms in, marks on map, goes on!
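A "find, mark, go on" scan can be sketched as a loop over a saliency map: visit the most salient point, suppress its neighbourhood so it is not picked again (inhibition of return), and repeat. The map values, the suppression radius and the function name below are assumptions for illustration; the Spectator's actual policy is not described on the slide.

```python
def scan_order(saliency, n_fixations, inhibit_radius=1):
    """Return the (row, col) points visited, most salient first."""
    grid = [row[:] for row in saliency]  # work on a copy: this is the "marked" map
    visited = []
    for _ in range(n_fixations):
        value, r, c = max((v, r, c)
                          for r, row in enumerate(grid)
                          for c, v in enumerate(row))
        if value <= 0:
            break  # nothing interesting left
        visited.append((r, c))
        # Mark the neighbourhood as seen so the scan moves on.
        for rr in range(max(0, r - inhibit_radius), min(len(grid), r + inhibit_radius + 1)):
            for cc in range(max(0, c - inhibit_radius), min(len(grid[0]), c + inhibit_radius + 1)):
                grid[rr][cc] = 0.0
    return visited

saliency_map = [
    [0.1, 0.2, 0.0, 0.0],
    [0.3, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.7, 0.0],
]
fixations = scan_order(saliency_map, n_fixations=3)
```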
III. Mental Models
MOTIVATION:
How are people able to think about things that are
not directly accessible to their senses at the moment?
What is required for a machine to be able to talk about things that are
out of sight, or
happened in the past, or to
view the world through somebody else’s eyes (and mind)?
What is the machinery required for the comprehension of:
“Give me the green beanbag
that was on my left!”
Mental models - why? (p.I)
Goal: Provide an intermediate representation, mediating between perception, language and action.
• In essence:
– an internalized representation of the state of the world as best known so far, in a form convenient for “hooking up” language
(shown below: the revisualisation of the rep)
– and a set of methods for updating this representation given further relevant sensory data, and predicting future states in the absence of such data
Mental models - why? (p.II)
• But also:
– A useful decomposition of a complex problem:
a practical engineering methodology with reusable components
a theoretical framework (dynamical systems)
– A unified platform for the instantiation of hypothetical scenarios:
planning (goal state descriptions)
instantiation of situations communicated through language etc
– A starting point for experimental simulations of:
Multi-agent systems, Theory of mind, Learning
Ripley’s “Internalised World” (early version: IEEE SMC)
Object Permanence & Viewpoint Switching
[Figure label: RED]
[Diagram; labels as extracted: PROTOTYPES (coding, tuition); SPACETIME; SENSORY WORLD; WORLD (COMPOUND_AGENT); AGENTs; AGENT_RELATIONs; AGENT; AGENT_RELATION; BODY (COMPOUND_OBJECT); SOUL; INTERFACE; OBJECTs; OBJECT_RELATIONs; VIEWPOINT; MOVER; MENTAL MODEL; GOALS; AFFECT]
Objects & attributes
3 Layers of Attributes: (shape, color, weight… apparent/deep)
– Stochastic - knowing how much you know!!!: for language, curiosity…
– Deterministic - maximum likelihood
– Categorical - quantized for language: “red”, learnt and context-dependent!
EXAMPLE: STOCHASTIC RADIUS AND POSITION
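The three layers can be illustrated for a single scalar attribute such as radius. The sketch assumes a Gaussian for the stochastic layer (so the maximum-likelihood value is simply the mean) and hand-set category boundaries passed in per context; the class name, fields and thresholds are hypothetical, and in the system described here the categories would be learnt rather than fixed.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class GaussianAttribute:
    """Stochastic layer: a mean plus how uncertain we are about it."""
    mean: float
    std: float

    def deterministic(self):
        """Deterministic layer: the maximum-likelihood value (the Gaussian mean)."""
        return self.mean

    def categorical(self, boundaries, labels):
        """Categorical layer: quantize with context-dependent boundaries."""
        return labels[bisect_right(boundaries, self.deterministic())]

    def is_known(self, max_std):
        """'Knowing how much you know': is the estimate precise enough to rely on?"""
        return self.std <= max_std

radius = GaussianAttribute(mean=3.2, std=0.4)
sizes = ["small", "medium", "large"]
among_marbles = radius.categorical([2.0, 5.0], sizes)
among_beach_balls = radius.categorical([4.0, 8.0], sizes)
```

The same stochastic estimate maps to different words in different contexts, while the uncertainty check could drive a clarifying question or a further measurement.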
The Architecture
[Block diagram: “MENTAL MODELS: Ripley's case”. Preliminary block diagram, Sept '03, Nikolaos Mavridis, MIT Media Lab. Block labels as extracted:]
• MENTAL MODEL & RECONCILIATOR (mental_model.exe): W[t] and F
• MODALITY-SPECIFIC INSTANTIATORS (visor.exe etc.): (W[t], S[t]) -> Ws[t]
• VIRTUAL OBJECT INSTANTIATOR (imaginer.exe): (W[t], H[t]) -> Wh[t]
• DYNAMICS PREDICTOR (predictor.exe): Wp[t]
• VISUALISER (visualiser.exe)
• SENSES: S[t]
• HYPOTHESIS GENERATION
• VISUAL FEATURE ANALYSIS
• LANGUAGE UNDERSTANDING (bishop)
• viewpoint selection
• Modality-specific processes:
– Visor
– Proprioceptor
– Imaginator
• Central processes:
– MM: Processes proposals
– Predictor
• Recent Work: Goals, Affect
• Open Questions:
– Cognitive spacetime
– Comms etc.
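The interplay between the Predictor and the modality-specific proposals can be sketched for one scalar property (say, an object's x position) as a predict/reconcile cycle. A one-dimensional Kalman-style update is assumed here purely for illustration; the function names and noise values are hypothetical, and the slides do not say which update rule the mental model actually uses.

```python
def predict(mean, var, process_var):
    """No new data: keep the estimate but become less certain about it."""
    return mean, var + process_var

def reconcile(mean, var, observation, obs_var):
    """Fuse the predicted estimate with a sensory proposal (1D Kalman update)."""
    gain = var / (var + obs_var)
    return mean + gain * (observation - mean), (1 - gain) * var

def step(mean, var, observation=None, process_var=0.5, obs_var=1.5):
    """One cycle: predict, then reconcile if a proposal arrived this frame."""
    mean, var = predict(mean, var, process_var)
    if observation is not None:
        mean, var = reconcile(mean, var, observation, obs_var)
    return mean, var

# Usage: the object is seen once, then goes out of sight for a frame.
seen = step(0.0, 1.0, observation=2.0)
unseen = step(*seen)
```

While the object is out of sight the estimate persists but its variance grows, which is one simple way to represent "as best known so far".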
Evaluating performance
• Ground truth: Flock of birds sensors
(Stephen Oney (UROP) & Nikolaos Mavridis)
• Measure systematic errors, noise, time delay, dynamics… & calibrate parameters!
IV. Future Steps
• Imaginator: Language to mental model!
• Voxeliser: Better shapes and categories
• Resolver: Full integration & active sensing
• Multiagent, Theory of Mind, Innate vs. learnt…
• Parts of soul: Affect & goal modelling
Multiagent systems
• Prerequisites:
– Action recognition across agents (not strict prereq)
– Thus, useful to start by embedding everything in virtual world wrapper, and cheating on action recognition
– Also, mixed real/virtual agents (Ripley conversing with a non-existent friend)
• Benefits:
– Systematic external examination of effects of different partial world knowledge or structure/methods of mental models (i.e. contents & form of MM), or even different sensory organs.
– For example, differing categorical boundaries and negotiated alignment (methods difference, i.e. update/prediction function)
– Prerequisite for Theory of Mind!
• First preliminary examples:
– Ashwani’s demo for viewpoint-dependent description generation (using the generic MM)
Theory of mind
• Now, each agent’s MM also contains an estimated mental model of each other agent as part of their descriptions…
• Prerequisites:
– Uncertainty
– Multi-agent models
– Action recognition across agents (strict prereq now!, +gaze)
• Benefits:
– Start playing with intention through action recognition
– Interesting coupling with inferred goals etc.
– “Mind reading” is an immense area for experimentation!
– Collaborative tasks
Innate vs. learnt
• Now that we have a clean architecture to start with, how about learning parameters or structures of the architecture, and experimenting with learned vs. innate (predesigned or evolved) tradeoffs?
• Examples:
– Learning predictive dynamics
    • Where do I expect the object to be?
    • Learning “empirical” Newtonian mechanics
– Learning senses-to-model maps
    • Which property of which object does this sensory signal inform me about, and how do its contents alter the property?
– Learning language-to-model maps (example: Deb’s thesis)
    • Which property of which object does this utterance inform me about, and how does it alter the property?
– Learning mental model structures
    • Which properties should my object descriptions contain?
    • How can I get an empirical derivation of 3D position as a crucial non-apparent property of an object?
– Concatenating parts at the input-output equivalence level
    • Forget about all the internalised fuss. Can I get an equivalent structure without postulating and enforcing the exact architecture?
• In essence:
– How arbitrary is everything that was hardcoded? Are some things redundant? Can they be learnt? If so, how?
• FINALLY, FOR ALL PREVIOUSLY STATED FUTURE PLANS:
– Relation with how humans perform (cognitive modeling) - categorical level
V. Conclusion
The Picture!
General Setting: Internal model of world
[Diagram. Labels: EXTERNAL REALITY (fixed spacetime); SENSORY WORLD (imperfect, changing); ACTIVE PERCEPTION (sensory data & ACTIONS!!); MENTAL MODELS (partial descriptions); DATA; ACTIONS; example utterance: "A grey house!"]
The ultimate goal:
Why Active Perception & Mental Models?
The ultimate goal is clear:
• Let’s make Ripley and co. more fun to interact with!
• And let’s learn more about us on the way…