PhD Presentation: Biologically-Inspired Models for Learning Agents
web.ist.utl.pt/~pedro.sequeira/phd
Introduction
Motivation
Case Studies
Conclusions
Introduction
Prof. Francisco Melo as thesis co-supervisor
CAT in mid-July
Objectives
General problem
General solution
Focus on case studies / experiments
Main idea
Provide learning models to autonomous agents
Inspired by biological models
Motivation
Definitions [Franklin & Graesser, 1997; Maes, 1994]
situated in dynamic environments
have and actively pursue goals
satisfy their needs
respond to external events from the environment
MAS: live and interact with other agents
Requirements [Franklin & Graesser, 1997; Maes, 1994]
mechanisms to distinguish perceived features
focus on relevant features, ignore non-important ones
adapt to and learn new knowledge from the environment
take the right action at each decision time
structures that represent the acquired knowledge
update representations over time to reflect experience
Building Agents
Key is ADAPTATION
Provide prior knowledge
sufficient for the agent to perceive its environment
Use learning mechanisms
update the agent’s knowledge
Problems
Prior knowledge
lots of pre-programming of behaviors
large knowledge bases
Perceptual limitations
hard to identify the world dynamics and good states
Acting limitations
hard to identify good actions
Learning
which paradigm / framework to use?
Parallel between natural and artificial agents
Inhabit highly dynamic environments
Have to make complex decisions under uncertainty
Limited perceptual and acting capabilities
Focus on important events
Live in organized societies
Inspiration from biological models
Evolutionary adaptive mechanisms
Simple but powerful survival tools
Improve performance with experience
Make the most of the perceived information
Lead to a greater fitness
Inspiration from several research areas
Psychology
Biology
Ethology
Neuroscience
…
Classical conditioning in RL
Improve learning speed
State-space reduction
Emotion-based Intrinsically Motivated RL (IMRL)
Single-agent event-processing mechanism
Use emotions as intrinsic rewards
Clues from agent-environment relationship
Improve agent fitness
Socially-aware IMRL
Multi-agent social processing mechanism
Use affiliation / cooperation
Improve population fitness
Case Studies
Inspired by animal learning
Teach an animal to respond in a certain way
Provide reward and punishment appropriately
Main Ideas [Sutton & Barto, 1998]
Learn from experience
Situations + Actions → Reward
Reward is an external feedback signal
Objective: maximize the reward received over time
Task: discover which actions maximize reward in each state
Trial-and-error search
Mind subsequent (delayed) rewards
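As a concrete baseline, the sketch below shows the core of tabular Q-learning [Watkins, 1989], the algorithm the first case study extends; parameter values and state/action encodings are illustrative, not taken from the experiments.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.3, 0.9, 0.1  # illustrative values, not from the experiments
Q = defaultdict(float)                 # (state, action) -> estimated value

def choose_action(state, actions):
    """Epsilon-greedy selection: trial-and-error search biased toward known-good actions."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions):
    """One-step Q-learning update; the discounted max term minds delayed rewards."""
    target = reward + GAMMA * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```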
Main Idea
Inspiration from the classical conditioning paradigm
Partition observations into stimuli
Propose a measure for the distance between states
Learn the value of states based on proximate states
Propagated learning
Reduce the state space
Reduce learning time
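A minimal sketch of the observation partition, assuming each observation decomposes into a set of discrete stimuli as in the examples on the next slides; the frozenset encoding is an assumption for illustration.

```python
# Assumed encoding: an observation is a set of discrete stimuli; frozensets are
# hashable, so states can directly index Q-values and pattern counts.
def observation_to_state(active_stimuli):
    return frozenset(active_stimuli)

state = observation_to_state(["see bone", "hear Fetch!"])  # stimuli names from the talk
```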
Classical Conditioning [Pavlov, 1927]
Advantages
Contingency between stimuli in the environment
Independent of the animal's behavior
Animal does not learn behavior consequences
Predict the outcomes of new events from already-known situations
Create new contexts for behavior activation
[Diagram: Pavlovian conditioning. Before training: US (food delivery) → UR (salivation). After training: CS (bell) → CR (conditioned salivation).]
Model Based on Sensory Pattern Mining [Sequeira & Antunes, 2010]
Partition observations into stimuli
e.g. see bone, has ball, hear “Fetch!”
Build tree containing frequent patterns
Model
Use the Jaccard index [Jaccard, 1912]: the frequency of the intersection of two stimuli over the frequency of their union, J(s1, s2) = freq(s1 ∧ s2) / freq(s1 ∨ s2)
Advantages
Sensitive to particular correlations between stimuli
Rapid access to frequent patterns
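A sketch of the Jaccard computation over mined pattern frequencies; the flat dictionary below stands in for the frequent-pattern tree, and all counts are made up for illustration.

```python
# Flat dict of mined pattern frequencies; in the actual model a pattern tree
# gives rapid access to these counts. All numbers here are illustrative.
pattern_freq = {
    frozenset({"bell"}): 40,
    frozenset({"food"}): 35,
    frozenset({"bell", "food"}): 30,
}

def jaccard(s1, s2, freq):
    """Jaccard index: frequency of co-occurrence over frequency of either occurring."""
    both = freq.get(frozenset({s1, s2}), 0)
    union = freq.get(frozenset({s1}), 0) + freq.get(frozenset({s2}), 0) - both
    return both / union if union else 0.0

print(jaccard("bell", "food", pattern_freq))  # 30 / (40 + 35 - 30) ≈ 0.67
```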
Learning Model
Extend the Q-learning algorithm [Watkins, 1989]
Determine similar states using the pattern tree
State distance measure derived from the Jaccard index over the states' stimuli
Propagated multi-state update of values
New state receives information from similar states
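The slide does not give the exact update rule, so the following is only a sketch of the propagated multi-state update under an assumed form: a newly encountered state bootstraps its values from similar states, weighted by a similarity derived from the distance measure.

```python
# Assumed form of the propagated update (the exact rule is not on the slide):
# a new state's Q-values start as a similarity-weighted average over known states.
def propagate_values(new_state, known_states, actions, Q, similarity):
    weighted = [(s, similarity(new_state, s)) for s in known_states]
    weighted = [(s, w) for s, w in weighted if w > 0.0]
    total = sum(w for _, w in weighted)
    if total == 0.0:
        return
    for a in actions:
        Q[(new_state, a)] = sum(w * Q[(s, a)] for s, w in weighted) / total
```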
Experiment
Inspired by animal training
stimuli: 3 visual, 2 tactile, 2 auditory
actions: Pick, Drop, Eat, Approach Trainer, Approach Ball
4 phases: acquisition, extinction, association, substitution
Objectives
form associations between co-occurring stimuli
evoke innate responses with new stimuli
discover new contexts for already-known responses
Main results
Faster initial learning
Secondary conditioning (e.g. “Fetch!” heard in more cells)
New contexts for actions (e.g. Eat when bone is present)
Main Ideas [Singh et al., 2010]
Reward behaviors rather than consequences
Agent receives augmented reward
extrinsic reward
the "normal" reward in RL, related to the task (e.g. fulfillment of needs)
intrinsic reward
does not directly relate to the task (e.g. play or exploration)
Objective: maximize total reward
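In code, the augmented reward is just the sum of both signals; the linear weighting below anticipates the reward-feature model described next, and the numbers are illustrative.

```python
# Augmented reward in IMRL: extrinsic (task) reward plus weighted intrinsic features.
# The linear form matches the reward-feature model described next; values illustrative.
def total_reward(r_ext, features, weights):
    r_int = sum(w * f for w, f in zip(weights, features))
    return r_ext + r_int

print(total_reward(r_ext=1.0, features=[0.2, 0.7], weights=[0.5, 0.3]))  # 1.31
```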
Main Idea
Inspiration from emotional appraisal mechanisms
Mathematical adaptation of appraisal dimensions
Emotions as intrinsic rewards
Integrate with the IMRL framework
Provide clues about the agent-environment relationship
Enhance single-agent fitness
Emotions [Dawkins, 2000; Cardinal et al., 2002]
Evolutionary adaptive mechanism
Combined with learning, signal advantageous and dangerous situations
help when seeking food and avoiding harm
Bias decision making [Naqvi et al., 2006]
maximizing reward and minimizing punishment
In humans [Phelps & LeDoux, 2005]
memory enhancement, sensory plasticity, attention facilitation,
regulation of social behavior, regulation and inhibition of
emotional responses
Appraisal theories of emotion [Ellsworth & Scherer, 2003; Leventhal & Scherer, 1987]
Emotions arise from evaluations
Characterize subject-environment relationship
Significance for the person’s well-being or goals
Appraisal dimensions
each dimension evaluates a specific aspect
Model of emotions in IMRL
Inspired by appraisal theories of emotions
Adopt four common appraisal dimensions
novelty, motivation, valence, control
each evaluates the agent-environment relationship
a numerical value represents the dimension's activation
Use dimension adaptations as reward features
each feature is a component of the intrinsic reward
Affective reward features
Adaptation from the Major Dimensions of Appraisal [Ellsworth & Scherer, 2003]
intentionally did not adapt the social dimensions
Problem: appraisal theories usually deal with high-level psychological processes and complex concepts (e.g. causal attribution, norms)
Solution: inspiration from the Multilevel Process Theory of Emotion [Leventhal & Scherer, 1987]
appraise events at different levels
emotions range from reflex-like responses to complex cognitive patterns
Evaluate aspects of the agent's history of interaction
Affective reward features
Novelty: degree of familiarity of events
Valence: innate pleasure detector, learned preferences
Motivation: relevance of event for goals or needs
Control: degree of correctness of the world-model
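Purely illustrative sketches of how such features might be computed from the agent's history of interaction; the exact definitions are in the underlying work, and every proxy below (visit counts, value estimates, average rewards, prediction matches) is an assumption.

```python
# Illustrative proxies only; the actual feature definitions are in the papers.
def novelty(state, visit_counts):
    """Rarely visited states look less familiar, hence more novel."""
    return 1.0 / (1.0 + visit_counts.get(state, 0))

def motivation(state, value_estimates):
    """Relevance of the state for goals/needs, proxied by its learned value."""
    return value_estimates.get(state, 0.0)

def valence(state, avg_reward):
    """Innate/learned preference, proxied by the average reward observed in the state."""
    return avg_reward.get(state, 0.0)

def control(state, action, observed_next, model):
    """Correctness of the world-model: 1 when its prediction matched the outcome."""
    return 1.0 if model.get((state, action)) == observed_next else 0.0
```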
Experiments
Grid-world scenarios inspired by foraging environments
the agent is a predator that tries to eat prey in the environment
observations: cell position, see prey
actions: N, S, E, W, Eat
Dyna-Q / prioritized sweeping algorithm [Moore & Atkeson, 1993]
Objectives
maximize the agent's fitness (extrinsic reward)
optimize the feature weight vector
Exploration scenario
One prey
Eat prey: rext = 1
Non-Markovian
Results
optimal weight vector
optimal fitness: 1,902.2
"only extrinsic" fitness: 135.9
Persistence scenario
Two preys: rabbit and hare
Eat prey: rabbit rext = 0.1; hare rext = 1
Fence: crossing it takes n North actions; the next time, n + 1
Non-Markovian
Results
optimal weight vector
optimal fitness: 1,020.8
"only extrinsic" fitness: 25.4
Prey-season scenario
Two preys: rabbit and hare
Eat prey: rabbit rext = 0.1; hare rext = 1
Two seasons (rabbit and hare), 10,000 steps
if 10 rabbits are eaten, rext = -1
Non-Markovian
Results
optimal weight vector
optimal fitness: 5,203.5
"only extrinsic" fitness: 334.2
Different-rewards scenario
Two preys: rabbit and hare, always available
Eat prey: rabbit rext = 0.1; hare rext = 1
Markovian
Results
optimal weight vector
optimal fitness: 87,925.7
"only extrinsic" fitness: 87,890.8
Conclusions
Intrinsic reward features based on emotional appraisal
Guide the agent during learning
Focus on specific aspects of the environment
Balance between different strategies
Bring attention to advantageous states
Ignore less favorable states
Main Idea
Integrate with IMRL framework
Multi-agent scenarios
Inspiration from affiliation and altruism
Mathematical adaptation of social signals
Emergence of socially-aware behaviors
Raise the fitness of the population
and even the fitness of each individual agent
Affiliation [Dörner, 1999; Bach, 2009]
Urge to affiliate / interact with other agents
Send and receive legitimacy signals
reward socially-acceptable behaviors (l-signals)
punish unsuccessful interactions (anti l-signals)
internally reward or punish socially-aware behaviors (internal l-signals)
Altruism [de Waal, 2008]
Intrinsic reward when benefit for the social group
initial cost but subsequent compensation
Model for socially-motivated learning
Fitness measured at the population level: the sum of all agents' extrinsic rewards over time
Intrinsic reward: two social features
external signal: received from other agents, based on l-signals
internal signal: generated by the agent, based on internal l-signals
both represent the level of satisfaction of the affiliation need
Total reward: extrinsic reward plus the weighted social features
Social features for limited resource scenarios
Extrinsic reward
rext = IsFull – 0.1 IsHungry
External reward feature
rsExt = LastToEat AND SeeFood AND SeeOther AND !Eat
Internal reward feature
rsInt = LastToEat AND SeeFood AND !Eat
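The boolean features above translate directly into code; this sketch assumes each predicate is available as a boolean in the agent's observation, as listed on the Experiments slide that follows.

```python
# Direct translation of the features above; each predicate is assumed to be a
# boolean in the agent's observation (SeeFood, SeeOther, LastToEat, ...).
def extrinsic_reward(is_full, is_hungry):
    return float(is_full) - 0.1 * float(is_hungry)

def external_feature(last_to_eat, see_food, see_other, ate):
    # Fires when the agent that ate last forgoes visible food in front of another agent.
    return float(last_to_eat and see_food and see_other and not ate)

def internal_feature(last_to_eat, see_food, ate):
    # Fires when the agent that ate last forgoes visible food, with no witness required.
    return float(last_to_eat and see_food and not ate)
```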
Experiments
Grid-world scenarios inspired by foraging environments
two predator agents
observations: position, SeeFood, SeeOther, LastToEat, IsHungry
actions: N, S, E, W, Eat
rewards: reat = 1, rhungry = -0.1
agents become hungry after 30 timesteps
Objectives
maximize the population fitness (sum of extrinsic rewards)
optimize feature weight vector
Single-food scenario
One food resource
The agent that eats starts closer to the food resource (bottom-right)
Results
optimal weight vector
optimal fitness: 3,249.2
"only extrinsic" fitness: -19,991.3
Equal-resource scenario
Two food resources
The agent that eats starts bottom-right
Possibility of both agents eating
Results
optimal weight vector
optimal fitness: 18,178.8
"only extrinsic" fitness: -2,296.9
Stronger-agent scenario
One food resource
Both agents start bottom-right
One agent is stronger
When both try to eat, only one succeeds
Results
optimal weight vector
optimal fitness: 2,656.1
"only extrinsic" fitness: -1,164.9
Conclusions
Biologically-inspired learning models
Provide built-in prior knowledge
Learning framework based on RL and IMRL
Rewards based on the agent-environment relationship
Results
Speed up learning
State-space reduction
Intrinsic features provide clues on important aspects of the environment
Lead to different strategies
Not directly related to fitness, but increase it
Lead to "socially-aware" behaviors
Future Work
Improve the classical conditioning model
Support more learning paradigms
Improve the multi-agent model
Inspiration from cooperation
Evolutionary Game Theory
CAT…