
Biologically Inspired Cognitive Architectures (2014) 7, 87–97

Available at www.sciencedirect.com

ScienceDirect

journal homepage: www.elsevier.com/locate/bica

RESEARCH ARTICLE

Performance of heterogeneous robot teams with personality adjusted learning

Thomas Recchia *, Jae Chung, Kishore Pochiraju

Department of Mechanical Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, United States

Received 13 September 2013; accepted 26 October 2013

* Corresponding author. Tel.: +1 862 219 5193. E-mail addresses: [email protected] (T. Recchia), [email protected] (J. Chung), [email protected] (K. Pochiraju).

2212-683X/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.bica.2013.10.003

KEYWORDS
Multi-agent system; Reinforcement learning; Myers–Briggs Type Indicator; Robot teaming; Heterogeneous robot team

Abstract

This paper presents a reinforcement learning algorithm, inspired by human team dynamics, for autonomous robotic multi-agent applications. Individual agents on the team have heterogeneous capabilities and responsibilities. The learning algorithm uses strictly local credit assignment for individual agents, promoting scalability of the team size. The Personality Adjusted Learner (PAL) algorithm is applied to heterogeneous teams of robots, with reward adjustments modified from earlier work on homogeneous teams, and an information-based action personality type assignment algorithm has been incorporated. The PAL algorithm was tested in a robot combat scenario against both static and learning opponent teams. The PAL team studied included distinct commander, driver, and gunner agents for each robot. The personality preferences for each agent were varied systematically to uncover team performance sensitivities to agent personality preference assignments. The results show a significant sensitivity for the commander agent. This agent selected the robot strategy, and the better performing commander personalities were linked to team oriented actions rather than more selfish strategies. The driver and gunner agent performance remained insensitive to personality assignment. The driver and gunner actions did not apply at the strategic level, indicating that personality preferences may be important for agents responsible for learning to cooperate intentionally with teammates.
© 2013 Elsevier B.V. All rights reserved.

Introduction

As robotic agents become more capable and less expensive, there is an increasing potential for teams of robots to interact with humans and each other in order to cooperatively achieve tasks.


One way to compose a high performance team is to implement multi-agent learning for the agents, so that the agents themselves can learn the best policies to work together. Systems in which each agent is responsible for learning its own policies are termed concurrent learning systems (Panait & Luke, 2005). These systems have the benefit of open-ended scalability when the design of individual agents is not linked to the design of a number of other agents in the team. Also, when a human based psychological model for personality preferences is incorporated into the agents, this approach applies directly to the BICA Challenge to create a real-life computational equivalent of the human mind (Samsonovich, 2012). We explore the effectiveness of human-like personality preferences to change learned behaviors and improve team performance, which is envisioned to lead to improved implicit and naturally occurring cooperation with other agents and eventually with human teammates.

In cooperating heterogeneous teams, determining how the actions of individual agents influence the achievement of the team goals is quite difficult. This has been referred to as the credit assignment problem (Panait & Luke, 2005), and several approaches to solving this problem have been explored in the literature (Agogino & Tumer, 2006; Balch, 1997; Chang, Ho, & Kaelbling, 2003; Kalyanakrishnan et al., 2009; Mataric, 1994; Makar, Mahadevan, & Ghavamzadeh, 2001; Santana, Ramalho, Corruble, & Ratitch, 2004; Tangamchit, Dolan, & Khosla, 2002; Tumer, Agogino, & Wolpert, 2002; Tumer & Agogino, 2006; Wolpert & Tumer, 2001). Many of these approaches use reinforcement learning, which defines a reward scheme that enables the agents to learn cooperation. Recently, the authors (Recchia, Chung, & Pochiraju, 2013) developed a local reward scheme inspired by human teaming concepts, which was shown to promote team development in a scarce resource gathering task for a team of agents with homogeneous capabilities. This paper extends that work by adapting the algorithm for a team of agents with heterogeneous capabilities in a combat scenario, and investigating its effects on team performance.

Background

In the current investigation, a team of heterogeneous agents capable of adapting through personality adjusted reinforcement learning is studied in a combat scenario against both a static and a learning opponent team. The capability of the agents to learn to take the best actions to increase team performance is measured in order to evaluate the effect of assigning various personality preferences to the different types of agents. Because this investigation integrates ideas from various backgrounds, a brief overview of the relevant concepts is warranted.

There are five important aspects to the current investigation. The first aspect pertains to agent learning. Q-learning was selected for this study as a representative type of reinforcement learning. The second aspect is related to the solutions of the credit assignment problem for cooperative multi-agent systems. The third is the application of personality types as inspiration for the agent teaming scheme. The Myers–Briggs Type Indicator (MBTI) (Myers & Myers, 1995), which is a human psychology tool, is explored in this investigation. The fourth aspect is related to the use of an information based model (Lowen, 1982) to classify action decisions into the MBTI structure. The fifth aspect is the implementation framework for conducting performance simulations. Each of these aspects is discussed in this section.

Q-learning

One of the widely used adaptive algorithms for robot control in dynamic environments is Q-learning (Arkin, 1998; Sutton & Barto, 1998; Watkins & Dayan, 1992). In this algorithm, a robotic agent is characterized by a state vector, $\vec{x}$, and can choose to perform an action, $a_i \in \vec{a}$, where $\vec{a}$ is the set of all possible actions. The Q-function defines a value that represents the utility of the action, $a_i$, given the current state, $\vec{x}$. Normally, the problem is discretized, so the Q-function is represented by a table of values for all possible state-action combinations. This table is calculated recursively on-the-fly by the agent as it performs actions and evaluates a reward/punishment equation that depends on the agent's goal. The standard update equation for the Q-function is:

$$Q_{\mathrm{new}}(\vec{x}, a_i) = Q(\vec{x}, a_i) + \alpha \left( r + \gamma \max\left(Q(\vec{y}, \vec{a})\right) - Q(\vec{x}, a_i) \right) \qquad (1)$$

where $\alpha$ is the learning rate parameter that controls how quickly the agent learns; $r$ is the reward or punishment received; $\gamma$ is the discount factor that controls how much the agent plans for future states; and $\max(Q(\vec{y}, \vec{a}))$ is the utility of state $\vec{y}$, which results from taking action $a_i$ from state $\vec{x}$. It is the maximum value of $Q(\vec{y}, \vec{a})$ over all possible actions, $\vec{a}$. Every time an agent takes an action, one entry in its Q-function is updated to reflect the utility of taking that action from the state the agent was in at the time. In this way, the Q-function represents the current estimate of the optimal policy for the agent to follow to achieve its goal (Sutton & Barto, 1998; Watkins & Dayan, 1992).

Often the agent is trying to achieve its goal even before the Q-function has completely converged to an optimal policy. In this case it is necessary for it to decide, at any given decision point, whether to follow the current policy or to try something new. One popular algorithm to handle this is called the $\epsilon$-greedy policy. In this policy, a parameter between 0 and 1, $\epsilon$, is set by the designer. During execution, a random draw between 0 and 1 is made by the agent. If the drawn number is less than $\epsilon$, the agent tries a random action to explore its options. If it is greater than $\epsilon$, the agent exploits the action currently recommended by the Q-function. This ensures that as the number of actions taken goes to infinity, all state-action pairs are visited (Sutton & Barto, 1998).
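To make the update rule of Eq. (1) and the $\epsilon$-greedy policy concrete, a minimal tabular sketch in Java (the language of the simulation framework used later) is shown below. The class and method names are illustrative and are not taken from the paper's implementation.

```java
import java.util.Random;

/** Minimal tabular Q-learner with an epsilon-greedy policy (illustrative sketch). */
public class QLearner {
    private final double[][] q;   // Q(state, action) table, initialized to zero
    private final double alpha;   // learning rate
    private final double gamma;   // discount factor
    private final double epsilon; // exploration probability
    private final Random rng = new Random();

    public QLearner(int numStates, int numActions, double alpha, double gamma, double epsilon) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
    }

    /** Epsilon-greedy selection: explore with probability epsilon, otherwise exploit. */
    public int chooseAction(int state) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(q[state].length); // explore: random action
        }
        return argMax(q[state]);                 // exploit: best known action
    }

    /** Apply Eq. (1): Q(x,a) += alpha * (r + gamma * max_a' Q(y,a') - Q(x,a)). */
    public void update(int state, int action, double reward, int nextState) {
        double target = reward + gamma * q[nextState][argMax(q[nextState])];
        q[state][action] += alpha * (target - q[state][action]);
    }

    private static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }
}
```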

Credit assignment

Several researchers have investigated the credit assignment problem using global rewards to enable relatively small teams of agents to learn cooperative behaviors. Balch (1997) and Tangamchit et al. (2002) studied the effect of using global versus local rewards on teams of concurrently learning agents with reinforcement learning (RL). Balch investigated the emergence of teaming behavior on a globally rewarded soccer team and reported on the superior performance of the globally rewarded team versus the locally rewarded one.


[Fig. 1 The sixteen Myers–Briggs types arranged to show their relationship to each other: a 4 x 4 grid of the sixteen four-letter types, organized by Sensing versus Intuitive types, Introverts versus Extroverts, and the Thinking/Feeling and Judging/Perceptive groupings.]


Tangamchit et al. compared the performance of globally versus locally rewarded agent teams in a resource gathering task and found that the globally rewarded team performed better, through the ability to learn a more optimal and cooperative behavior than the locally rewarded team. In an investigation into a cooperative foraging task, Mataric (1994) added a third type of reward that rewarded an agent for copying the behavior of its teammates. This allowed the team of three agents to learn social rules, such as giving way to a teammate and sharing known resource locations with the teammates.

Wolpert and Tumer (2001) investigated a reward function called the Wonderful Life Utility (WLU), which calculates the agent reward by subtracting the team utility without the agent from the team utility with the agent. They reported superior performance from this algorithm on the bar selection problem, where agents represent people determining which bar to go to after work. In the selection policy, bars that are empty or too crowded have less utility than bars with a moderate attendance and higher social value to the agents. Tumer et al. (2002) demonstrated superior performance of the WLU reward for a resource gathering task. Tumer and Agogino (2006) showed superior performance of a WLU-type algorithm over local and global rewards for very large agent teams. They investigated 1000 agents choosing departure times and routes through a traffic system, with the goal of minimizing all agents' individual travel times by increasing throughput. Agogino and Tumer (2006) developed an adaptation of the WLU function, called QUICR-learning, where agents assign credit at each time step, instead of just when goals are achieved. They show superior performance of QUICR-learning over local reward Q-learning, global reward Q-learning, and WLU-learning, for large numbers of agents in a traffic congestion alleviation problem and a foraging problem.

Chang et al. (2003) presented an alternative solution to the reward credit problem for reinforcement learning agents in a team situation. They used global rewards, but each agent uses a Kalman filter to estimate its local reward based on its contribution to the team effort. Their algorithms show good results for single agents, multi-agent teams in a grid world move-to-goal scenario, and a communication network task.

Kalyanakrishnan et al. (2009) presented successful multi-agent learning of varied tasks in a soccer keep-away domain through role assignment. The agents can fill one of two roles, and each role uses a different learning algorithm to develop a policy.

Makar et al. (2001) presented a hierarchical framework for using reinforcement learning to allow cooperative collaboration between agents. They showed that in simulated results of two different foraging-style tasks, their approach trains faster and yields better performing teams than both a single-agent case and a multi-agent case that learns selfishly.

Santana et al. (2004) investigated the use of reinforcement learning in allowing both single and multi-agent teams to learn to patrol an area and minimize the average time between visits at the nodes in the network. They derived a local reward which includes the total idle time of the current node, which implies a simple form of communication: an agent must post the time it visited a node, at the time of the visit, so later agents can calculate the elapsed time between visits when they reach the same node. They presented results showing that both single and multi-agent teams can use this algorithm to learn to perform the patrol task effectively, using average node idle time as a metric.

Recchia et al. (2013) presented an algorithm termed a Personality Adjusted Learner (PAL), by which an agent is given a personality that modifies the rewards it earns in order to affect the learned behaviors of agents on a team. The PAL algorithm was tested in a team of physically homogeneous agents attempting a resource gathering task. It was shown that a given team's performance could be significantly improved over a baseline team with no personality traits, through selection of certain personality designations.

Myers–Briggs Type Indicator

The Myers–Briggs Type Indicator (MBTI) is a tool used to evaluate human personality traits or learning preferences. It was developed by Isabel Briggs Myers and Katharine Cook Briggs during World War II with the intention of helping women with no prior work experience to find fulfilling and rewarding jobs, while most of the men were fighting the war (Myers & Myers, 1995). Briggs and Myers adapted the ideas of the famous psychologist, Carl Gustav Jung, in order to generate a questionnaire-type instrument that would help people gauge their own personality type. They also conducted research into which professions were generally populated by which types and which types were naturally attracted to what kind of work (Myers & Myers, 1995). The MBTI is currently quite popular in the consulting industry, where it is used to help teach team building concepts and the value of having multiple types represented on a team (Wideman, 1998).

The MBTI is based on a four-dimensional description of the learning preferences of a person (Myers & Myers, 1995). Although the system allows for continuous variation on each of the four axes, people are generally grouped by the polarity of their rating on each axis. This leads to sixteen "personality types", as shown in Fig. 1, each described by a four letter label that indicates the preferences of that type.

The first dimension is Extrovert–Introvert, represented by an "E" or an "I" in the type descriptor. According to Myers and Myers (1995), Extroverts' actions start with the external world and their environment, while Introverts' actions start with their internal world of ideas and mental concepts. Often the Extroverts prefer environments with lots of stimuli and are comfortable reacting to external situations.


Table 1  Diagram of Lowen's four pole model mapped to Jungian concepts.

         | Concrete | Abstract
People   | Feeling  | Intuition
Things   | Sensing  | Thinking


Introverts, however, prefer to classify the external situation into an internal representation before acting. This is quite difficult in environments with lots of external stimuli, and the Introverts tend to ignore some of this external data as the only way to fit the situation into one of their internal representations. This apparent weakness is offset by their ability to bring to bear their internal understanding of the underlying ideas of the situation.

The next dimension addressed is the Sensing–Intuition dimension, represented by an "S" or an "N" in the type descriptor. According to Myers, Sensing types prefer to focus their attention on sensed factual data about their world. They are meticulous in their perceptions and prefer to rely only on hard facts. Intuitive types rely on their subconscious to relate their perceptions and are happy to perceive their world through symbols and abstractions.

The third dimension is the Thinking–Feeling dimension, represented by a "T" or "F" in the type descriptor. According to Myers, Thinking types base their judgments on logic and truth, while Feeling types base their judgment on the moral value of the judgment. Thinking type judgments work well in situations where an objective truth or rule can be defined. However, in teaming situations the moral value of judgments can be extremely important.

The last dimension is the Judging–Perceiving dimension, represented by a "J" or "P" in the type descriptor. According to Myers, Judging types prefer to make conclusions and take action based on their perceptions, even if they have incomplete information. They often have planned out their actions, sometimes well in advance. Conversely, Perceiving types understand there are many aspects to a situation and strive to perceive them all before reaching a conclusion. They will often leave things open-ended so they can react as new information is learned.

These four dimensions interact with each other to produce sixteen types, as shown in Fig. 1. Each type has its own preferred way of dealing with the world and accomplishing tasks. Often, when a team comprises multiple different types, a constructive interaction can be created that enables the team to take advantage of the strengths of each of its members. This allows the team to use the best psychological tools it has to address each individual problem it must solve or task it must complete.

Often it is useful to focus on two dimensions at a time, to develop a deeper understanding of their interactions with each other. This approach was taken by Myers and Myers (1995), and will be used in this paper to simplify the analysis without losing the ability to later extend the concepts to the other two dimensions. Because the scenario under test relates best to the Sensing–Intuition and the Thinking–Feeling dimensions, and because they are easily combined with the information flow model described in the next subsection, these dimensions were chosen for analysis.

Using information flow to determine MBTI type

Lowen (1982) develops an information based model of the human mind, and demonstrates its correlation to basic Jungian concepts. He focuses on what he calls dichotomies, ways of partitioning a mind into two opposing halves. He eventually describes what he calls a Sixteen-Pole Model, which has a strong but not perfect correlation to Myers's sixteen personality types. In his development, he starts with a Four-Pole Model based on the dichotomies labeled People–Things and Concrete–Abstract. By treating these as independent, orthogonal dimensions, he maps the four combinations of these two dimensions directly to the Jungian concepts of Feeling–Thinking and Sensing–Intuition, as depicted in Table 1.

Lowen's model has been adapted for use in this paper as a way to ascribe personality preference traits to the actions selected by each of the agents. Because his model is based on information flow, the personality trait represented by each action is deterministic and does not rely on the agent designer's interpretation of the MBTI trait designations. The exact application of this concept to the scenario under study is reserved for later discussion, following description of the system architecture.
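As a concrete illustration of how Table 1 can be applied mechanically, the sketch below maps an action's Lowen information classes to increments of four MBTI tallies. The enum, class, and method names are hypothetical; the authors' own per-action tallies are the ones listed later in Tables 3–5, and the crediting rule shown here is simply one reading of Table 1 that reproduces them.

```java
/** Illustrative mapping from Lowen's Four-Pole Model (Table 1) to MBTI tally increments. */
public class LowenMapper {
    public enum LowenClass { PEOPLE, THINGS, CONCRETE, ABSTRACT }

    /** Tallies of how strongly an action activates each MBTI dimension. */
    public static class Tally {
        public int s, n, t, f;
    }

    /**
     * Add the contribution of one Lowen class to an action's tally.
     * Table 1: People+Concrete -> Feeling, People+Abstract -> Intuition,
     *          Things+Concrete -> Sensing, Things+Abstract -> Thinking.
     * When only one pole applies, both cells of its row or column are credited;
     * this is an assumed rule that happens to reproduce the tallies of Tables 3-5.
     */
    public static void addClass(Tally tally, LowenClass c) {
        switch (c) {
            case PEOPLE:   tally.f++; tally.n++; break; // People row: Feeling, Intuition
            case THINGS:   tally.s++; tally.t++; break; // Things row: Sensing, Thinking
            case CONCRETE: tally.s++; tally.f++; break; // Concrete column: Sensing, Feeling
            case ABSTRACT: tally.n++; tally.t++; break; // Abstract column: Intuition, Thinking
        }
    }
}
```

For example, crediting both THINGS and ABSTRACT for the driver action "Advance" yields S=1, N=1, T=2, F=0, matching the corresponding row of Table 5.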

Simulation framework

The simulation framework developed is an adaptation of an open source robot simulation software called Robocode (http://robocode.sourceforge.net). The simulations are performed in this Java-based robot battle game software, which is designed to simulate battle between robot tanks, each with its own algorithms for defense and attack. A schematic of a typical battle between two robot teams is depicted in Fig. 2. This platform is sufficiently open-ended to implement our simulations, and the only constraints are that the moves must be executed within a reasonable time per turn and that the robots must abide by a limited data storage threshold. These constraints are quite loose, and many complicated algorithms have been coded successfully (Alaiba & Rotaru, 2008; Eisenstein, n.d.; Frokjaer et al., 2004; Kobayashi, Uchida, & Watanabe, 2003; Nidorf, Barone, & French, 2010; Shichel, Ziserman, & Sipper, n.d.).

The robot class provided by the Robocode framework describes a tank style robot with a base, turret, and radar that can each be rotated separately. A detailed drawing of the parts of the robot is shown in Fig. 3. The base determines the direction of movement, and can also be commanded to move forward or in reverse. Movement of the base is subject to linear accelerations and decelerations, and rotation rates are applied instantaneously.

Each section of the robot has a different maximum rotation speed, and all rotations are carried out with respect to the next-lower section. For instance, the radar turns the fastest, but it can be turned faster still if the gun is also turning in the same direction. Sensing other robots is carried out by the radar, and it is always scanning. If at any time the radar is pointed at an opponent, an event is generated that stores some information about the target, and an event listener is called so the robot can respond to the scan event.


[Fig. 2 Battle arena in the simulation showing two robot teams engaged in combat by firing bullets.]

[Fig. 3 Detail of the simulated robot showing the hull, turret, and radar.]


Similar events are generated for detecting collisions, detecting bullet hits, etc.

Each robot starts the round with a fixed number of energy points, and dies when its energy is brought to zero. Robots lose energy when colliding with each other or the walls of the arena, and when firing or being hit by bullets. The only way a robot can gain energy is for one of its bullets to hit an enemy. The robot can fire a bullet in the direction the turret is facing. The simulation of the robots is unrealistic, due to simplified physics and the fictional energy value that determines which robots die. However, the scenario is sufficiently complicated and equally applied to all robots, which allows the development and evaluation of various agent architectures.
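For readers unfamiliar with Robocode, the sketch below shows the general shape of a robot that reacts to radar scan events and fires, using what we understand to be the framework's standard Robot API (run, onScannedRobot, turnRadarRight, fire). It is a generic illustration, not the agent code developed in this paper.

```java
import robocode.Robot;
import robocode.ScannedRobotEvent;

// Generic Robocode sketch: keep the radar sweeping and fire at whatever it scans.
public class SimpleScanner extends Robot {
    @Override
    public void run() {
        while (true) {
            turnRadarRight(360); // scan events are delivered while this blocking call executes
        }
    }

    // Called by the framework whenever the radar beam crosses another robot.
    @Override
    public void onScannedRobot(ScannedRobotEvent e) {
        // Rotate the gun toward the scanned robot (bearing is relative to the body heading).
        double gunTurn = getHeading() + e.getBearing() - getGunHeading();
        turnGunRight(normalize(gunTurn));
        if (getEnergy() > 25) { // firing costs energy, so hold fire when energy is low
            fire(1.0);          // bullet power trades energy spent for damage dealt
        }
    }

    // Wrap an angle into (-180, 180] degrees so the gun takes the short way around.
    private static double normalize(double angleDegrees) {
        double a = angleDegrees % 360;
        if (a > 180) a -= 360;
        if (a <= -180) a += 360;
        return a;
    }
}
```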

Table 2  Available agent states.

State name     | State 0                          | State 1                              | State 2
LowNRG         | Robot energy > 25                | Robot energy <= 25                   | Unused
BigDamage      | Last damage taken < 10           | Last damage taken >= 10              | Unused
TargetRequests | No recent communications         | Recent non-urgent comms              | Recent urgent comms
NearnessToTeam | Mean team distance < 200         | 200 <= mean team distance <= 400     | Mean team distance > 400
NMEgrouping    | Std dev of enemy positions < 200 | Std dev of enemy positions >= 200    | Unused

Methodology

The focus of this work is to investigate the effectiveness of the Personality Adjusted Learner (PAL) algorithm when applied to teams of agents that are heterogeneous in their abilities and responsibilities. To that end, PAL agents were developed to handle the jobs of commander, driver, and gunner in a simulated battle tank, hereafter referred to as a robot. Teams of five robots were generated with various personality trait settings, and their team performance was recorded for analysis. A detailed description of the agent, robot, and team architectures follows.

Simulation architecture

The simulation environment was implemented in Java with access to software objects that represent the robot and its parts. This allowed the new agents to be easily integrated into the simulation environment and allowed the focus of the work to remain on agent development and agent interactions.

Agent architecture

All agents on a team have the same possible states, and all agents within a particular robot have exactly the same states. The states and the possible values they can take are tabulated in Table 2. Distance units in this table are in arbitrary length units of the simulation framework. The robot updates its states as part of its main loop, and the most current values are used by any of its agents when they are making an action decision.
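A sketch of how the discretization in Table 2 might be computed from raw robot quantities is given below. The thresholds are those of Table 2; the class, field, and constructor arguments are illustrative rather than the authors' names.

```java
/** Illustrative discretization of the robot's situation into the states of Table 2. */
public class RobotState {
    public final int lowNrg;          // 0: energy > 25, 1: energy <= 25
    public final int bigDamage;       // 0: last damage < 10, 1: last damage >= 10
    public final int targetRequests;  // 0: none, 1: recent non-urgent comms, 2: recent urgent comms
    public final int nearnessToTeam;  // 0: < 200, 1: 200..400, 2: > 400 (mean distance to teammates)
    public final int nmeGrouping;     // 0: enemy position std dev < 200, 1: >= 200

    public RobotState(double energy, double lastDamage, int commsLevel,
                      double meanTeamDistance, double enemyPositionStdDev) {
        this.lowNrg = energy > 25 ? 0 : 1;
        this.bigDamage = lastDamage < 10 ? 0 : 1;
        this.targetRequests = commsLevel; // already coded 0, 1, or 2
        this.nearnessToTeam = meanTeamDistance < 200 ? 0 : (meanTeamDistance <= 400 ? 1 : 2);
        this.nmeGrouping = enemyPositionStdDev < 200 ? 0 : 1;
    }

    /** Flatten the five state variables into a single index for a tabular Q-function. */
    public int toIndex() {
        return ((((lowNrg * 2 + bigDamage) * 3 + targetRequests) * 3 + nearnessToTeam) * 2) + nmeGrouping;
    }
}
```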

Each of the agents has particular actions it can take, depending on its job. For instance, the commander has to make decisions about which enemy to target, and whether to set the radar to lock on the target or to continue sweeping a 360° arc. In addition, the information into and out of these actions was accounted for and rated according to Lowen's classes of People–Things and Concrete–Abstract. For the purposes of this scenario, teammates were considered as "People", and enemies were considered as "Things". In addition, values of things that exist, such as current locations of enemies or friendlies, were considered "Concrete", and statistical or predicted information was considered "Abstract". Actions were only rated for Lowen classes that applied to the information types involved, which for the commander only included the People–Things dimension.



Table 3  Commander actions and their associations with Lowen and MBTI models.

Action                          | Information in                 | Information out                                      | Lowen class | S | N | T | F
Target enemy attacking teammate | ID of enemy attacking teammate | Set current target to ID input, radar lock on target | People      | 0 | 1 | 0 | 1
Target closest enemy            | ID of closest enemy            | Set current target to ID input, radar sweep mode     | Things      | 1 | 0 | 1 | 0
Target last enemy to shoot me   | ID of last enemy to shoot me   | Set current target to ID input, radar sweep mode     | Things      | 1 | 0 | 1 | 0
Target teammate's target        | ID of teammate's target        | Set current target to ID input, radar lock on target | People      | 0 | 1 | 0 | 1

Table 4  Gunner actions and their associations with Lowen and MBTI models.

Action              | Information in                               | Information out                                       | Lowen class | S | N | T | F
Do not fire         | None                                         | None                                                  | Concrete    | 1 | 0 | 0 | 1
Fire at target      | Current target location                      | Set fire heading at current target                    | Concrete    | 1 | 0 | 0 | 1
Fire at enemy group | Mean enemy location                          | Set fire heading at mean enemy location               | Abstract    | 0 | 1 | 1 | 0
Lead target         | Current target location and velocity vector | Set fire heading at predicted bullet intercept point  | Abstract    | 0 | 1 | 1 | 0

Table 5  Driver actions and their associations with Lowen and MBTI models.

Action      | Information in         | Information out                 | Lowen class      | S | N | T | F
Do not move | None                   | None                            | Concrete         | 1 | 0 | 0 | 1
Advance     | Mean enemy location    | Set heading toward enemy        | Things, Abstract | 1 | 1 | 2 | 0
Retreat     | Mean enemy location    | Set heading away from enemy     | Things, Abstract | 1 | 1 | 2 | 0
Spread out  | Mean friendly location | Set heading away from teammates | People, Abstract | 0 | 2 | 1 | 1
Random      | None                   | Set random heading and distance | Concrete         | 1 | 0 | 0 | 1


Using Lowen's Four-Pole Model, a tally was kept of how much the particular action would activate each of the MBTI types. These relationships were tabulated for the commander in Table 3.

The gunner also had a choice of actions, as depicted in Table 4. Of particular note is the gunner action "do not fire". It was important to include this option because firing bullets consumes robot energy, so the gunner may opt not to fire at targets that may be too difficult to hit. This action was assigned a Lowen class of "Concrete" because it can be thought of as setting the fire power to a concrete value of zero. The gunner actions were more appropriately described by Lowen's Concrete–Abstract dimension.

The driver agent had five actions to choose from, as described in Table 5. As with the gunner, the driver also has an option to not move, which was classed as "Concrete" for the same reasons as the gunner's no-fire action. However, many of the other driver actions could be described by both the People–Things dimension and the Concrete–Abstract dimension, so both classifications were used and tallied. This raised the question of whether to somehow normalize the tally so that each action would carry the same total available weight. However, it was decided to forgo normalization, as this would allow the tally to serve as a measure of how strongly the MBTI type was activated. This decision may become clearer with the discussion of the reward adjustment function below.

$$r_{\mathrm{base}} = \begin{cases} e_{\mathrm{end}} - e_{\mathrm{begin}} + 16\, h_{\mathrm{enemy}}, & \text{Commander and Gunner} \\ e_{\mathrm{end}} - e_{\mathrm{begin}} - 16\, h_{\mathrm{teammate}}, & \text{Driver} \end{cases} \qquad (2)$$

After completing an action, each agent first calculates its base reward, $r_{\mathrm{base}}$. For the commander and the gunner, the base reward is calculated according to the top line of Eq. (2), where $e_{\mathrm{end}}$ is the robot energy at the end of the action, $e_{\mathrm{begin}}$ is the robot energy at the beginning of the action, and $h_{\mathrm{enemy}}$ is the number of times the robot successfully hits an enemy with bullets during the action. The factor of 16 scales the number of hits to reflect the number of energy points earned by scoring the hit. Because this energy difference is already accounted for by subtracting the robot energies before and after the action, this in effect doubles the number of reward points earned by scoring a hit. Both the commander and the gunner have responsibilities regarding finding and damaging the enemy, so this term was added to encourage them to find and engage the enemy. The driver base reward is calculated somewhat differently, according to the bottom line of Eq. (2).


Here the energy difference is still calculated; however, the agent is penalized for the number of bullet hits on teammates. Because the robot does not understand if it has teammates between it and its target, it is the driver's responsibility to position the robot to minimize friendly fire. Again, the factor of 16 scales the number of hits on teammates to match the order of magnitude of the energy difference.

$$r_{\mathrm{adjusted}} = r_{\mathrm{base}} \left( 1 + A_S S + A_N N + A_F F + A_T T \right) \qquad (3)$$

The reward used for each agent in Eq. (1), the Q-learning algorithm, was adjusted according to Eq. (3). In this equation, $S$, $N$, $F$, and $T$ refer to the tally of activation for the action that was taken, as given by Tables 3–5. The values assigned to $A_S$, $A_N$, $A_F$, and $A_T$ define the MBTI preference of that agent. For this analysis they took on values of either 0 or 1, with the conditions that $A_S \neq A_N$ and $A_F \neq A_T$. This causes an agent to only earn bonuses to its rewards if the actions it takes activate the same dimensions as its preferences dictate. This is part of the reason normalization was forgone in favor of a tally method: to allow the agent to have a measure of how much each of its preferences was activated for a given action.
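A minimal sketch of Eqs. (2) and (3) as they might be coded is given below. The tallies s, n, t, f correspond to the S, N, T, F columns of Tables 3–5; the class and method names are illustrative, not the authors'.

```java
/** Illustrative reward computation following Eqs. (2) and (3). */
public class PalReward {
    /** Base reward for commander and gunner: energy change plus a bonus per enemy hit (Eq. 2, top). */
    public static double baseAttack(double eBegin, double eEnd, int enemyHits) {
        return (eEnd - eBegin) + 16.0 * enemyHits;
    }

    /** Base reward for the driver: energy change minus a penalty per teammate hit (Eq. 2, bottom). */
    public static double baseDriver(double eBegin, double eEnd, int teammateHits) {
        return (eEnd - eBegin) - 16.0 * teammateHits;
    }

    /**
     * Personality adjustment of Eq. (3).
     * s, n, t, f are the action's tallies from Tables 3-5; aS, aN, aT, aF are the agent's
     * preference flags (0 or 1), with aS != aN and aF != aT.
     */
    public static double adjust(double rBase, int s, int n, int t, int f,
                                int aS, int aN, int aT, int aF) {
        return rBase * (1.0 + aS * s + aN * n + aF * f + aT * t);
    }
}
```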

Robot architecture

Each robot comprised three agents: a commander, a gunner, and a driver. These three agents shared the same state values, as the state was determined at the robot level. However, the agents' actions were not synchronized, and each agent accessed the most current state when processing its learning algorithm. Each agent could have a different MBTI profile, and agents were unaware of the other agents' action selections. The only communications between agents on a robot were implied by all of them having access to the robot state information, radar tracking object, etc.

Team architecture

The teams under study comprised five robots. One of these was designated the leader, which was required by the simulation framework. Leaders start the match with 200 energy points, and the other robots start the match with 100 points. This is a constraint of the simulation framework implementation and did not have any other influence on the teams under study.

All of the robots on the team had limited communications with their teammates. In particular, robots would broadcast their current targets as low priority communications. They would also broadcast the ID of any enemy robot that did significant damage to them as a high priority communication, a sort of cry for help. These communications were received by all teammates, and factored into their respective state evaluations.
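In Robocode, team messaging of this kind is normally done through the TeamRobot class's broadcastMessage and onMessageReceived methods; the sketch below illustrates how a target broadcast and an urgent "cry for help" might be sent and received. The TargetMessage payload and the urgency flag are assumptions for illustration, not the paper's message format.

```java
import java.io.IOException;
import java.io.Serializable;
import robocode.MessageEvent;
import robocode.TeamRobot;

/** Illustrative team messaging: broadcast the current target (low priority) or a cry for help (urgent). */
public class MessagingRobot extends TeamRobot {

    /** Simple message payload: which enemy is referred to, and whether the message is urgent. */
    public static class TargetMessage implements Serializable {
        public final String enemyName;
        public final boolean urgent;
        public TargetMessage(String enemyName, boolean urgent) {
            this.enemyName = enemyName;
            this.urgent = urgent;
        }
    }

    /** Broadcast the robot's current target, or an urgent request when badly damaged. */
    public void reportTarget(String enemyName, boolean urgent) {
        try {
            broadcastMessage(new TargetMessage(enemyName, urgent));
        } catch (IOException e) {
            // Messaging failures are non-fatal; the robot simply fights on without the broadcast.
        }
    }

    /** Teammates receive broadcasts here and can fold them into their state (e.g. TargetRequests). */
    @Override
    public void onMessageReceived(MessageEvent event) {
        if (event.getMessage() instanceof TargetMessage) {
            TargetMessage msg = (TargetMessage) event.getMessage();
            // e.g. set the TargetRequests state to 2 if msg.urgent, else 1
        }
    }
}
```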

Scenarios under study

Two distinct scenarios were studied in order to evaluate the effect of varying the personality preferences of the agents. In Study I, the team under test was matched against an enemy team that did not have the capability to learn. It comprised a leader that operated a radar and communicated the location of targets to its teammates. The teammates did not have radars, and were given an additional 20 life points in compensation. In Study II, the team under test was matched against a Q-learning team of agents as described thus far; however, the MBTI preferences were set to zero to eliminate the effect of the reward adjustment.

In both studies, the team under test was varied systematically in order to evaluate the effect of varying the agent personality preferences on team performance. In order to isolate the sensitivities, each agent's preferences were varied while setting the other two agents to a "no preference" condition. For all of the cases tested, the agent preferences were held constant across the entire team. For instance, if the commander agent was set to TS, then every robot contained its own copy of a TS commander. All three agents were varied in turn, and team metrics were collected for comparison. Because there are four distinct personality profiles and three agents, there were 12 distinct cases per study; the full set of cases is enumerated in Table 6. Each case was replicated 25 times in order to support statistical analysis of the results. The metrics used for comparison are discussed in greater detail in the next section.

Metrics

In the simulation, each Battle consists of a number of rounds. The robots' energies are reset and eliminated robots are resurrected when a new round begins. However, the agents' Q-tables were retained between rounds, allowing the agents to learn over the course of an entire Battle. For these studies, Battles consisting of 1500 rounds were simulated. Each robot periodically stored a record containing the round number, elapsed time from the start of the round, and its energy level in a file. At the end of a Battle, the files were copied to a separate directory and reset for the next Battle. For each case studied, 25 Battles were run. These records were then post-processed to generate a meaningful metric for team performance.

The metric that was chosen to represent the team performance was an energy derivative value. This metric was calculated as the total energy gained during a round (negative if energy was lost) divided by the length of time the robot fought in the round. When a robot died during the round, this metric reduced to the negative of the total energy the robot started with divided by the length of time the robot lasted. When a robot survived a round, the metric reduced to its energy gain divided by the length of the round. Because in either case the metric increases with improved performance, and decreases with degraded performance, it could be applied to all robots in the same way, regardless of whether they survived the round or not. The energy derivative value was calculated for all robots on the team, and then averaged across the team to produce a team energy derivative for each round in the Battle. Because each round is started with the robots in random positions, there is significant variation in team performance from round to round. This variation was mitigated by averaging the team energy derivative across 55-round learning epochs.
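The per-robot energy derivative described above can be written compactly; a sketch is given below, with the team value taken as the mean over the robots on the team. Method names are illustrative.

```java
/** Illustrative computation of the energy derivative metric used for team comparison. */
public class EnergyDerivativeMetric {
    /**
     * Per-robot metric for one round: energy gained (negative if lost) divided by the time
     * the robot fought. For a robot that dies, energyEnd is 0, so this reduces to
     * -energyStart / survival time; for a survivor, it is the energy gain over the round length.
     */
    public static double robotMetric(double energyStart, double energyEnd, double timeFought) {
        return (energyEnd - energyStart) / timeFought;
    }

    /** Team metric for one round: the mean of the per-robot values. */
    public static double teamMetric(double[] robotMetrics) {
        double sum = 0.0;
        for (double m : robotMetrics) {
            sum += m;
        }
        return sum / robotMetrics.length;
    }
}
```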

For each case analyzed, the 25 distinct team energy derivative curves, representing the 25 replications of that case, were averaged to produce a mean team energy derivative for comparison with the other cases.


Table 6  Record of cases simulated to evaluate team performance sensitivities to agent preferences.

Case | Enemy    | Commander | Gunner | Driver | Replications
1    | Static   | TS        | –      | –      | 25
2    | Static   | TN        | –      | –      | 25
3    | Static   | FS        | –      | –      | 25
4    | Static   | FN        | –      | –      | 25
5    | Static   | –         | TS     | –      | 25
6    | Static   | –         | TN     | –      | 25
7    | Static   | –         | FS     | –      | 25
8    | Static   | –         | FN     | –      | 25
9    | Static   | –         | –      | TS     | 25
10   | Static   | –         | –      | TN     | 25
11   | Static   | –         | –      | FS     | 25
12   | Static   | –         | –      | FN     | 25
13   | Learning | TS        | –      | –      | 25
14   | Learning | TN        | –      | –      | 25
15   | Learning | FS        | –      | –      | 25
16   | Learning | FN        | –      | –      | 25
17   | Learning | –         | TS     | –      | 25
18   | Learning | –         | TN     | –      | 25
19   | Learning | –         | FS     | –      | 25
20   | Learning | –         | FN     | –      | 25
21   | Learning | –         | –      | TS     | 25
22   | Learning | –         | –      | TN     | 25
23   | Learning | –         | –      | FS     | 25
24   | Learning | –         | –      | FN     | 25


In addition, the standard error was calculated to determine the statistical significance of the results. The mean team energy curves, with error bars to represent the standard error, will be presented and discussed for the two studies in the next section.

Results and discussion

The PAL team of five robots was tested by varying the MBTI preferences of its agents systematically in two distinct scenarios. For these studies, the PAL team's agents had identical types across all robots on the team, although each robot had its own individual copy of each agent. In Study I, the opponent team did not learn over time, and in Study II, the opponent team was identical to the PAL team, except that all of its 15 agents were set to a "no preference" condition. The results from both studies are consistent with each other, giving confidence that the results are representative of the PAL team performance regardless of its opponent team.

Study I: static enemy team

Fig. 4 shows the mean energy derivative metric for PAL teams with the driver taking on four different MBTI preference designations and the gunner and commander set to "no preference". Fig. 5 shows the same metric for the four variations of the gunner, with the driver and commander set to "no preference". Both figures have error bars representing the standard error. The results from both sets of runs are similar; taken together, they show that the team performance has little dependence on the MBTI preference of the driver and gunner, when the other agents are allowed to learn with no preferences. This result is interesting, but not very useful for adjusting team performance by specifying MBTI types of its constituent agents.

Fig. 6, however, shows that there is a strong mean energy derivative dependence on the commander's type. In fact, the team performs best when the commander is an FN type, and performs much worse when the commander is a TS type. It is interesting to note that the TS commander would prefer selfish actions like targeting the nearest enemy and targeting the last enemy to damage the robot. The FN commander prefers team oriented actions like attacking an enemy that attacked a teammate or attacking the same enemy as a teammate. Also, TS corresponded to a sweeping radar and FN corresponded to a radar locked on a specific target. In addition, it is noteworthy that these two types are diametrically opposed to each other. For this scenario, the FS and TN types perform similarly to the FN type and to each other, indicating limited mean energy derivative dependence on variation between those two MBTI types.

In all three figures discussed for Study I, it is apparent that the PAL team learns to perform better over time, asymptotically approaching some "best achievable performance level". This indicates adequate performance of the reinforcement learning algorithm used in this study.

Study II: learning enemy team

In this study, the teams under test were competing against an enemy team that was able to learn using Q-learning, but without personality preferences.


[Fig. 4 Study I team performance for PAL team with variation in driver personality, showing no difference in performance level. Mean energy derivative metric (point/tic) versus round number.]

[Fig. 5 Study I team performance for PAL team with variation in gunner personality, showing no difference in performance level. Mean energy derivative metric (point/tic) versus round number.]

[Fig. 6 Study I team performance for PAL team with variation in commander personality, showing significant difference in performance level. Mean energy derivative metric (point/tic) versus round number.]

[Fig. 7 Study II team performance for PAL team with variation in driver personality, showing no difference in performance level. Mean energy derivative metric (point/tic) versus round number.]

[Fig. 8 Study II team performance for PAL team with variation in gunner personality, showing no difference in performance level. Mean energy derivative metric (point/tic) versus round number.]

[Fig. 9 Study II team performance for PAL team with variation in commander personality, showing significant difference in performance level. Mean energy derivative metric (point/tic) versus round number.]



Fig. 7 shows the mean energy derivative metric for PAL teams with the driver taking on four different MBTI preference designations and the gunner and commander set to "no preference". Fig. 8 shows the same metric for the four variations of the gunner, with the driver and commander set to "no preference". Both figures have error bars representing the standard error. The results from both sets of runs are consistent with the results from Study I. They show that the team performance has little dependence on the MBTI preference of the driver and gunner, when the other agents are allowed to learn with no preferences.

Fig. 9 again shows that there is a strong mean energy derivative dependence on the commander's type. In this more demanding learning environment, when the opponent is actively learning also, the superior performance of an FN commander is even more apparent. Again, the TS commander performs worst, with little variation between FS and TN commander led teams.

The commander agent selected the robot strategy, and it was again noted that the better performing commander personalities were linked to team oriented strategies, rather than more selfish strategies. The driver and gunner actions did not apply at the strategic level, indicating that personality preferences may be more important for agents responsible for learning to cooperate intentionally with teammates.

An interesting aspect of the three figures representing results from Study II is that the learning curve appears to be inverted. This is an artifact of the opponent team's improving performance over time. In the beginning, neither team is very good at fighting, and all the robots survive for a long time. This appears in the metric used as good performance, but it is due to the other team's poor performance rather than the PAL team's good performance at that stage. As both teams learn to fight, the survival times decrease, asymptotically approaching performance numbers that change very little. By the end of the simulation, both the PAL and opponent teams are quasi-static, so the metrics represent their relative performance without the artifacts caused by the opponent team's learning. Although omitted for clarity, metrics for the opponent team show the same inverted shape, indicating that the decreasing performance of the PAL team is not just the inverse of the opponent team learning to defeat it.

Conclusions

The Personality Adjusted Learner (PAL) algorithm (Recchia et al., 2013) was adapted for use in a team of agents that were heterogeneous in their capabilities and responsibilities. The adaptation included a reformulation of the reward adjustment equation. In addition, the PAL algorithm was reduced to two MBTI dimensions from the original four in order to improve its tractability and to facilitate the addition of an information model (Lowen, 1982) to classify actions into the normal MBTI descriptors. This information model frees the designer from assigning MBTI classifications, making the process less arbitrary and opening the possibility of using the PAL algorithm in systems capable of discovering new actions on-the-fly.

The adapted PAL algorithm was implemented in a combat robot environment, in which each robot had three distinct agents: the commander, the driver, and the gunner. Each agent was varied individually in order to study the team performance sensitivity to agent MBTI type. For each case, the other two agents on each robot were held in a "no preference" condition. In two separate studies, entailing static and learning opponent teams, a team performance sensitivity to the commander's MBTI type was discovered. At this point, team performance sensitivities due to agent personality interactions have not been studied. A study in which the commander is held at the best-performing FN type while the driver and gunner are varied at the same time is recommended as future work. This study would determine if team performance can be improved further through MBTI manipulation of the driver and gunner.

Acknowledgement

The authors gratefully acknowledge the support provided by the US Army Armament Research, Development, and Engineering Center (ARDEC) Science Fellowship Program to Mr. Recchia.

References

Agogino, A. K., & Tumer, K. (2006). QUICR-learning for multi-agent coordination. In Proceedings of the 21st national conference on artificial intelligence.

Alaiba, V., & Rotaru, A. (2008). Agent architecture for building Robocode players with SWI-Prolog. In Proceedings of the international multiconference on computer science and information technology (pp. 3–7).

Arkin, R. C. (1998). Behavior based robotics. Massachusetts Institute of Technology.

Balch, T. (1997). Learning roles: Behavioral diversity in robot teams. In 1997 AAAI workshop on multiagent learning (pp. 7–12). AAAI.

Chang, Y.-H., Ho, T., & Kaelbling, L. P. (2003). All learning is local: Multi-agent learning in global reward games. In Proceedings of neural information processing systems (NIPS-03).

Eisenstein, J. (n.d.). Evolving Robocode tank fighters. White paper dated October 28, 2003.

Frokjaer, J., Kristiansen, M. L., Malthesen, D., Suurland, R., Hansen, P. B., Larsen, I. V., & Oddershede, T. (2004). Development of a Robocode team. Technical report, Department of Computer Science, The University of Aalborg.

Kalyanakrishnan, S., & Stone, P. (2009). Learning complementary multiagent behaviors: A case study. In Proceedings of the RoboCup international symposium 2009. Graz, Austria.

Kobayashi, K., Uchida, Y., & Watanabe, K. (2003). A study of battle strategy for the Robocode. In SICE annual conference in Fukui.

Lowen, W. (1982). Dichotomies of the mind. John Wiley and Sons.

Makar, R., Mahadevan, S., & Ghavamzadeh, M. (2001). Hierarchical multiagent reinforcement learning. In AGENTS '01. Montreal, Quebec, Canada.

Mataric, M. (1994). Learning to behave socially. In From animals to animats: International conference on simulation of adaptive behavior (pp. 453–462). MIT Press.

Myers, I. B., & Myers, P. B. (1995). Gifts differing: Understanding personality type. CPP Inc.

Nidorf, D. G., Barone, L., & French, T. (2010). A comparative study of NEAT and XCS in Robocode. Technical report, IEEE.

Panait, L., & Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 3.

Recchia, T., Chung, J., & Pochiraju, K. (2013). Improving learning in robot teams through personality assignment. Biologically Inspired Cognitive Architectures, 3, 51–63.

Samsonovich, A. V. (2012). On a roadmap for the BICA challenge. Biologically Inspired Cognitive Architectures, 1, 100–107.

Santana, H., Ramalho, G., Corruble, V., & Ratitch, B. (2004). Multi-agent patrolling with reinforcement learning. In AAMAS '04: Proceedings of the third international joint conference on autonomous agents and multiagent systems. Washington, DC, USA: IEEE Computer Society.

Shichel, Y., Ziserman, E., & Sipper, M. (n.d.). GP-Robocode: Using genetic programming to evolve Robocode players. Technical report, Department of Computer Science, Ben-Gurion University, Israel.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Massachusetts Institute of Technology.

Tangamchit, P., Dolan, J. M., & Khosla, P. K. (2002). The necessity of average rewards in cooperative multirobot learning.

Tumer, K., Agogino, A. K., & Wolpert, D. H. (2002). Learning sequences of actions in collectives of autonomous agents. In Proceedings of the first international joint conference on autonomous agents and multi-agent systems (pp. 378–385). ACM Press.

Tumer, K., & Agogino, A. (2006). Agent reward shaping for alleviating traffic congestion.

Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.

Wideman, R. M. (1998). Project teamwork, personality profiles and the population at large: Do we have enough of the right kind of people? In Proceedings of the 29th annual Project Management Institute seminar/symposium, Long Beach, California, USA, 1998.

Wolpert, D., & Tumer, K. (2001). Optimal payoff functions for members of collectives. Advances in Complex Systems, 4, 2001.

