Page 1: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Presented by Alp Sardağ

Page 2: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Two Layer Architecture

The lower layer provides fast, short-horizon decisions.

The lower layer is designed to keep the robot out of trouble.

The upper layer ensures that the robot continually works toward its target task or goal.

Page 3: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Advantages

Offers reliability: the robot must be able to deal with failure of sensors and actuators, since hardware failure otherwise means mission failure.

Examples are robots operating outside direct human control, such as space-exploration robots or an office robot.

Page 4: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

The System

The system has two levels of control:

The lower level controls the actuators that move the robot around and provides a set of behaviors that can be used by the higher level of control.

The upper level, the planning system, plans a sequence of actions to move the robot from its current location to the goal.
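A minimal sketch of this two-level loop, assuming hypothetical POMDPPlanner, RLBehavior, and simulator interfaces (these names are placeholders, not the authors' code):

def run_robot(planner, behaviors, simulator, max_steps=1000):
    """Upper level selects a behavior; lower level drives the actuators."""
    belief = planner.initial_belief()
    for _ in range(max_steps):
        # Upper level: long-horizon planning over the belief state.
        behavior_name = planner.select_behavior(belief)
        behavior = behaviors[behavior_name]

        # Lower level: fast, short-horizon reactive control.
        observation = simulator.read_sensors()
        motor_command = behavior.act(observation)
        simulator.apply(motor_command)

        # Upper level incorporates the new observation into its belief.
        belief = planner.update_belief(belief, behavior_name,
                                       simulator.read_sensors())
        if planner.at_goal(belief):
            break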

Page 5: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

The Architecture

The bottom level is implemented with RL:

As an incremental learning method, RL is able to learn online.

RL can adapt to changes in the environment.

RL reduces programmer intervention.

Page 6: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

The Architecture

The higher level is a POMDP planner:

The POMDP planner operates quickly once a policy is generated.

The POMDP planner can provide the reinforcement needed by the lower-level behaviors.

Page 7: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

The Test

For the tests, the Kephera robot simulator is used.

Kephera has limited sensors and a well-defined environment.

The simulator can run much faster than real time.

The simulator does not require human intervention for low-battery conditions and sensor failures.

Page 8: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Methods for Low-Level Behaviors

Subsumption.

Learning from examples.

Behavioral cloning.

Page 9: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Methods for Low-Level Behaviors

Neural systems tend to be robust to noise and perturbation in the environment.

GeSAM is a neural-network-based robot hand control system that uses an adaptive neural network.

Neural networks often require long training periods and large amounts of data.

Page 10: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Methods for Low-Level Behaviors

RL can learn continuously.

RL provides adaptation to sensor drift and changes in the actuators.

Even in extreme cases of sensor or actuator failure, RL can adapt enough to allow the robot to accomplish its mission.

Page 11: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Planning at the Top

POMDPs deal with uncertainty.

For Kephera, with its limited sensors, determining the exact state is very difficult.

Also, the effects of the actuators may not be deterministic.

Page 12: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Planning at the Top

Some rewards are associated with the goal state.

Some rewards are associated with performing some action in a certain state.

This allows complex, compound goals to be defined.
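One way to picture this reward structure is a lookup that mixes state rewards and state-action rewards; the specific values and state numbers below are illustrative assumptions (only state 13 comes from the paper, as the goal state used later in the evaluation):

# Illustrative reward model: rewards may attach to a state or to an action
# performed in a particular state; a compound goal sums both kinds of terms.
GOAL_STATE = 13

state_rewards = {GOAL_STATE: 1.0}                # reward for reaching a state
state_action_rewards = {(5, "turn_left"): 0.5}   # hypothetical state-action reward

def reward(state, action):
    return state_rewards.get(state, 0.0) + \
           state_action_rewards.get((state, action), 0.0)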

Page 13: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Drawback

The current POMDP solution method:

Does not scale well with the size of the state space; exact solutions are only feasible for very small POMDP planning problems.

Requires that the robot be given a map, which is not always feasible.

Page 14: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

What is Gained?

By combining RL and POMDP, the system is robust to changes.

RL will learn how to use the damaged sensors and actuators.

Continuous learning has some drawbacks when using backpropagation neural networks, such as over-training.

The POMDP adapts to sensor and actuator failures by adjusting the transition probabilities.

Page 15: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

The Simulator

Pulse encoders are not used in this work.

The simulation results can be successfully transferred to a real robot.

The sensor model includes stochastic modeling of noise and responds similarly to the real sensors.

The simulation environment includes some stochastic modeling of wheel slippage and acceleration.

Hooks are added to the simulator to allow sensor failures to be simulated; effector failures are simulated in the code.

Page 16: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

RL Behaviors

Three basic behaviors: move forward, turn right, and turn left.

The robot is always moving or performing an action.

RL is responsible for dealing with obstacles and for adjusting to sensor or actuator malfunctions.

Page 17: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

RL Behaviors

The goal of each RL module is to maximize the reward given to it by the POMDP planner.

The reward is a function of how long it took to make a desired state transition.

Each behavior has its own RL module, and only one RL module can be active at a given time.

Q-learning with table lookup is used to approximate the value function; fortunately, the problem is so far small enough for table lookup.
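A minimal table-lookup Q-learning module of the kind described here might look as follows; the learning parameters and the state/action encoding are assumptions, not values reported in the paper.

import random
from collections import defaultdict

class QTableBehavior:
    """Minimal table-lookup Q-learning; one instance per low-level behavior."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)          # (state, action) -> estimated value
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        # Epsilon-greedy selection over the lookup table.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # One-step Q-learning backup against the best next-state value.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error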

Page 18: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

POMDP planning

Since robots can rarely determine their exact state from sensor observations, completely observable MDPs (COMDPs) do not work well in many real-world robot planning tasks.

It is more appropriate to maintain a probability distribution over states (a belief state) and to update it using the transition and observation probabilities.
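The update being referred to is the standard POMDP belief update: after taking action a and observing o,

b'(s') = \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid a, b)}

where the denominator is the normalizing constant that makes the new belief sum to one.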

Page 19: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Sensor Grouping

Kephera has 8 sensors that report distance values between 0 and 1024.

The observations are reduced to 16:

The sensors are grouped in pairs to make 4 pseudo-sensors.

Thresholding is applied to the output of the sensors.

The POMDP planner is then robust to single sensor failures.
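A sketch of that reduction; the pairing order, the combination rule (taking the larger reading of each pair), and the threshold value are assumptions, not details given in the slides.

def discretize_observation(sensors, threshold=512):
    """Collapse 8 distance readings (0-1024) into a 4-bit observation (0-15).

    Adjacent sensors are paired into 4 pseudo-sensors; each pseudo-sensor
    contributes one bit by thresholding. Because the larger reading of each
    pair is used, a single dead sensor is masked by its partner.
    """
    assert len(sensors) == 8
    observation = 0
    for i in range(4):
        pair_reading = max(sensors[2 * i], sensors[2 * i + 1])
        bit = 1 if pair_reading >= threshold else 0
        observation = (observation << 1) | bit
    return observation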

Page 20: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Solving a POMDP

The Witness algorithm is used to compute the optimal policy for the POMDP.

Witness does not scale well with the size of the state space.

Page 21: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Environment and State Space

There are 64 possible states for the robot: 16 discrete positions, with the robot's heading discretized into the four compass directions.

Sensor information was reduced to 4 bits by combining the sensors in pairs and thresholding.

The solution to the LP required several days on a Sun Ultra 2 workstation.
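An illustrative encoding of this 64-state space (the actual numbering used in the paper may differ):

HEADINGS = ("N", "E", "S", "W")   # heading discretized to four compass directions

def state_index(position, heading):
    """Map one of 16 discrete positions and 4 headings to a state index 0..63."""
    assert 0 <= position < 16 and heading in HEADINGS
    return position * 4 + HEADINGS.index(heading)

# Example: position 2 facing South -> state 2 * 4 + 2 = 10.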

Page 22: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Environment and State Space

Page 23: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Interface Between Layers

The POMDP uses the current belief state to select the low-level behavior to activate.

The implementation tracks the state with the highest probability: the most likely current state.

If the most likely current state changes to the state that the POMDP wants, a reward of 1 is generated; otherwise, a reward of –1 is generated.
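A sketch of this interface, assuming hypothetical planner and behavior objects (the exact reward timing in the paper also depends on how long the desired transition takes):

def interface_step(planner, behaviors, belief, prev_likely_state):
    """POMDP layer picks a behavior; the active RL module is rewarded +1/-1."""
    behavior_name, desired_state = planner.select(belief)   # uses the full belief

    # ... the chosen behavior runs and a new belief is computed ...
    new_belief = planner.update_belief(belief, behavior_name)
    likely_state = max(range(len(new_belief)), key=lambda s: new_belief[s])

    # Reward the active RL module when the most likely state changes.
    if likely_state != prev_likely_state:
        reward = 1 if likely_state == desired_state else -1
        behaviors[behavior_name].update_from_reward(reward)

    return new_belief, likely_state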

Page 24: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Hypothesis

Since RLPOMDP is adaptive, the authors expect that overall performance should degrade gracefully as sensors and actuators gradually fail.

Page 25: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Evaluation

State 13 is the goal state.

The POMDP state transition and observation probabilities were obtained by placing the robot in each of the 64 states and taking each action ten times.

With the policy in place, the RL modules are trained in the same way.

For each system configuration (RL or hand-coded), the simulation is started from every position and orientation, and performance is recorded.
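The probability estimation described above amounts to frequency counting; a sketch under an assumed simulator interface (place_robot_in_state and take_action are hypothetical names):

import numpy as np

def estimate_model(simulator, n_states=64, n_observations=16,
                   actions=("forward", "turn_left", "turn_right"), trials=10):
    """Estimate T(s'|s,a) and O(o|s',a) from repeated trials in every state."""
    T = np.zeros((n_states, len(actions), n_states))
    O = np.zeros((n_states, len(actions), n_observations))
    for s in range(n_states):
        for ai, a in enumerate(actions):
            for _ in range(trials):
                simulator.place_robot_in_state(s)        # hypothetical API
                s_next, obs = simulator.take_action(a)   # hypothetical API
                T[s, ai, s_next] += 1
                O[s_next, ai, obs] += 1
    # Normalize the counts into conditional probability distributions.
    T /= np.maximum(T.sum(axis=2, keepdims=True), 1)
    O /= np.maximum(O.sum(axis=2, keepdims=True), 1)
    return T, O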

Page 26: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Metrics

Failures during a trial evaluate reliability.

The average number of steps to the goal assesses efficiency.

Page 27: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Gradual Sensor Failure

Battery power is used up, and dust accumulates on the sensors.

Page 28: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Intermittent Actuator Failure

In this test, the right motor control signal fails intermittently.

Page 29: Integrating POMDP and RL for a Two Layer Simulated Robot Architecture

Conclusion

The RLPOMDP exhibits robust behavior in the presence of sensor and actuator degradation.

Future work addresses scaling the problem:

To overcome the scaling problem of table lookup in RL, neural nets can be used (learn/forget cycle).

To increase the size of the state space for the POMDP, non-optimal solution algorithms are being investigated.

New behaviors will be added.

