Partial Observability and Reinforcement Learning

In this post, I’m going to discuss how supervised learning can address the partial observability issue in reinforcement learning.

What is Partial Observability?

In a lot of the textbook examples of reinforcement learning, we assume that the agent, for example a robot, can perfectly observe the environment around it in order to extract relevant information about the current state. When this is the case, we say that the environment around the agent is fully observable.

However, in many cases, such as in the real world, the environment is not always fully observable. For example, there might be noisy sensors, missing information about the state, or outside interferences that prohibit an agent from being able to develop an accurate picture of the state of its surrounding environment. When this is the case, we say that the environment is partially observable.

Let us take a look at an example of partial observability using the classic cart-pole balancing task that is often found in discussions on reinforcement learning.

Below is a video demonstrating the cart-pole balancing task. The goal is to keep to keep a pole from falling over by making small adjustments to the cart support underneath the pole.

In the video above, the agent learns to keep the pole balanced for 30 minutes after 600 trials. The state of the world consists of two parts:

The pole angle
The angular velocity

However, what happens if one of those parts is missing? For example, the pole angle reading might disappear.

Also, what happens if the readings are noisy, where the pole angle and angular velocity measurements deviate significantly from the true value?

In these cases, a reinforcement learning policy that depends only on the current observation x_t(where x is the pole angle or angular velocity value and time t) will suffer in performance. This in a nutshell is the partial observability problem that is inherent in reinforcement learning techniques.

Addressing Partial Observability Using Long Short-Term Memory (LSTM) Networks

One strategy for addressing the partial observability problem (where information about the actual state of the environment is missing or noisy) is to use long short-term memory neural networks. In contrast to artificial feedforward neural networks which have a one-way flow of information from the input layer, LSTMs have feedback connections. Past information persists from run to run of the network, giving the system a “memory.” This memory can then be used to make predictions about the current state of the environment.

The details of exactly how the memory explained above is created is described in this paper written by Bram Baker of the Unit of Experimental and Theoretical Psychology at Leyden University. Baker showed that LSTM neural networks can help improve reinforcement learning policies by creating a “belief state.” This “belief state” is based on probabilities of reward, state transitions, and observations, given prior states and actions.

Thus, when the actual state (as measured by a robot’s sensor for example) is unavailable or super noisy, an agent can use belief state information generated by an LSTM to determine the appropriate action to take.