Hierarchical Actions and Reinforcement Learning

One of the issues of reinforcement learning is how it handles hierarchical actions.

What are Hierarchical Actions?

In order to explain hierarchical actions, let us take a look at a real-world example. Consider the task of baking a sweet potato pie. The high-level action of making a sweet potato pie can be broken down into numerous low-level sub steps: cut the sweet potatoes, cook the sweet potatoes, add sugar, add flour, etc.

You will also notice that each of the low-level sub steps mentioned above can further be broken down into even further steps. For example, the task of cutting a sweet potato can be broken down into the following steps: move right arm to the right, orient right arm above the pie, bring arm down, etc.

Each of those sub steps of sub steps can then be further broken down into even smaller steps. For example, “moving right arm to the right” might involve thousands of different muscle contractions. Can you see where we are going here?

Reinforcement learning involves training a software agent to learn by experience through trial and error. A basic reinforcement learning algorithm would need to do a search over thousands of low-level actions in order to execute the task of making a sweet potato pie. Thus, reinforcement learning methods would quickly get inefficient for tasks that require a large number of low-level actions.

How to Solve the Hierarchical Action Problem

One way to solve the hierarchical action problem is to represent a high-level behavior (e.g. making a sweet potato pie) as a small sequence of high-level actions.

For example, where the solution of making a sweet potato pie might entail 1000 low-level actions, we might condense these actions into 10 high-level actions. We could then have a single master policy that switches between each of the 10 sub-policies (one for each action) every N timesteps. The algorithm explained here is known as meta-learning shared hierarchies and is explained in more detail at OpenAi.com.

We could also integrate supervised learning techniques such as ID3 decision trees. Each sub-policy would be represented as a decision tree where the appropriate action taken is the output of the tree. The input would be a transformed version of the state and reward that was received from the environment. In essence, you would have decisions taken within decisions.