This is an informal summary of our new wireheading paper.

How can we robustly make an AI do what we want it to?

A common answer is Reinforcement Learning (RL): Give the agent a reward when it does something good, and program the agent to maximise reward.

Seems straightforward. The problem is that the agent might cheat.

Consider the following example of a boat racing game. Reward is given when the boat hits targets. The faster the agent moves ahead along the track, the more reward it will get. Or so you would think. In practice, the agent cheats:

The agent gets more reward by going in circles than by following the track.

Benign as it may seem, consider a more powerful AI getting reward for saving us from catastrophes. Will it silently manufacture catastrophes without us noticing? Bostrom has even gloomier scenarios.

In the paper, we study this problem in a mathematical model. Here are the main takeaways:

1. Cheating always has the same flavour

It turns out that most cheating has a common theme:

It always involves an agent getting more reward than it should.

This means most types of cheating can be captured in the same simple mathematical model. We call it a Corrupt Reward MDP.
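
Concretely, the idea can be sketched in a few lines of Python (the names below are simplified stand-ins, not the paper's formal definition): an ordinary MDP, plus a distinction between the true reward we care about and the possibly corrupted reward the agent observes.

```python
from dataclasses import dataclass
from typing import Callable, Set

State = str
Action = str

@dataclass
class CorruptRewardMDP:
    """An ordinary MDP, plus a distinction between the true reward
    and the (possibly corrupted) reward the agent actually observes."""
    states: Set[State]
    actions: Set[Action]
    transition: Callable[[State, Action], State]   # deterministic, for simplicity
    true_reward: Callable[[State], float]          # what we want optimised
    observed_reward: Callable[[State], float]      # what the agent sees

    def is_corrupt(self, s: State) -> bool:
        # A "cheating" state is one where the observed reward misreports the true reward.
        return self.observed_reward(s) != self.true_reward(s)
```

An agent that maximises observed_reward will happily exploit corrupt states; what we actually want is high true_reward.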

2. Good intentions are not enough

A natural idea is to build RL agents that maximise true reward, and that have no interest in reward obtained by cheating.

This idea requires the agent to be able to tell whether or not it is cheating. Unfortunately, with only a reward signal to learn from, the agent has no way of telling the difference. Therefore, even well-intentioned agents trying not to cheat may end up cheating unintentionally.
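
To illustrate why (with made-up states and numbers), here are two hypothetical worlds that generate exactly the same observed rewards but disagree about which state is a cheat:

```python
def observed_reward(s):             # what the agent sees, in *both* worlds
    return {"work": 1.0, "shortcut": 5.0}[s]

def true_reward_world_A(s):         # world A: the shortcut really is valuable
    return {"work": 1.0, "shortcut": 5.0}[s]

def true_reward_world_B(s):         # world B: the shortcut is a corrupt, cheating state
    return {"work": 1.0, "shortcut": 0.0}[s]

# The agent only ever observes observed_reward, which is the same function
# in world A and world B.  No amount of reward data can tell it whether
# visiting "shortcut" counts as cheating.
```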

3. Richer value learning

RL agents learn value from a single reward signal. Alternative value learning frameworks include:

  • Inverse RL (IRL), where the agent learns from watching a human act.
  • Learning values from stories (LVFS), where the agent learns from stories (news stories, movies, …).
  • Semi-supervised RL (SSRL), where the agent receives a careful evaluation of some state from a human supervisor.

What these frameworks have in common is that the agent can learn the value of hypothetical future states, rather than just the current state. This permits cross-checking of value inferences, which greatly increases robustness to sensory error and means that the risk of unintentional cheating is much reduced.

4. IRL gets close

In IRL, some cross-checking is possible. If a cheating state \(s\) seems great from one perspective, then the agent can often double-check this by considering transitions to \(s\) from different states \(s'\), \(s''\), …

Unfortunately, this only works when there are states \(s'\), \(s''\) with “informative transitions” to \(s\). It is theoretically possible to have a cheating state \(s\) without any way for an agent to learn that \(s\) is a cheating state. Thus, unintentional cheating may still be a problem in IRL, even though it should be much less common than in standard RL.
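
As a rough sketch of what this cross-checking might look like (value_from_transition is a hypothetical stand-in for whatever value inference the IRL agent performs, and the tolerance is arbitrary):

```python
def irl_cross_check(s, predecessors, value_from_transition, tolerance=0.1):
    """Estimate the value of state s from transitions into it from several
    predecessor states, and compare the estimates."""
    estimates = [value_from_transition(p, s) for p in predecessors]
    if len(estimates) < 2:
        # No informative transitions to cross-check against: s could be a
        # cheating state without the agent ever being able to find out.
        return None, "uncheckable"
    if max(estimates) - min(estimates) > tolerance:
        # Conservative: treat disagreement between estimates as possible corruption.
        return min(estimates), "suspect"
    return sum(estimates) / len(estimates), "consistent"
```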

5. LVFS, SSRL, and combinations

Fortunately, in LVFS and SSRL all cheating states can often be discovered by the agent, and unintentional cheating can be avoided.

This is also true in combinations of RL, IRL, LVFS, and SSRL.

For example, humans tend to learn their values from a range of sources: pleasure/pain, watching others act, listening to stories, and parental supervision. These sources roughly correspond to RL, IRL, LVFS, and SSRL, respectively. One can hypothesise that a human learning from only one of these sources would be more prone to erratic behaviour and bad value judgements.
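
As a toy sketch of the combination idea (the source names and tolerance below are illustrative, not from the paper), an agent can pool value estimates from several sources and treat disagreement between them as a warning sign rather than an opportunity:

```python
def combined_value(state, sources, tolerance=0.1):
    """Combine value estimates for a state from several learning sources,
    e.g. {"rl": ..., "irl": ..., "lvfs": ..., "ssrl": ...}, where each
    entry maps a state to an estimated value."""
    estimates = {name: estimate(state) for name, estimate in sources.items()}
    spread = max(estimates.values()) - min(estimates.values())
    if spread > tolerance:
        # Be conservative where sources disagree: assume possible corruption.
        return min(estimates.values())
    return sum(estimates.values()) / len(estimates)
```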

6. What to do when rich value learning is not available?

In some applications, it is hard to provide rich enough data for all cheating states to be discoverable.

In these cases, quantilisation can still reduce an agent’s propensity to cheat. Essentially, when there are many more “legit” ways of obtaining reward than there are ways to cheat, letting the agent pick a random strategy with reward above some threshold makes it much less likely to cheat than letting it choose the strategy with the highest reward.
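
A minimal sketch of such a quantilising choice rule (the strategies, rewards, and threshold are placeholders):

```python
import random

def quantilise(strategies, observed_reward, threshold):
    """Pick uniformly at random among strategies whose observed reward
    clears the threshold, instead of always taking the maximum.
    If cheating strategies are rare among the good-enough ones, the chance
    of selecting a cheat is correspondingly small."""
    good_enough = [s for s in strategies if observed_reward(s) >= threshold]
    if not good_enough:                     # nothing clears the bar
        return max(strategies, key=observed_reward)
    return random.choice(good_enough)

# For contrast, a standard maximiser heads straight for the highest
# observed reward, corrupt or not:
def maximise(strategies, observed_reward):
    return max(strategies, key=observed_reward)
```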

A proof-of-concept is implemented in the AIXIjs framework. See the Reward Corruption example.

7. Misc

It should be noted that our model is highly simplified, and that much further research is needed before we have a satisfactory understanding of how to make artificial agents robust. A list of open problems is provided in the paper.

If you’re worried about the agent changing its goal to something easier, see the paper on Self-Modification of Policy and Utility Function.