Tom Everitt

Staff Research Scientist
Google DeepMind
Email: tomeveritt at


I'm a Staff Research Scientist at Google DeepMind, leading the Causal Incentives Working Group.

I'm working on AGI Safety, i.e. how we can safely build and use highly intelligent AI.
My PhD thesis Towards Safe Artificial General Intelligence is the first PhD thesis specifically devoted to this topic.
Since then, I've been building towards a theory of alignment based on Pearlian causality, summarised in this blog post sequence.


Recent papers:

Research and Publications

A full list of publications is available here and at my dblp and Google Scholar.

Below I list my papers together with some context.

Overviews and Introductions

An overview of how Pearlian causality can serve as a foundation for key AGI safety problems:

An accessible and comprehensive overview of the emerging research field of AGI safety:

A machine learning research agenda for how to build safe AGI:

The UAI/AIXI framework is a formal model of reinforcement learning in general environments. Many of my other works are based on variations of this framework:

Gridworlds make AGI safety problems very concrete:

Many AGI safety frameworks can be modeled and compared with causal influence diagrams:

Incentives of Powerful RL Agents

The focus of most of my work has been to understand the incentives of powerful AI systems.

General method. There is a general method for inferring agent incentives directly from a graphical model.

Steps towards generalizing these methods to multi-decision and multi-agent situations, and firming up the foundations:

Fairness. When is unfairness incentivised? Perhaps surprisingly, unfairness can be incentivized even when labels are completely fair:

Reward tampering. Various ideas in the AGI safety literature can be combined to form RL-like agents without significant incentives to interfere with any aspect of their reward process, be it their reward signal, their utility function, or the online training of their reward function.

If the reward signal can be (accidentally) corrupted, this paper explains why both richer feedback and randomized algorithms (quantilization) improve robustness to reward corruption.
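To make the quantilization idea concrete, here is a minimal sketch of the general technique, not the exact algorithm from the paper. It assumes a finite list of actions with a uniform base distribution; the function name `quantilize`, the toy rewards and the choice of q = 0.1 are all made up for illustration. Instead of taking the action with the highest observed reward, the agent picks at random among the top q fraction, so a single corrupted reward can only mislead it part of the time.

```python
import math
import random

def quantilize(actions, observed_reward, q, rng=random):
    """Pick uniformly at random among the top q fraction of actions,
    ranked by observed (possibly corrupted) reward."""
    ranked = sorted(actions, key=observed_reward, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])

# Toy setup (hypothetical): 100 actions, true reward a / 100.
ACTIONS = list(range(100))

def true_reward(a):
    return a / 100

def observed_reward(a):
    # Action 0 is worthless, but its reward signal is corrupted to look great.
    return 10.0 if a == 0 else true_reward(a)

# A pure maximizer is always fooled by the corrupted signal.
maximizer_choice = max(ACTIONS, key=observed_reward)

# The quantilizer only lands on the corrupted action one time in ten.
top_set = sorted(ACTIONS, key=observed_reward, reverse=True)[:10]
expected_true_reward = sum(true_reward(a) for a in top_set) / len(top_set)

print("maximizer picks", maximizer_choice, "with true reward", true_reward(maximizer_choice))
print("quantilizer expected true reward:", round(expected_true_reward, 3))
```

In this toy setup the maximizer ends up with a true reward of 0, while the quantilizer gives up a little observed reward in exchange for a much better true reward on average.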

Following up on this work, we generalize the framework of CRMDPs in the previous paper to arbitrary forms of feedback, and apply the idea of decoupled feedback to approval-directed agents in a 3D environment with integrated tampering called REALab:

Corrigibility. Different RL algorithms react differently to user intervention. The differences can be analyzed with causal influence diagrams:

Self-modification. Subtly different design choices lead to systems with or without incentives to replace their goal or utility functions:

Self-preservation and death. AIs may have an incentive not to be turned off.

There is a natural mathematical definition of death in the UAI/AIXI framework. RL agents can be suicidal:

Extending the analysis of a previous paper, we determine the exact conditions for when CIRL agents ignore a shutdown signal:

Decision theory. Strangely, robots and other agents that are part of their environment may be able to infer properties of themselves from their own actions. For example, my having petted a lot of cats in the past may be evidence that I have toxoplasmosis, a disease which makes you fond of cats. Now, if I see a cat, should I avoid petting it to reduce the risk that I have the disease? (Note that petting cats never causes toxoplasmosis.) The two standard answers for how to reason in this situation are called causal decision theory (CDT) and evidential decision theory (EDT). We show that CDT and EDT turn into three possibilities for how to reason in sequential settings where multiple actions are interleaved with observations:
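The cat example can be put into numbers. The sketch below covers only the one-shot contrast between CDT and EDT, not the sequential variants studied in the paper, and every probability and utility in it is an assumption chosen for illustration. EDT treats the action as evidence about the disease and conditions on it; CDT notes that petting has no causal effect on the disease and keeps the prior.

```python
# Illustrative numbers only; none of them come from the papers.
P_TOXO = 0.2                # prior probability of having the disease
P_PET_GIVEN_TOXO = 0.9      # infected people are fond of cats
P_PET_GIVEN_HEALTHY = 0.3   # healthy people pet cats less often
U_PET = 1.0                 # enjoyment from petting the cat
U_TOXO = -10.0              # cost of having the disease

def p_toxo_given(action):
    """Bayes' rule: probability of the disease given that I take `action`."""
    like_toxo = P_PET_GIVEN_TOXO if action == "pet" else 1 - P_PET_GIVEN_TOXO
    like_healthy = P_PET_GIVEN_HEALTHY if action == "pet" else 1 - P_PET_GIVEN_HEALTHY
    joint_toxo = like_toxo * P_TOXO
    return joint_toxo / (joint_toxo + like_healthy * (1 - P_TOXO))

def utility(action, p_toxo):
    """Expected utility of `action` when the disease has probability `p_toxo`."""
    return (U_PET if action == "pet" else 0.0) + U_TOXO * p_toxo

def edt_value(action):
    # EDT conditions on the action as evidence about the disease.
    return utility(action, p_toxo_given(action))

def cdt_value(action):
    # CDT intervenes on the action, which leaves the prior unchanged.
    return utility(action, P_TOXO)

ACTIONS = ["pet", "refrain"]
edt_choice = max(ACTIONS, key=edt_value)
cdt_choice = max(ACTIONS, key=cdt_value)
print("EDT:", edt_choice, "| CDT:", cdt_choice)
```

With these numbers EDT refrains, because petting would be bad news about the disease, while CDT pets the cat, because petting cannot change whether I am already infected.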

Other AI safety papers. An approach to solve the wireheading problem. I now believe this approach has no benefit over TI-unaware reward modeling, described in my reward tampering paper.

Reinforcement Learning

Exploration. A fundamental problem in reinforcement learning is how to explore an unknown environment effectively. Ideally, an exploration strategy should direct us to regions with potentially high reward, while not being too expensive to compute. In the following paper, we find a way to employ standard function approximation techniques to estimate the novelty of different actions, which gives state-of-the-art performance in the popular Arcade Learning Environment (Atari games) while being much cheaper to compute than most alternative strategies:

Search and Optimisation

Background. Search and optimisation are fundamental aspects of AI and of intelligence in general. Intelligence can actually be defined as optimisation ability (Legg and Hutter, Universal Intelligence: A Definition of Machine Intelligence, 2007).

(No) Free Lunch. The No Free Lunch theorems state that intelligent optimisation is impossible without knowledge about what you're trying to optimise. I argue against these theorems, and show that under a natural definition of complete uncertainty, intelligent (better-than-random) optimisation is possible. Unfortunately, I was also able to show that there are pretty strong limits on how much better intelligent search can be compared to random search.

Optimisation difficulty. In a related paper, we give a formal definition of how hard a function is to optimise:

How to search. Two of the most fundamental strategies for search are depth-first search (DFS) and breadth-first search (BFS). In DFS, you search depth-first; that is, you follow one path until its very end before trying something else. In BFS, you instead try to search as broadly as possible, focusing on breadth rather than depth. I calculate the expected search times for both methods, and derive some results on which method is preferable in which situations:
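The difference between the two strategies is easy to see on a small example. The sketch below is only an illustration and does not reproduce the expected-runtime analysis: it runs textbook DFS and BFS on a complete binary tree of depth 3 (nodes numbered 1 to 15, heap-style, an assumption made for this example) and counts how many nodes each examines before reaching the goal.

```python
from collections import deque

def dfs(root, children, is_goal):
    """Depth-first search. Returns (goal node or None, nodes examined)."""
    stack = [root]
    examined = 0
    while stack:
        node = stack.pop()
        examined += 1
        if is_goal(node):
            return node, examined
        # Push children in reverse so the leftmost child is explored first.
        stack.extend(reversed(children(node)))
    return None, examined

def bfs(root, children, is_goal):
    """Breadth-first search. Returns (goal node or None, nodes examined)."""
    queue = deque([root])
    examined = 0
    while queue:
        node = queue.popleft()
        examined += 1
        if is_goal(node):
            return node, examined
        queue.extend(children(node))
    return None, examined

def children(n):
    # Complete binary tree with nodes 1..15; nodes 8..15 are leaves.
    return [2 * n, 2 * n + 1] if n < 8 else []

for goal in (8, 3):
    _, dfs_cost = dfs(1, children, lambda n: n == goal)
    _, bfs_cost = bfs(1, children, lambda n: n == goal)
    print(f"goal {goal}: DFS examined {dfs_cost} nodes, BFS examined {bfs_cost}")
```

When the goal is the leftmost leaf (node 8), DFS examines 4 nodes and BFS examines 8. When the goal is shallow but to the right (node 3), DFS examines 9 nodes and BFS only 3. Which method is faster depends on where goals tend to be located.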


Game theory. What's the key difference between Prisoner's Dilemma, Battle of the Sexes and other "standard games"? How many interestingly different 2-player games are there?

Logic. In my Bachelor's thesis I studied logic and automated theorem proving.

Selected Talks

Other web presences

Find me on Twitter, Facebook, LinkedIn, Google Scholar, dblp, ORCID.