Tom Everitt

Research Scientist
Email: tomeveritt at


I'm a research scientist at DeepMind.

I'm working on AGI Safety, i.e. how we can safely build and use highly intelligent AI.
My PhD thesis Towards Safe Artificial General Intelligence is the first PhD thesis specifically devoted to this topic.
It was supervised by Marcus Hutter at the Australian National University.



Research and Publications

A full list of publications is available here and at my dblp and Google Scholar.

Below I list my papers together with some context. Many of them also appear in slightly different forms in my thesis.

Overviews and Introductions

An accessible and comprehensive overview of the emerging research field of AGI safety:

A machine learning research agenda for how to build safe AGI:

The UAI/AIXI framework is a formal model of reinforcement learning in general environments. Many of my other works are based on variations of this framework:

Gridworlds make AGI safety problems very concrete:

Many AGI safety frameworks can be modeled and compared with causal influence diagrams:
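At its core, a causal influence diagram is just a directed acyclic graph whose nodes are labelled as chance, decision, or utility nodes. A minimal sketch in plain Python (illustrative only, not any particular CID library):

```python
# A toy causal influence diagram as an adjacency mapping:
# chance node S (state), decision node D, utility node U.
edges = {
    "S": ["D", "U"],  # the agent observes S; the state also affects utility
    "D": ["U"],       # the decision affects utility
    "U": [],
}
node_type = {"S": "chance", "D": "decision", "U": "utility"}

def parents(node):
    """Nodes with an edge into `node`."""
    return sorted(p for p, children in edges.items() if node in children)

# The utility node's parents are exactly the variables that
# directly determine the agent's payoff.
print(parents("U"))  # ['D', 'S']
```

Incentive analysis then reduces to graphical questions about such diagrams, e.g. which nodes the decision can influence on a path to utility.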

Incentives of Powerful RL Agents

The focus of most of my work has been understanding the incentives of powerful AI systems.

General method. There is a general method for inferring agent incentives directly from a graphical model.

Reward tampering. Various ideas in the AGI safety literature can be combined to form RL-like agents without significant incentives to interfere with any aspect of their reward process, be it their reward signal, their utility function, or the online training of their reward function.

If the reward signal can be (accidentally) corrupted, this paper explains why both richer feedback and randomized algorithms (quantilization) improve robustness to reward corruption.
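The intuition behind quantilization can be sketched in a few lines (an illustrative toy, not the paper's formal construction): instead of always taking the argmax action, sample uniformly from the top q-fraction of actions, which bounds how hard a corrupted reward estimate can be exploited.

```python
import random

def quantilize(actions, reward_estimate, q=0.1, rng=random):
    """Pick an action uniformly from the top q-fraction by estimated reward.

    Unlike pure argmax, this limits the probability of selecting an
    action whose estimated reward is corrupted upwards.
    """
    ranked = sorted(actions, key=reward_estimate, reverse=True)
    k = max(1, int(len(ranked) * q))  # size of the top quantile
    return rng.choice(ranked[:k])

actions = list(range(100))
# Suppose action 99 has a hugely overestimated (corrupted) reward:
reward = lambda a: 1000.0 if a == 99 else float(a)
# Argmax would always take the corrupted action; a 10%-quantilizer
# takes it with probability only 1/10.
picked = quantilize(actions, reward, q=0.1, rng=random.Random(0))
```

The trade-off is a lower expected true reward in exchange for robustness against a few badly corrupted estimates.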

Following up on this work, we generalize the framework of CRMDPs in the previous paper to arbitrary forms of feedback, and apply the idea of decoupled feedback to approval-directed agents in a 3D environment with integrated tampering called REALab:

Self-modification. Subtly different design choices lead to systems with or without incentives to replace their goal or utility functions:

Self-preservation and death. AIs may have an incentive not to be turned off.

There is a natural mathematical definition of death in the UAI/AIXI framework. RL agents can be suicidal:

Extending the analysis of a previous paper, we determine the exact conditions for when CIRL agents ignore a shutdown signal:

Decision theory. Strangely, robots and other agents that are part of their environment may be able to infer properties of themselves from their own actions. For example, my having petted a lot of cats in the past may be evidence that I have toxoplasmosis, a disease that makes you fond of cats. Now, if I see a cat, should I avoid petting it to reduce the risk that I have the disease? (Note that petting cats never causes toxoplasmosis.) The two standard answers for how to reason in this situation are called CDT and EDT. We show that CDT and EDT turn into three possibilities for how to reason in sequential settings where multiple actions are interleaved with observations:
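The cat example can be made concrete with a small worked calculation (all probabilities and utilities below are made-up illustrative numbers, not from the paper):

```python
# Toy version of the toxoplasmosis decision. Petting never causes
# the disease; it is only evidence of it.
p_disease = 0.1            # prior probability of toxoplasmosis
p_pet_given_d = 0.9        # the disease makes you fond of cats
p_pet_given_not_d = 0.2
u_pet = 1.0                # enjoying the cat
u_disease = -20.0          # disutility of having the disease

def posterior(p_e_given_h, p_h, p_e_given_not_h):
    """P(H | E) by Bayes' rule."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e

# EDT: condition on your own action as evidence about the disease.
p_d_if_pet = posterior(p_pet_given_d, p_disease, p_pet_given_not_d)
p_d_if_not = posterior(1 - p_pet_given_d, p_disease, 1 - p_pet_given_not_d)
edt_pet = u_pet + p_d_if_pet * u_disease
edt_not = 0.0 + p_d_if_not * u_disease

# CDT: intervening on the action breaks the evidential link,
# so the disease probability stays at its prior either way.
cdt_pet = u_pet + p_disease * u_disease
cdt_not = 0.0 + p_disease * u_disease

# With these numbers, CDT pets the cat (petting adds utility without
# causing the disease), while EDT refuses, because petting raises the
# inferred probability of already having the disease.
```

The interesting question in the paper is what happens to this divergence when many such decisions are interleaved with observations over time.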

Other AI safety papers. An approach to solving the wireheading problem. I now believe this approach has no benefit over TI-unaware reward modeling, described in my reward tampering paper.

Reinforcement Learning

Exploration. A fundamental problem in reinforcement learning is how to explore an unknown environment effectively. Ideally, an exploration strategy should direct us to regions with potentially high reward, while not being too expensive to compute. In the following paper, we find a way to employ standard function approximation techniques to estimate the novelty of different actions, which gives state-of-the-art performance in the popular Arcade Learning Environment while being much cheaper to compute than most alternative strategies:
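The general flavour of count-based novelty estimation can be sketched as follows (a generic illustration, not the paper's exact algorithm; the feature map here is a hypothetical discretisation standing in for learned function approximation):

```python
from collections import Counter
from math import sqrt

counts = Counter()  # visit counts per feature

def feature(state):
    # Hypothetical coarse feature map: nearby states share a feature,
    # so counts generalise across similar states.
    return tuple(round(x, 1) for x in state)

def exploration_bonus(state, beta=0.5):
    """Reward visiting rarely-seen regions of feature space."""
    phi = feature(state)
    counts[phi] += 1
    return beta / sqrt(counts[phi])  # novel features earn a large bonus

# The first visit to a feature earns the full bonus; repeats decay.
b1 = exploration_bonus((0.12, 0.33))
b2 = exploration_bonus((0.14, 0.29))  # maps to the same coarse feature
```

Adding such a bonus to the environment reward biases the agent toward novel regions while remaining cheap to compute.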

Search and Optimisation

Background. Search and optimisation are fundamental aspects of AI and of intelligence in general. Intelligence can actually be defined as optimisation ability (Legg and Hutter, Universal Intelligence: A Definition of Machine Intelligence, 2007).

(No) Free Lunch. The No Free Lunch theorems state that intelligent optimisation is impossible without knowledge about what you're trying to optimise. I argue against these theorems, and show that under a natural definition of complete uncertainty, intelligent (better-than-random) optimisation is possible. Unfortunately, I was also able to show that there are pretty strong limits on how much better intelligent search can be compared to random search.
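The basic No Free Lunch phenomenon can be verified by brute force on a tiny domain (my argument concerns what happens under a different, more natural prior than the uniform one used here):

```python
from itertools import product

# Averaged over ALL objective functions on a tiny domain, every fixed
# probe order finds the same expected best value after k probes.
domain = [0, 1, 2]

def best_after(order, f, k):
    """Best function value found after probing the first k points."""
    return max(f[x] for x in order[:k])

# All 2^3 functions from the domain to {0, 1}.
functions = [dict(zip(domain, vals)) for vals in product([0, 1], repeat=3)]

order_a, order_b = [0, 1, 2], [2, 0, 1]
avg_a = sum(best_after(order_a, f, 2) for f in functions) / len(functions)
avg_b = sum(best_after(order_b, f, 2) for f in functions) / len(functions)
# avg_a == avg_b: under the uniform prior over functions, no search
# order beats any other on average.
```

The uniform prior over functions is exactly the assumption I dispute: under more natural notions of complete uncertainty, some strategies really do beat random search, though only by a bounded amount.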

Optimisation difficulty. In a related paper, we give a formal definition of how hard a function is to optimise:

How to search. Two of the most fundamental strategies for search are DFS and BFS. In DFS, you search depth-first; for example, you follow one path until its very end before trying something else. In BFS, you instead try to search as broadly as possible, focusing on breadth rather than depth. I calculate the expected search times for both methods, and derive some results on which method is preferable in which situations:
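The two strategies differ only in the order nodes are expanded, which a short sketch makes concrete (node names and the example tree are invented for illustration):

```python
from collections import deque

def dfs(tree, root, goal):
    """Depth-first: follow one path to its end before backtracking.
    Returns the number of nodes expanded before finding `goal`."""
    stack, expanded = [root], 0
    while stack:
        node = stack.pop()
        expanded += 1
        if node == goal:
            return expanded
        stack.extend(reversed(tree.get(node, [])))  # left child on top
    return None

def bfs(tree, root, goal):
    """Breadth-first: expand every node at one depth before going deeper."""
    queue, expanded = deque([root]), 0
    while queue:
        node = queue.popleft()
        expanded += 1
        if node == goal:
            return expanded
        queue.extend(tree.get(node, []))
    return None

# A small tree where the goal "f" sits deep on the leftmost path,
# which favours DFS over BFS in this particular instance.
tree = {"a": ["b", "c"], "b": ["d", "e"], "d": ["f", "g"]}
```

Which method expands fewer nodes depends on where the goal tends to sit, which is exactly what the expected-search-time analysis quantifies.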


Logic. In my Bachelor's thesis I studied logic and automated theorem proving.

Selected Talks


I have co-supervised the following students/projects:

Other web presences

Find me on Facebook, Twitter, LinkedIn, Google Scholar, dblp, ORCID.