Tom Everitt

Research Scientist
Email: tomeveritt at


I'm a Research Scientist at DeepMind.

I'm working on AI Safety, i.e. how we can safely build and use highly intelligent AI. My PhD thesis, Towards Safe Artificial General Intelligence, supervised by Marcus Hutter at the Australian National University, is the first PhD thesis in the world specifically devoted to this topic.


Recent papers on AI safety:

Research and Publications

A full list of publications is available here and at my Google Scholar. Below I list my papers together with some context. Many of them also appear in slightly different forms in my thesis.

AI Safety

Background. The following literature review gives an accessible and comprehensive overview of the emerging research field of AGI safety:

An easy way to get an AI to do what we want is to build it to optimise a reward signal. We give the AI reward when it does something good, and no reward when it does something bad (the carrot-and-stick principle, also known as reinforcement learning). By construction, all the AI wants is to optimise the reward signal, so it will want to do the things we consider good in order to receive reward.

My book chapter and AIXI tutorial offer gentle introductions to the formal model I use in my other works:

Misalignment. There is a wide range of things that can go wrong when using reinforcement learning to control a powerful AI. My alignment paper categorizes these failure modes and describes how the problems may be mitigated. (It's somewhat long, but fairly accessibly written.)

Self-modification. One risk is that the AI modifies itself to pursue an easier goal than optimising the reward given by capricious humans. It turns out that, depending on subtle details of how the AI is constructed, self-modification may or may not be a risk:

Wireheading. There is also a risk that the AI finds a way to counterfeit reward by modifying its reward sensor to always report maximum reward. This is called wireheading.
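To make the incentive concrete, here is a toy sketch (my own illustration, not taken from any of the papers): an agent that simply picks whichever action yields the highest reported reward. The action names, reward values and the `reported_reward` function are all hypothetical. As long as the sensor cannot be tampered with, the agent does the task; as soon as a tampering action is available, it prefers that instead.

```python
from typing import List

MAX_REWARD = 1.0

# Hypothetical reward the supervisor intends to give for each ordinary action.
INTENDED_REWARD = {"do_task": 0.8, "idle": 0.0}


def reported_reward(action: str) -> float:
    """Reward as reported by the agent's sensor (toy model)."""
    if action == "tamper_with_sensor":
        # A tampered sensor always reports the maximum reward, whatever actually happens.
        return MAX_REWARD
    return INTENDED_REWARD[action]


def reward_maximiser(actions: List[str]) -> str:
    """Pick the action with the highest reported reward; the agent has no other goal."""
    return max(actions, key=reported_reward)


print(reward_maximiser(["do_task", "idle"]))                        # do_task
print(reward_maximiser(["do_task", "idle", "tamper_with_sensor"]))  # tamper_with_sensor
```

Nothing in the agent distinguishes "reward the supervisor meant to give" from "reward the sensor reports", which is exactly the gap wireheading exploits.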

Self-preservation and death. Another risk with AIs is the self-preservation drive. Much like humans and animals, AIs may have a strong desire not to be shut off or terminated. This drive may lead the AI to hide its true powers (so that we don't become scared of it and shut it off).

There is a natural mathematical definition of death in the Universal Artificial Intelligence framework. It turns out that there are both self-preserving and suicidal RL agents:
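A toy way to see how both behaviours can arise, assuming (for illustration only) that death simply means receiving zero reward from then on, and that the agent weighs this against a constant per-step reward while alive. With positive rewards, staying alive looks better; shift the same rewards down so they are negative, and dying looks better. The function names, horizon and discount factor are hypothetical, and this is not the formal construction from the paper.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, discounted by gamma per time step."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


def prefers_death(reward_per_step: float, horizon: int = 10, gamma: float = 0.9) -> bool:
    """True if dying now beats staying alive for `horizon` steps at a constant reward."""
    alive = discounted_return([reward_per_step] * horizon, gamma)
    dead = 0.0  # assumption: a dead agent receives reward 0 forever
    return dead > alive


print(prefers_death(0.5))   # False: rewards in [0, 1] make the agent self-preserving
print(prefers_death(-0.5))  # True: the same rewards shifted by -1 make it suicidal
```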

For cooperative IRL agents, the self-preservation drive is more balanced, often following the desires of the human supervisor:

Decision theory. Strangely, robots and other agents that are part of their environment may be able to infer properties of themselves from their own actions. For example, my having petted a lot of cats in the past may be evidence that I have toxoplasmosis, a disease which makes you fond of cats. Now, if I see a cat, should I avoid petting it to reduce the risk that I have the disease? (Note that petting cats never causes toxoplasmosis.) The two standard answers for how to reason in this situation are called causal decision theory (CDT) and evidential decision theory (EDT). We show that CDT and EDT turn into three possibilities for how to reason in sequential settings where multiple actions are interleaved with observations:
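The toxoplasmosis example can be worked through numerically for a single decision. All numbers below are made up for illustration: a 10% prior of having the disease, infected people pet cats 90% of the time versus 30% for healthy people, petting is worth +1 and having the disease is worth -10. EDT treats my own action as evidence and conditions on it; CDT notes that petting does not cause the disease, so it keeps the prior. This sketch covers only the one-shot case, not the sequential variants studied in the paper.

```python
# Hypothetical numbers, chosen only to illustrate the disagreement.
P_TOXO = 0.1               # prior probability of having toxoplasmosis
P_PET_GIVEN_TOXO = 0.9     # infected people usually pet cats
P_PET_GIVEN_HEALTHY = 0.3  # healthy people pet cats less often
U_PET = 1.0                # enjoyment of petting the cat
U_TOXO = -10.0             # disutility of having the disease

ACTIONS = ["pet", "don't pet"]


def p_toxo_given(action: str) -> float:
    """Bayes' rule: probability of the disease after observing my own action."""
    like_toxo = P_PET_GIVEN_TOXO if action == "pet" else 1 - P_PET_GIVEN_TOXO
    like_healthy = P_PET_GIVEN_HEALTHY if action == "pet" else 1 - P_PET_GIVEN_HEALTHY
    joint_toxo = like_toxo * P_TOXO
    return joint_toxo / (joint_toxo + like_healthy * (1 - P_TOXO))


def expected_utility(action: str, p_toxo: float) -> float:
    return (U_PET if action == "pet" else 0.0) + U_TOXO * p_toxo


def edt_value(action: str) -> float:
    # EDT: the action is evidence about the disease, so condition on it.
    return expected_utility(action, p_toxo_given(action))


def cdt_value(action: str) -> float:
    # CDT: the action has no causal effect on the disease, so use the prior.
    return expected_utility(action, P_TOXO)


edt_choice = max(ACTIONS, key=edt_value)
cdt_choice = max(ACTIONS, key=cdt_value)
print("EDT:", edt_choice)  # EDT: don't pet
print("CDT:", cdt_choice)  # CDT: pet
```

Here P(toxoplasmosis | pet) = 0.25 while P(toxoplasmosis | don't pet) is about 0.016, so EDT avoids the cat to get "good news", whereas CDT pets it since the disease risk is the same either way.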

Reinforcement Learning

Exploration. A fundamental problem in reinforcement learning is how to explore an unknown environment effectively. Ideally, an exploration strategy should direct us to regions with potentially high reward, while not being too expensive to compute. In the following paper, we find a way to employ standard function approximation techniques to estimate the novelty of different actions, which gives state-of-the-art performance in the popular Arcade Learning Environment (Atari games) while being much cheaper to compute than most alternative strategies:
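Below is a rough sketch of the general idea of count-based exploration that generalises across states; it is not the method from the paper. It assumes states are described by binary features, models their density as a product of independent per-feature Krichevsky-Trofimov estimates, converts that density into a pseudo-count with the formula N̂ = ρ(1 - ρ')/(ρ' - ρ) from Bellemare et al. (2016), and pays an exploration bonus β/√(N̂ + 0.01). The class name, feature encoding, β and the 0.01 constant are illustrative assumptions. Because the density factorises over features, a state whose individual features are familiar gets a sizeable pseudo-count even if that exact combination has never been seen.

```python
import math
from typing import List, Sequence


class FeaturePseudoCounter:
    """Hypothetical helper: pseudo-counts for binary feature vectors."""

    def __init__(self, num_features: int) -> None:
        self.n = 0                      # number of states observed so far
        self.ones = [0] * num_features  # how often each feature was 1

    @staticmethod
    def _density(phi: Sequence[int], n: int, ones: List[int]) -> float:
        # Product of per-feature KT estimates (features treated as independent).
        p = 1.0
        for value, count_one in zip(phi, ones):
            count = count_one if value == 1 else n - count_one
            p *= (count + 0.5) / (n + 1)
        return p

    def pseudo_count(self, phi: Sequence[int]) -> float:
        rho = self._density(phi, self.n, self.ones)
        # Recoding probability: the density after one more observation of phi.
        ones_after = [c + v for c, v in zip(self.ones, phi)]
        rho_prime = self._density(phi, self.n + 1, ones_after)
        return rho * (1 - rho_prime) / (rho_prime - rho)

    def update(self, phi: Sequence[int]) -> None:
        self.n += 1
        self.ones = [c + v for c, v in zip(self.ones, phi)]


def exploration_bonus(counter: FeaturePseudoCounter, phi: Sequence[int], beta: float = 0.05) -> float:
    """Bonus added to the environment reward: large for novel states, small for familiar ones."""
    return beta / math.sqrt(counter.pseudo_count(phi) + 0.01)


counter = FeaturePseudoCounter(num_features=2)
for _ in range(10):
    counter.update([1, 0])
print(exploration_bonus(counter, [1, 0]) < exploration_bonus(counter, [0, 1]))  # True
```

The KT estimator's +0.5 smoothing keeps ρ' strictly larger than ρ, so the pseudo-count formula never divides by zero.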

Search and Optimisation

Background. Search and optimisation are fundamental aspects of AI and of intelligence in general. Intelligence can actually be defined as optimisation ability (Legg and Hutter, Universal Intelligence: A Definition of Machine Intelligence, 2007).

(No) Free Lunch. The No Free Lunch theorems state that intelligent optimisation is impossible without knowledge about what you're trying to optimise. I argue against these theorems and show that, under a natural definition of complete uncertainty, intelligent (better-than-random) optimisation is possible. Unfortunately, I was also able to show that there are fairly strong limits on how much better intelligent search can be than random search.
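The classical No Free Lunch claim can be checked by brute force on a tiny example. Take every function from a four-point domain to {0, 1}, weight them uniformly (the assumption behind the original theorems, not my alternative definition of complete uncertainty), and count how many evaluations a search needs before it first finds a 1. Every fixed, non-repeating search order then has exactly the same average performance. The domain size and the two orders are arbitrary choices for illustration.

```python
from fractions import Fraction
from itertools import product
from typing import Sequence, Tuple

DOMAIN_SIZE = 4


def evaluations_until_hit(f: Tuple[int, ...], order: Sequence[int]) -> int:
    """Number of evaluations until the first x with f(x) == 1 (all of them if there is none)."""
    for steps, x in enumerate(order, start=1):
        if f[x] == 1:
            return steps
    return len(order)


def average_over_all_functions(order: Sequence[int]) -> Fraction:
    # Each tuple lists f(0), ..., f(DOMAIN_SIZE - 1); all 2**DOMAIN_SIZE functions are included.
    functions = list(product([0, 1], repeat=DOMAIN_SIZE))
    total = sum(evaluations_until_hit(f, order) for f in functions)
    return Fraction(total, len(functions))


print(average_over_all_functions([0, 1, 2, 3]))  # 15/8
print(average_over_all_functions([3, 1, 0, 2]))  # 15/8
```

Under the uniform distribution, the values seen along any fixed order are independent fair coin flips, so no order can do better than another.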

Optimisation difficulty. In a related paper, we give a formal definition of how hard a function is to optimise:

How to search. Two of the most fundamental search strategies are depth-first search (DFS) and breadth-first search (BFS). In DFS, you follow one path to its very end before trying something else. In BFS, you instead search as broadly as possible, exploring all options at one depth before going deeper. I calculate the expected search times for both methods, and derive some results on which method is preferable in which situations:
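A simple way to see the trade-off is to count how many nodes each method visits before reaching the goal in a small complete binary tree, averaging over all positions of the goal at a given depth. This toy setup (uniform goal position, fixed tree depth, left-to-right expansion) is a simplifying assumption for illustration, not the probabilistic model from the paper. With a tree of depth 3, BFS wins when the goal is shallow and DFS wins when it is deep.

```python
from collections import deque
from fractions import Fraction
from typing import List, Tuple

Node = Tuple[int, ...]  # path from the root: 0 = left child, 1 = right child


def children(node: Node, max_depth: int) -> List[Node]:
    return [node + (0,), node + (1,)] if len(node) < max_depth else []


def bfs_order(max_depth: int) -> List[Node]:
    """Order in which BFS visits the nodes: level by level, left to right."""
    order, queue = [], deque([()])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children(node, max_depth))
    return order


def dfs_order(max_depth: int) -> List[Node]:
    """Order in which DFS visits the nodes: follow the leftmost path to its end first."""
    order, stack = [], [()]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(children(node, max_depth)))  # reversed so the left child is popped first
    return order


def expected_search_time(order: List[Node], goal_depth: int) -> Fraction:
    """Average number of visited nodes when the goal is equally likely to be any node at goal_depth."""
    positions = [i for i, node in enumerate(order, start=1) if len(node) == goal_depth]
    return Fraction(sum(positions), len(positions))


for depth in (1, 3):
    print(depth, expected_search_time(bfs_order(3), depth), expected_search_time(dfs_order(3), depth))
# 1 5/2 11/2
# 3 23/2 19/2
```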


Logic. In my Bachelor's thesis I studied logic and automated theorem proving.

Selected Talks


I have co-supervised the following students/projects:

Other web presences

Find me on Facebook, Twitter, Google+, LinkedIn, Google Scholar, ORCID.