Tom Everitt

About

I'm a Staff Research Scientist at Google DeepMind.

I'm working on AGI Safety, i.e. how we can safely build and use highly intelligent AI.
My PhD thesis Towards Safe Artificial General Intelligence is the first PhD thesis specifically devoted to this topic.
After my PhD, I led the Causal Incentives Working Group building towards a theory of alignment based on Pearlian causality, summarised in this blog post sequence.
I'm currently exploring approaches to AGI safety based on amplification of human agency.

Research and Publications

A full list of publications is available here and at my dblp and Google Scholar.

Below I list my papers together with some context.

Overviews and Introductions

An overview of how Pearlian causality can serve as a foundation for key AGI safety problems:

Towards Causal Foundations of Safe AGI.
By the causal incentives group
Alignmentforum, 2023.

An accessible and comprehensive overview of the emerging research field of AGI safety:

AGI Safety Literature Review
Tom Everitt, Gary Lea, and Marcus Hutter
In International Joint Conference on AI (IJCAI) and arXiv, 2018.

A machine learning research agenda for how to build safe AGI:

Scalable agent alignment via reward modeling: a research direction
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
In arXiv and blog post, 2018. Two Minute Papers video.

The UAI/AIXI framework is a formal model of reinforcement learning in general environments. Many of my other works are based on variations of this framework:

Universal Artificial Intelligence: Practical Agents and Fundamental Challenges
Tom Everitt and Marcus Hutter, 2016.
In Foundations of Trusted Autonomy and AGI-16 Tutorial slides, video.

Gridworlds make AGI safety problems very concrete:

AI Safety Gridworlds
Jan Leike, Miljan Martic, Victoria Krakovna, Pedro Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg
In arXiv and GitHub, 2017. Computerphile video.

Many AGI safety frameworks can be modeled and compared with causal influence diagrams:

Modeling AGI Safety Frameworks with Causal Influence Diagrams
Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg
In IJCAI AI Safety Workshop and arXiv, 2019.

Incentives of Powerful RL Agents

The focus of most of my work has been to understand the incentives of powerful AI systems.

General method. There is a general method for inferring agent incentives directly from a graphical model.

Agent Incentives: A Causal Perspective
Tom Everitt, Ryan Carey, Eric Langlois, Pedro Ortega, Shane Legg
In AAAI and arXiv, 2021.
Understanding Agent Incentives using Causal Influence Diagrams (mostly superseded by AI:ACP)
Tom Everitt, Pedro A. Ortega, Elizabeth Barnes, Shane Legg
In arXiv and blog post, 2019. Independent Chinese translation.
The Incentives that Shape Behavior (mostly superseded by AI:ACP)
Ryan Carey, Eric Langlois, Tom Everitt, Shane Legg
In arXiv and blog post, and to be presented at the SafeAI AAAI workshop, 2020. Independent Chinese translation.

Steps towards generalizing these methods to multi-decision and multi-agent situations:

Causal Reasoning in Games
Lewis Hammond, James Fox, Tom Everitt, Alessandro Abate, Michael Wooldridge
working paper, 2022.
A Complete Criterion for Value of Information in Soluble Influence Diagrams
Chris van Merwijk, Ryan Carey, Tom Everitt
In AAAI and arXiv, 2022.
Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice
Lewis Hammond, James Fox, Tom Everitt, Alessandro Abate, Michael Wooldridge
In AAMAS and arXiv, 2021.

and firming up the foundations:

Discovering Agents
Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, Tom Everitt
In arXiv, DeepMind blog, and alignmentforum, 2022.

Fairness. When is unfairness incentivised? Perhaps surprisingly, unfairness can be incentivized even when labels are completely fair:

Why Fair Labels Can Yield Unfair Predictions: Graphical Conditions for Introduced Unfairness
Carolyn Ashurst, Ryan Carey, Silvia Chiappa, Tom Everitt
In AAAI and arXiv, 2022.

Reward tampering. Various ideas in the AGI safety literature can be combined to form RL-like agents without significant incentives to interfere with any aspect of their reward process, be it their reward signal, their utility function, or the online training of their reward function.

Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective
Tom Everitt and Marcus Hutter
In Synthese, arXiv and blog post, 2021. Independent Chinese translation.
The Alignment Problem for Bayesian History-Based Reinforcement Learners
Tom Everitt and Marcus Hutter.
Technical report, 2018.
Winner of the AI Alignment Prize.
Path-Specific Objectives for Safer Agent Incentives
Sebastian Farquhar, Ryan Carey, Tom Everitt
In AAAI and arXiv, 2022.

If the reward signal can be (accidentally) corrupted, this paper explains why both richer feedback and randomized algorithms (quantilization) improve robustness to reward corruption.

Reinforcement Learning with Corrupted Reward Channel
Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg.
In IJCAI-17 and arXiv, 2017. Blog post, Slides, Victoria's talk.
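
To make the quantilization idea concrete, here is a minimal sketch, not the paper's formal setup. It assumes a toy problem with ten states in which one state's reward sensor is corrupted. The state numbering, reward values, and function names are all hypothetical. A pure maximizer of observed reward walks straight into the corrupted state, while a quantilizer that picks uniformly among the top half of states only lands there occasionally.

```python
import random

def top_quantile(observed_reward, q):
    """Return the top q-fraction of options, ranked by observed reward."""
    ranked = sorted(observed_reward, key=observed_reward.get, reverse=True)
    k = max(1, int(len(ranked) * q))
    return ranked[:k]

def quantilize(observed_reward, q, rng):
    """Choose uniformly at random among the top q-fraction of options."""
    return rng.choice(top_quantile(observed_reward, q))

# Hypothetical toy problem: state s has true reward s/10, except state 9,
# whose true reward is 0.0 but whose corrupted sensor reports 1.0.
true_reward = {s: s / 10 for s in range(9)}
true_reward[9] = 0.0
observed_reward = dict(true_reward)
observed_reward[9] = 1.0

# A maximizer of observed reward picks the corrupted state.
maximizer_state = max(observed_reward, key=observed_reward.get)
maximizer_true = true_reward[maximizer_state]

# A quantilizer with q = 0.5 spreads its choice over the five best-looking states.
candidates = top_quantile(observed_reward, 0.5)
quantilizer_true = sum(true_reward[s] for s in candidates) / len(candidates)

sampled_state = quantilize(observed_reward, 0.5, random.Random(0))
```

In this toy problem the maximizer obtains true reward 0.0, whereas the quantilizer's expected true reward is the average over its five candidates, only one of which is corrupted.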

Following up on this work, we generalize the framework of CRMDPs in the previous paper to arbitrary forms of feedback, and apply the idea of decoupled feedback to approval-directed agents in a 3D environment with integrated tampering called REALab:

REALab: An Embedded Perspective on Tampering
Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg
In arXiv and DMSR blog post, 2020. Independent Chinese translation.
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg
In arXiv and DMSR blog post, 2020. Independent Chinese translation.

Corrigibility. Different RL algorithms react differently to user intervention. The differences can be analyzed with causal influence diagrams:

How RL Agents Behave when their Actions are Modified
Eric Langlois, Tom Everitt
In AAAI and arXiv, 2021.

Self-modification. Subtly different design choices lead to systems with or without incentives to replace their goal or utility functions:

Self-Modification of Policy and Utility Function in Rational Agents
Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter.
In AGI-16 and arXiv, 2016. Slides, video.
Winner of the Kurzweil prize for best AGI paper.

Self-preservation and death. AIs may have an incentive not to be turned off.

There is a natural mathematical definition of death in the UAI/AIXI framework. RL agents can be suicidal:

Death and Suicide in Universal Artificial Intelligence
Jarryd Martin, Tom Everitt, and Marcus Hutter.
In AGI-16 and arXiv, 2016. Slides.

Extending the analysis of a previous paper, we determine the exact conditions for when CIRL agents ignore a shutdown signal:

A Game-Theoretic Analysis of the Off-Switch Game
Tobias Wängberg, Mikael Böörs, Elliot Catt, Tom Everitt, and Marcus Hutter
In AGI-17 and arXiv, 2017.

Decision theory. Strangely, robots and other agents that are part of their environment may be able to infer properties of themselves from their own actions. For example, my having petted a lot of cats in the past may be evidence that I have toxoplasmosis, a disease which makes you fond of cats. Now, if I see a cat, should I avoid petting it to reduce the risk that I have the disease? (Note that petting cats never causes toxoplasmosis.) The two standard answers for how to reason in this situation are called causal decision theory (CDT) and evidential decision theory (EDT). We show that CDT and EDT turn into three possibilities for how to reason in sequential settings where multiple actions are interleaved with observations:

Sequential Extensions of Causal and Evidential Decision Theory.
Tom Everitt, Jan Leike, and Marcus Hutter.
In Algorithmic Decision Theory (ADT) and arXiv, 2015. Slides, source code.
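
The toxoplasmosis example can be worked through numerically. The sketch below covers only the one-shot case, not the sequential extensions studied in the paper, and every number in it (the prior, the petting probabilities, the utilities) is an assumption chosen for illustration. EDT conditions on the action as evidence about the disease, while CDT keeps the prior because petting has no causal effect on the disease.

```python
# Hypothetical numbers for the one-shot toxoplasmosis decision.
P_DISEASE = 0.5                           # prior probability of having the disease
P_PET_GIVEN = {True: 0.8, False: 0.2}     # P(pet | disease), P(pet | healthy)
U_PET = 1.0                               # enjoyment from petting the cat
U_DISEASE = -10.0                         # cost of having the disease

def posterior_disease(pet):
    """P(disease | action), treating the action as evidence (Bayes' rule)."""
    like_d = P_PET_GIVEN[True] if pet else 1 - P_PET_GIVEN[True]
    like_h = P_PET_GIVEN[False] if pet else 1 - P_PET_GIVEN[False]
    joint_d = like_d * P_DISEASE
    joint_h = like_h * (1 - P_DISEASE)
    return joint_d / (joint_d + joint_h)

def edt_value(pet):
    """Evidential expected utility: condition on the action."""
    return (U_PET if pet else 0.0) + posterior_disease(pet) * U_DISEASE

def cdt_value(pet):
    """Causal expected utility: the action does not change P(disease)."""
    return (U_PET if pet else 0.0) + P_DISEASE * U_DISEASE

edt_pets = max([True, False], key=edt_value)
cdt_pets = max([True, False], key=cdt_value)
```

Under these assumed numbers EDT declines to pet the cat (not petting looks like good news about the disease), while CDT pets it (the enjoyment is the only thing the action can change).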

Other AI safety papers. An approach to solve the wireheading problem. I now believe this approach has no benefit over TI-unaware reward modeling, described in my reward tampering paper.

Avoiding Wireheading with Value Reinforcement Learning
Tom Everitt and Marcus Hutter.
In AGI-16 and arXiv, 2016. Slides, video. Source code: download, view online.

Reinforcement Learning

Exploration. A fundamental problem in reinforcement learning is how to explore an unknown environment effectively. Ideally, an exploration strategy should direct us to regions with potentially high reward, while not being too expensive to compute. In the following paper, we find a way to employ standard function approximation techniques to estimate the novelty of different actions, which gives state-of-the-art performance in the popular Arcade Learning Environment while being much cheaper to compute than most alternative strategies:

Count-Based Exploration in Feature Space for Reinforcement Learning.
Jarryd Martin, Suraj Narayanan S, Tom Everitt, and Marcus Hutter
In IJCAI-17 and arXiv, 2017.
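
The general shape of a count-based exploration bonus can be sketched as follows. This is a deliberately simplified illustration, not the estimator from the paper: it counts exact feature vectors and pays a bonus that shrinks with the square root of the count. The class name, the `beta` parameter, and the bonus formula are assumptions made for the example.

```python
import math
from collections import Counter

class FeatureCountBonus:
    """Exploration bonus that decays as a feature vector is seen more often."""

    def __init__(self, beta=0.05):
        self.beta = beta          # scale of the bonus (hypothetical default)
        self.counts = Counter()   # visit counts per feature vector

    def bonus(self, features):
        """Record one visit to `features` and return beta / sqrt(count)."""
        key = tuple(features)
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])

explorer = FeatureCountBonus(beta=1.0)
bonuses = [explorer.bonus((1, 0)) for _ in range(4)]   # same features, four visits
novel_bonus = explorer.bonus((0, 1))                   # previously unseen features
```

An agent would add this bonus to the environment reward, so that rarely seen feature vectors look temporarily more attractive than familiar ones.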

Search and Optimisation

Background. Search and optimisation are fundamental aspects of AI and of intelligence in general. Intelligence can actually be defined as optimisation ability (Legg and Hutter, Universal Intelligence: A Definition of Machine Intelligence, 2007).

(No) Free Lunch. The No Free Lunch theorems state that intelligent optimisation is impossible without knowledge about what you're trying to optimise. I argue against these theorems, and show that under a natural definition of complete uncertainty, intelligent (better-than-random) optimisation is possible. Unfortunately, I was also able to show that there are pretty strong limits on how much better intelligent search can be compared to random search.

Free Lunch for Optimisation under the Universal Distribution.
Tom Everitt, Tor Lattimore, and Marcus Hutter.
In IEEE Congress on Evolutionary Computation (CEC) and arXiv, 2014. Slides.
Universal Induction and Optimisation: No Free Lunch?
Tom Everitt. Supervised by Tor Lattimore, Peter Sunehag, and Marcus Hutter at ANU.
Master thesis, Department of Mathematics, Stockholm University, 2013.
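
The classical No Free Lunch statement can be checked by brute force on a tiny problem. The sketch below illustrates only the standard setting (a uniform distribution over all functions), not the universal-distribution setting of the papers above. The domain size, value set, and the two search orders are arbitrary assumptions. It enumerates every function from a four-point domain to three values and compares how many evaluations two fixed search orders need to find a maximum.

```python
from itertools import product

def evals_to_find_max(f, order):
    """Number of evaluations until a maximiser of f is first queried."""
    best = max(f)
    for step, x in enumerate(order, start=1):
        if f[x] == best:
            return step

DOMAIN_SIZE = 4
VALUES = (0, 1, 2)

# Every function from {0, 1, 2, 3} to VALUES, encoded as a tuple of outputs.
all_functions = list(product(VALUES, repeat=DOMAIN_SIZE))

order_a = (0, 1, 2, 3)   # left-to-right sweep
order_b = (3, 1, 0, 2)   # an arbitrary different sweep

total_a = sum(evals_to_find_max(f, order_a) for f in all_functions)
total_b = sum(evals_to_find_max(f, order_b) for f in all_functions)
```

Summed over all functions, the two orders need exactly the same number of evaluations, which is the No Free Lunch phenomenon in miniature. The results above concern what changes when the uniform distribution is replaced by the universal distribution.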

Optimisation difficulty. In a related paper, we give a formal definition of how hard a function is to optimise:

Can we measure the difficulty of an optimization problem?
Tansu Alpcan, Tom Everitt, and Marcus Hutter.
In IEEE Information Theory Workshop (ITW), 2014. PDF © IEEE.

How to search. Two of the most fundamental search strategies are depth-first search (DFS) and breadth-first search (BFS). In DFS, you follow one path until its very end before trying something else. In BFS, you instead search as broadly as possible, exploring all options at one depth before going deeper. I calculate the expected search times for both methods, and derive some results on which method is preferable in which situations:

Analytical Results on the BFS vs. DFS Algorithm Selection Problem. Part I, Tree Search.
Tom Everitt and Marcus Hutter.
In 28th Australasian Joint Conference on AI and arXiv, 2015. Slides, Source Code.
Analytical Results on the BFS vs. DFS Algorithm Selection Problem. Part II, Graph Search.
Tom Everitt and Marcus Hutter.
In 28th Australasian Joint Conference on AI and arXiv, 2015. Slides, Source Code.
Analytical Algorithm Selection for AI Search: BFS vs. DFS
Tom Everitt and Marcus Hutter.
In preparation, 2017. Source Code.
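
As a minimal illustration of the two strategies described above (not the analysis from the papers), the sketch below runs both searches on a small complete binary tree and counts how many nodes each visits before reaching a goal. The heap-style tree encoding, the function names, and the goal positions are assumptions made for the example.

```python
from collections import deque

def children(node, depth_limit):
    """Children of `node` in a complete binary tree numbered heap-style (root = 1)."""
    # Node n sits at depth n.bit_length() - 1; nodes at depth_limit are leaves.
    if node.bit_length() - 1 >= depth_limit:
        return []
    return [2 * node, 2 * node + 1]

def search(goal, depth_limit, depth_first):
    """Return the number of nodes visited until `goal` is found, or None."""
    frontier = deque([1])
    visited = 0
    while frontier:
        # DFS takes the most recently added node, BFS the oldest one.
        node = frontier.pop() if depth_first else frontier.popleft()
        visited += 1
        if node == goal:
            return visited
        kids = children(node, depth_limit)
        # For DFS, push children reversed so the left child is explored first.
        frontier.extend(reversed(kids) if depth_first else kids)
    return None

DEPTH = 3
shallow_goal = 3   # right child of the root, at depth 1
deep_goal = 8      # leftmost leaf, at depth 3

bfs_shallow = search(shallow_goal, DEPTH, depth_first=False)
dfs_shallow = search(shallow_goal, DEPTH, depth_first=True)
bfs_deep = search(deep_goal, DEPTH, depth_first=False)
dfs_deep = search(deep_goal, DEPTH, depth_first=True)
```

On this tree BFS reaches the shallow goal after 3 visits where DFS needs 9, while DFS reaches the deep leftmost goal after 4 visits where BFS needs 8. Which method is preferable depends on where goals tend to lie.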

Other

Game theory. What's the key difference between Prisoner's Dilemma, Battle of the Sexes and other "standard games"? How many interestingly different 2-player games are there?

Classification by decomposition: a novel approach to classification of symmetric 2×2 games
Mikael Böörs, Tobias Wängberg, Tom Everitt & Marcus Hutter
In Theory and Decision, 2021.

Logic. In my Bachelor's thesis I studied logic and automated theorem proving.

Automated Theorem Proving.
Tom Everitt. Supervised by Rikard Bøgvad.
Bachelor thesis, Department of Mathematics, Stockholm University, 2010.

Selected Talks

Towards Causal Foundations of Safe AI
Tutorial at UAI, 2023 and AAAI, 2023
(slides, video).

Other web presences

Find me on Bluesky, Twitter, Facebook, LinkedIn, Google Scholar, dblp, ORCID.