Tom Everitt
Research Scientist, DeepMind
Email: tomeveritt at google.com
I'm a research scientist at DeepMind.
I'm working on AGI Safety, i.e. how we can safely build and use highly intelligent AI. My PhD thesis Towards Safe Artificial General Intelligence is the first PhD thesis specifically devoted to this topic. It was supervised by Marcus Hutter at the Australian National University.
REALab: Conceptualising the Tampering Problem. Ramana Kumar, Jonathan Uesato, Tom Everitt, Victoria Krakovna, Richard Ngo, Shane Legg. Paper 1: REALab and Paper 2: Decoupled Approval on arXiv, and DMSR blog post, 2020. Independent Chinese translation.
Below I list my papers together with some context. Many of them also appear in slightly different forms in my thesis.
An accessible and comprehensive overview of the emerging research field of AGI safety:
A machine learning research agenda for how to build safe AGI:
Scalable agent alignment via reward modeling: a research direction. Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg. In arXiv and blog post, 2018. Two Minute Papers video.
The UAI/AIXI framework is a formal model of reinforcement learning in general environments. Many of my other works are based on variations of this framework:
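For reference, AIXI's action selection can be written compactly (following Hutter's standard finite-horizon formulation, with $U$ a universal monotone Turing machine, $q$ ranging over programs, $\ell(q)$ the length of $q$, and $m$ the horizon):

```latex
a_t := \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m}
  \big[ r_t + \cdots + r_m \big]
  \sum_{q \,:\, U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

That is, the agent picks actions that maximize expected future reward under a Solomonoff-style mixture over all computable environments.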
Gridworlds make AGI safety problems very concrete:
The focus of most of my work has been to understand the incentives of powerful AI systems.
General method. There is a general method for inferring agent incentives directly from a graphical model; a toy sketch of the core graphical criterion follows the paper below.
Understanding Agent Incentives using Causal Influence Diagrams. Tom Everitt, Pedro A. Ortega, Elizabeth Barnes, Shane Legg. In arXiv and blog post, 2019. Independent Chinese translation.
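To give a flavour of these criteria, here is a toy illustration (my simplified sketch, not code from the paper): roughly, an agent can only have a control incentive on a variable X if a directed path from its decision D through X to its utility U exists. A few lines of networkx check this necessary condition:

```python
import networkx as nx

def may_have_control_incentive(cid, decision, variable, utility):
    """Simplified graphical test: a control incentive on `variable` requires a
    directed path decision -> variable -> utility. (The paper's criteria add
    further conditions; this is only the core necessary one.)"""
    return (nx.has_path(cid, decision, variable)
            and nx.has_path(cid, variable, utility))

# Toy causal influence diagram: decision D, chance nodes X and Y, utility U.
cid = nx.DiGraph([("D", "X"), ("X", "U"), ("D", "Y")])
print(may_have_control_incentive(cid, "D", "X", "U"))  # True: D -> X -> U
print(may_have_control_incentive(cid, "D", "Y", "U"))  # False: Y never affects U
```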
Reward tampering. Various ideas in the AGI safety literature can be combined to form RL-like agents without significant incentives to interfere with any aspect of their reward process, be it their reward signal, their utility function, or the online training of their reward function (a code sketch of one such design follows the paper below).
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective. Tom Everitt and Marcus Hutter. In arXiv and blog post, 2019. Independent Chinese translation.
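One of the paper's proposed designs, current-RF optimization, is easy to sketch. In the toy code below (model, reward_fn, and action_seqs are hypothetical stand-ins, not code from the paper), imagined rollouts are scored with a snapshot of the agent's current reward function, so futures in which the reward function itself gets rewritten earn no extra credit:

```python
import copy

def plan_with_current_rf(model, reward_fn, state, action_seqs, gamma=0.99):
    """Score imagined futures with a snapshot of the *current* reward function
    (for a learned reward model, deepcopy freezes its parameters), rather than
    with whatever reward signal the environment will emit after tampering."""
    frozen_rf = copy.deepcopy(reward_fn)
    def score(actions):
        s, total, discount = state, 0.0, 1.0
        for a in actions:
            s = model(s, a)                   # predicted next state
            total += discount * frozen_rf(s)  # today's values, not tomorrow's
            discount *= gamma
        return total
    return max(action_seqs, key=score)
```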
If the reward signal can be (accidentally) corrupted, this paper explains why both richer feedback and randomized algorithms (quantilization) improve robustness to reward corruption.
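The quantilization idea itself fits in a few lines (a toy sketch with made-up values, not the paper's setup): instead of the argmax, which actively seeks out whichever action has the highest (possibly corrupted) estimate, sample uniformly among the top q-fraction of actions:

```python
import random

def quantilize(actions, value_estimate, q=0.1):
    """Pick uniformly at random among the top q-fraction of actions. If only
    a few estimates are corrupted upward, argmax finds them almost surely,
    while a quantilizer selects one only rarely."""
    ranked = sorted(actions, key=value_estimate, reverse=True)
    top = ranked[:max(1, int(q * len(ranked)))]
    return random.choice(top)

values = {a: a / 100 for a in range(100)}
values[13] = 10**6                                # one spuriously huge estimate
print(max(range(100), key=values.get))            # argmax: always picks 13
print(quantilize(range(100), values.get, q=0.1))  # usually a genuinely good action
```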
Following up on this work, we generalize the CRMDP framework from the previous paper to arbitrary forms of feedback, and apply the idea of decoupled feedback to approval-directed agents in REALab, a 3D environment with built-in tampering possibilities (a sketch of the decoupling idea follows the paper below):
Avoiding Tampering Incentives in Deep RL via Decoupled Approval. Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg. In arXiv and DMSR blog post, 2020. Independent Chinese translation.
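The core decoupling trick can be sketched in a few lines (policy.sample, policy.update, and approval_fn below are hypothetical stand-ins, not the paper's API): the executed action and the action submitted for approval are sampled independently, so no action can corrupt the feedback used to evaluate itself:

```python
def decoupled_approval_step(policy, state, approval_fn):
    """One interaction step with decoupled approval. The action that acts on
    the world (and might tamper with the feedback channel) is never the
    action whose approval is collected this step."""
    act = policy.sample(state)             # executed in the environment
    query = policy.sample(state)           # independently sampled for feedback
    feedback = approval_fn(state, query)   # approval of the query action only
    policy.update(state, query, feedback)  # reinforce/penalise the query action
    return act
```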
Self-modification. Subtly different design choices lead to systems with or without incentives to replace their goal or utility functions:
Self-Modification of Policy and Utility Function in Rational Agents. Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter. In AGI-16 and arXiv, 2016. Slides, video. Winner of the Kurzweil prize for best AGI paper.
Self-preservation and death. AIs may have an incentive not to be turned off.
There is a natural mathematical definition of death in the UAI/AIXI framework. RL agents can be suicidal:
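A sketch of the definition (simplified from Death and Suicide in Universal Artificial Intelligence, AGI-16; I gloss over details): model death as an absorbing state with reward 0 forever. Under a semimeasure $\nu$, which may sum to less than one, the death probability at time $t$ can then be identified with the lost probability mass:

```latex
P(\text{death at } t \mid h_{<t}) \;=\; 1 - \sum_{e_t} \nu(e_t \mid h_{<t})
```

Here $e_t$ ranges over percepts and $h_{<t}$ is the interaction history. Whether an agent seeks or avoids death then depends on where reward 0 sits relative to the rest of its reward range.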
Extending the analysis of a previous paper, we determine the exact conditions under which CIRL agents ignore a shutdown signal:
Decision theory. Strangely, robots and other agents that are part of their environment may be able to infer properties of themselves from their own actions. For example, my having petted a lot of cats in the past may be evidence that I have toxoplasmosis, a disease which makes you fond of cats. Now, if I see a cat, should I avoid petting it to reduce the risk that I have the disease? (Note that petting cats never causes toxoplasmosis.) The two standard answers for how to reason in this situation are called CDT and EDT. We show that CDT and EDT turn into three possibilities for how to reason in sequential settings where multiple actions are interleaved with observations:
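A toy calculation (all numbers made up) shows how the two recommendations come apart: EDT treats petting as evidence about toxoplasmosis, while CDT's intervention leaves the disease probability untouched.

```python
# T = has toxoplasmosis; petting never causes T, but people with T pet more.
p_T = 0.1
p_pet_T, p_pet_notT = 0.9, 0.3          # made-up petting rates given T / not-T
u = lambda pet, T: (1.0 if pet else 0.0) + (-10.0 if T else 0.0)

# EDT conditions on the action as evidence: P(T | action) via Bayes' rule.
p_pet = p_pet_T * p_T + p_pet_notT * (1 - p_T)
p_T_pet = p_pet_T * p_T / p_pet
p_T_skip = (1 - p_pet_T) * p_T / (1 - p_pet)
edt_pet = p_T_pet * u(True, True) + (1 - p_T_pet) * u(True, False)
edt_skip = p_T_skip * u(False, True) + (1 - p_T_skip) * u(False, False)

# CDT intervenes: do(pet) leaves P(T) unchanged.
cdt_pet = p_T * u(True, True) + (1 - p_T) * u(True, False)
cdt_skip = p_T * u(False, True) + (1 - p_T) * u(False, False)

print(f"EDT: pet={edt_pet:.2f}, skip={edt_skip:.2f}")  # EDT refuses to pet
print(f"CDT: pet={cdt_pet:.2f}, skip={cdt_skip:.2f}")  # CDT pets the cat
```

With these numbers EDT prefers not petting (-0.16 vs -1.50), while CDT prefers petting (0.00 vs -1.00).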
Other AI safety papers. An approach to solving the wireheading problem. I now believe this approach has no benefit over TI-unaware reward modeling, described in my reward tampering paper.
Exploration. A fundamental problem in reinforcement learning is how to explore an unknown environment effectively. Ideally, an exploration strategy should direct us to regions with potentially high reward, while not being too expensive to compute. In the following paper, we find a way to employ standard function approximation techniques to estimate the novelty of different actions, which gives state-of-the-art performance in the popular Arcade Learning Environment while being much cheaper to compute than most alternative strategies:
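In simplified form (a stand-in for the paper's feature-space density model, not its actual code), the idea is to count visits in a discretized feature space and pay an optimism bonus that shrinks with familiarity:

```python
from collections import defaultdict
import numpy as np

def exploration_bonus(features, counts, beta=0.1):
    """Crude count-based novelty bonus: discretize the feature vector, count
    visits to that cell, and return beta / sqrt(count). The IJCAI-17 paper
    uses a proper density model over features instead of raw binning."""
    key = tuple(np.round(features, 1))
    counts[key] += 1
    return beta / np.sqrt(counts[key])

counts = defaultdict(int)
phi = np.array([0.31, -1.22])           # hypothetical state features
bonus = exploration_bonus(phi, counts)  # add this to the environment reward
```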
Background. Search and optimisation are fundamental aspects of AI and of intelligence in general. Intelligence can actually be defined as optimisation ability (Legg and Hutter, Universal Intelligence: A Definition of Machine Intelligence, 2007).
(No) Free Lunch. The No Free Lunch theorems state that intelligent optimisation is impossible without knowledge of what you're trying to optimise. I argue that these theorems rest on an unnatural notion of complete uncertainty, and show that under a natural definition of complete uncertainty, intelligent (better-than-random) optimisation is possible. Unfortunately, I was also able to show that there are pretty strong limits on how much better intelligent search can be compared to random search. (The classic NFL statement is reproduced after the thesis below.)
Universal Induction and Optimisation: No Free Lunch? Tom Everitt. Supervised by Tor Lattimore, Peter Sunehag, and Marcus Hutter at ANU. Master's thesis, Department of Mathematics, Stockholm University, 2013.
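For context, the classic Wolpert-Macready statement in one common form: for any two black-box search algorithms $A$ and $B$, performance averaged over all objective functions $f : \mathcal{X} \to \mathcal{Y}$ is identical,

```latex
\sum_{f} P(d^y_m \mid f, m, A) \;=\; \sum_{f} P(d^y_m \mid f, m, B)
```

where $d^y_m$ is the sequence of objective values observed after $m$ distinct queries.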
Optimisation difficulty. In a related paper, we give a formal definition of how hard a function is to optimise:
How to search. Two of the most fundamental search strategies are DFS and BFS. In DFS, you search depth-first: you follow one path to its very end before trying something else. In BFS, you instead search as broadly as possible, focusing on breadth rather than depth. I calculate the expected search times for both methods, and derive some results on which method is preferable in which situations (a minimal sketch of the two strategies follows the papers below):
Analytical Results on the BFS vs. DFS Algorithm Selection Problem. Part I, Tree Search. Tom Everitt and Marcus Hutter. In 28th Australasian Joint Conference on AI and arXiv, 2015. Slides, Source Code.
Analytical Results on the BFS vs. DFS Algorithm Selection Problem. Part II, Graph Search. Tom Everitt and Marcus Hutter. In 28th Australasian Joint Conference on AI and arXiv, 2015. Slides, Source Code.
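For concreteness, a minimal tree-search sketch (toy code, not from the papers); the only difference between the two is the frontier discipline, a FIFO queue versus a LIFO stack:

```python
from collections import deque

def bfs(tree, root, goal):
    """Breadth-first: visit all nodes at depth d before any node at depth d+1."""
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()        # FIFO queue -> breadth first
        if node == goal:
            return node
        frontier.extend(tree.get(node, []))

def dfs(tree, root, goal):
    """Depth-first: follow one path to its very end before backtracking."""
    frontier = [root]
    while frontier:
        node = frontier.pop()            # LIFO stack -> depth first
        if node == goal:
            return node
        frontier.extend(tree.get(node, []))

tree = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}
print(bfs(tree, "a", "e"), dfs(tree, "a", "e"))  # both find "e", in different orders
```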
Automated Theorem Proving. Tom Everitt, supervised by Rikard Bøgvad. Bachelor's thesis, Department of Mathematics, Stockholm University, 2010.
Lectures on agent incentive theory and causal influence diagrams. Human-aligned AI Summer School, 2019.
Tutorial on Universal Reinforcement Learning. AAMAS 2018. Tutorial info and slides.
AGI Safety and Understanding. Invited talk, AGI-17. Slides.
AI and the future -- Introduction to AI Safety. ANU Regnet, ANU Learning Communities, ANU XSA (XSAC talk), LessWrong Canberra, 2016 and 2017. Slides.
AI Safety -- Overview of recent models and results. Effective Altruism Sydney retreat, 2016. Slides.
Jarryd Martin, Master of Computing (ANU). Thesis: Optimism and Death in Reinforcement Learning. Papers: Count-Based Exploration in Feature Space (IJCAI-17) and Death and Suicide in Universal Artificial Intelligence (AGI-16).
Suraj Narayanan S, Master of Computing (ANU). Thesis: Exploration in Feature Space. Paper: Count-Based Exploration in Feature Space (IJCAI-17).
Tobias Wängberg and Mikael Böörs, Bachelor of Mathematics (LiU). Thesis: Classification by Decomposition: A Partitioning of the Space of 2x2 Symmetric Games. Paper: A Game-Theoretic Analysis of the Off-Switch Game (AGI-17).