Tutorial on Universal Reinforcement Learning (AAMAS-18)

Room K22, 2pm–6pm, July 10, 2018

Abstract

Reinforcement learning (RL) is typically formalized as a Markov decision process (MDP). MDPs provide a powerful formalism, but make several restrictive assumptions. Their Markov assumption breaks if the agent interacts with other (learning) agents. Most theoretical results on MDPs further require that the agent can always recover from bad actions, a property known as ergodicity. Finally, it is often assumed that the number of states is finite.

Perhaps surprisingly, a rich theory can be developed for the (almost) completely general case where the environment need not be Markov, ergodic, fully observable, nor have a finite number of states. We call this setting Universal Reinforcement Learning (URL)¹. URL is as general as a POMDP with an infinite number of states and unknown observation probabilities, but offers a significantly simpler formalism.
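As a rough illustration of this setting (a minimal sketch, not code from the tutorial or from AIXIjs), the Python snippet below models an environment as a function of the full interaction history rather than of a Markov state, and learns by Bayesian updating over a small hypothesis class. All names (`markov_env`, `history_env`, `update_posterior`, `run`) and the two toy environments are hypothetical, and the two-element class stands in for the much larger classes considered in the URL literature.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[int, int]]  # sequence of (action, percept) pairs
# An environment maps (history so far, current action) to a
# probability distribution over the next percept.
Environment = Callable[[History, int], Dict[int, float]]


def markov_env(history: History, action: int) -> Dict[int, float]:
    """Toy hypothesis: the percept depends only on the current action."""
    return {action: 0.9, 1 - action: 0.1}


def history_env(history: History, action: int) -> Dict[int, float]:
    """Toy non-Markov hypothesis: the percept depends on the parity of
    how many times action 1 has been taken over the whole history."""
    ones = sum(a for a, _ in history) + action
    likely = ones % 2
    return {likely: 0.9, 1 - likely: 0.1}


def update_posterior(
    weights: List[float],
    envs: List[Environment],
    history: History,
    action: int,
    percept: int,
) -> List[float]:
    """Bayes' rule: reweight each hypothesis by the probability it
    assigned to the percept that was actually observed."""
    unnorm = [
        w * env(history, action).get(percept, 0.0)
        for w, env in zip(weights, envs)
    ]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("observed percept has zero probability under all hypotheses")
    return [u / total for u in unnorm]


def run() -> List[float]:
    envs = [markov_env, history_env]
    weights = [0.5, 0.5]  # uniform prior (an assumption of this sketch)
    history: History = []
    # The true environment is history_env; for determinism we feed the
    # agent its most likely percept at every step.
    for action in [1, 1, 1, 1]:
        dist = history_env(history, action)
        percept = max(dist, key=dist.get)
        weights = update_posterior(weights, envs, history, action, percept)
        history.append((action, percept))
    return weights


if __name__ == "__main__":
    print(run())
```

In this sketch the posterior weight shifts toward `history_env` whenever the two hypotheses disagree about the next percept. A full URL agent would additionally plan its actions against the mixture instead of following a fixed action sequence.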

URL theory provides a foundation for AI, a formal measure of intelligence, inspiration for inventing new practical algorithms and for analyzing existing ones, deeper insights into exploration vs. exploitation, and high-level predictions about the behavior of future, powerful AI algorithms.

Outline

The tutorial will be divided into three blocks:

Block 1: The URL framework
2:00–2:45pm
A gentle introduction to the general URL framework, how to learn and plan in general unknown environments, and the theoretically optimal agent AIXI (Hutter 2005).
Slides
Block 2: Hands-on Examples in the AIXIjs Framework
3:00–4:30pm (with coffee break 3:00–3:30)
AIXIjs is a recently developed web framework, illustrating concepts from the URL framework in gridworld settings (Aslanides and Leike, 2017). This block will make the theory more concrete, and make connections to various practical/deep learning algorithms.
Slides
Block 3: AI Safety Implications
4:45–5:45pm
One great benefit of a formal theory of intelligence is that it can be used to make high-level predictions about very intelligent systems (Everitt, 2018). Thus, it provides a formal grounding for AI safety research.
Slides

This tutorial will extend our AGI-16 tutorial with game-theoretic implications and demonstrations from the recent AIXIjs framework (Aslanides and Leike 2017). The AGI tutorial has been published as a book chapter (Everitt and Hutter, 2018), and slides and video are available online.

Organizers

Tom Everitt is a Research Scientist at DeepMind. He has published extensively on AI safety results in the URL framework (Everitt and Hutter, 2018a,b; Everitt, Leike, and Hutter 2015; Everitt et al. 2016; Everitt and Hutter 2016; Martin, Everitt, and Hutter 2016; Everitt et al. 2017), including two award-winning papers and a PhD thesis (Everitt, 2018). He has also published on the foundations of AIXI (Everitt, Lattimore, and Hutter 2014; Alpcan, Everitt, and Hutter 2014), including a book chapter on general reinforcement learning and universal artificial intelligence (Everitt and Hutter 2018). He has also tutored the ANU course on URL and AIXI.

Website: http://www.tomeveritt.se
Email: tomeveritt@google.com

John Aslanides is a Research Engineer at DeepMind. His research interests primarily relate to the problem of efficient exploration in reinforcement learning. Prior to joining DeepMind he obtained an MSc in AI under Jan Leike and Marcus Hutter at the Australian National University, and prior to this a BSc in physics, also at the ANU. He has also worked in startups and ML & tech consulting in Australia. He is the main developer of the AIXIjs framework (Aslanides et al., 2017; Aslanides and Leike, 2017).

Marcus Hutter is a professor of Artificial Intelligence at the Australian National University (ANU), and head of the ANU Intelligent Agents group.

Prof. Marcus Hutter is the author of the book Universal Artificial Intelligence (Hutter 2005), and inventor of the GRL framework and the AIXI agent. He has supervised several successful PhD students on both practical and theoretical aspects of GRL (Lattimore 2013; Nguyen 2013; Daswani 2015; Leike 2016), and has a long list of publications on GRL-related topics (see Hutter 2005; Hutter 2012; Everitt and Hutter 2018 for surveys). He also teaches the ANU course Advanced Topics in Artificial Intelligence: Foundations of AI, which covers GRL and AIXI.

Website: http://www.hutter1.net
Email: marcus.hutter@anu.edu.au

References

Alpcan, Tansu, Tom Everitt, and Marcus Hutter. 2014. “Can we measure the difficulty of an optimization problem?” In IEEE Information Theory Workshop.

Aslanides, John, and Jan Leike. 2017. “AIXIjs.” http://www.hutter1.net/aixijs/.

Aslanides, John, Jan Leike, and Marcus Hutter. 2017. “Universal Reinforcement Learning Algorithms: Survey & Experiments.” In IJCAI.

Daswani, Mayank. 2015. “Generic Reinforcement Learning Beyond Small MDPs.” PhD thesis, Research School of Computer Science, The Australian National University.

Dewey, Daniel. 2011. “Learning What to Value.” In Artificial General Intelligence, 6830:309–14.

Everitt, Tom. 2018. Towards Safe Artificial General Intelligence. PhD thesis, Australian National University. http://www.tomeveritt.se/papers/2018-thesis.pdf

Everitt, Tom, and Marcus Hutter. 2016. “Avoiding Wireheading with Value Reinforcement Learning.” In Artificial General Intelligence, 12–22. Springer.

———. 2018. “Universal Artificial Intelligence: Practical Agents and Fundamental Challenges.” In Foundations of Trusted Autonomy. Springer. http://www.tomeveritt.se/papers/UAI-book-chapter.pdf.

———. 2018a. “AGI Safety Literature Review.” In International Joint Conference on Artificial Intelligence, IJCAI. arXiv preprint 1805.01109.

———. 2018b. “The Alignment Problem for Bayesian History-Based Reinforcement Learning.” Submitted. http://www.tomeveritt.se/papers/alignment.pdf.

Everitt, Tom, Daniel Filan, Mayank Daswani, and Marcus Hutter. 2016. “Self-modification of Policy and Utility Function in Rational Agents.” In Artificial General Intelligence, 1–11. Springer.

Everitt, Tom, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. 2017. “Reinforcement Learning with a Corrupted Reward Signal.” In International Joint Conference on Artificial Intelligence, IJCAI, 4705–13.

Everitt, Tom, Tor Lattimore, and Marcus Hutter. 2014. “Free Lunch for Optimisation under the Universal Distribution.” In Proceedings of IEEE Congress on Evolutionary Computation (CEC), 167–74. IEEE.

Everitt, Tom, Jan Leike, and Marcus Hutter. 2015. “Sequential Extensions of Causal and Evidential Decision Theory.” In Algorithmic Decision Theory, edited by Toby Walsh, 205–21. Springer.

Hausknecht, Matthew, and Peter Stone. 2015. “Deep Recurrent Q-Learning for Partially Observable MDPs.” arXiv Preprint arXiv:1507.06527, 29–37. http://arxiv.org/abs/1507.06527.

Hutter, Marcus. 2005. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Lecture Notes in Artificial Intelligence (LNAI 2167). IDSIA; Springer. doi:10.1007/b138233.

———. 2009a. “Feature Dynamic Bayesian Networks.” In Proceedings of the 2nd Conference on Artificial General Intelligence, 1:7. http://arxiv.org/abs/0812.4581.

———. 2009b. “Feature Reinforcement Learning: Part I: Unstructured MDPs.” arXiv preprint arXiv:0906.1713 1: 3–24. http://arxiv.org/abs/0906.1713.

———. 2012. “One Decade of Universal Artificial Intelligence.” In Theoretical Foundations of Artificial General Intelligence, edited by Pei Wang and Ben Goertzel, 4:67–88. Springer. http://arxiv.org/abs/1202.6153.

———. 2014. “Extreme state aggregation beyond MDPs.” In Algorithmic Learning Theory, 185–99. Springer.

Lattimore, Tor. 2013. “Theory of General Reinforcement Learning.” PhD thesis, Australian National University.

Lattimore, Tor, and Marcus Hutter. 2011a. “Asymptotically Optimal Agents.” In Algorithmic Learning Theory, 368–82.

———. 2011b. “No Free Lunch versus Occam’s Razor in Supervised Learning.” In Proc. Solomonoff 85th Memorial Conference. Melbourne, Australia: Springer. http://arxiv.org/abs/1111.3846.

Leike, Jan. 2016. “Nonparametric General Reinforcement Learning.” PhD thesis, Australian National University.

Leike, Jan, and Marcus Hutter. 2015. “Bad Universal Priors and Notions of Optimality.” In Conference on Learning Theory, 40:1–16.

Leike, Jan, Tor Lattimore, Laurent Orseau, and Marcus Hutter. 2016. “Thompson Sampling is Asymptotically Optimal in General Environments.” In Uncertainty in Artificial Intelligence (UAI).

Leike, Jan, Jessica Taylor, and Benya Fallenstein. 2016. “A Formal Solution to the Grain of Truth Problem.” In Uncertainty in Artificial Intelligence (UAI).

Martin, Jarryd, Tom Everitt, and Marcus Hutter. 2016. “Death and Suicide in Universal Artificial Intelligence.” In Artificial General Intelligence, 23–32. Springer. http://arxiv.org/abs/1606.00652.

Nguyen, Phuong. 2013. “Feature Reinforcement Learning Agents.” PhD thesis, Australian National University.

Orseau, Laurent, and Stuart Armstrong. 2016. “Safely interruptible agents.” In 32nd Conference on Uncertainty in Artificial Intelligence.

Orseau, Laurent, and Mark Ring. 2011. “Self-modification and mortality in artificial agents.” In Artificial General Intelligence, 1–10.

———. 2012. “Space-time embedded intelligence.” In Artificial General Intelligence, 209–18.

Ring, Mark, and Laurent Orseau. 2011. “Delusion, Survival, and Intelligent Agents.” In Artificial General Intelligence, 11–20. Springer Berlin Heidelberg.

Sutton, Richard S, and Andrew G Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.

Veness, Joel, Marc G Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Desjardins. 2015. “Compress and Control.” In AAAI-15, 3016–3023. AAAI Press. http://arxiv.org/abs/1411.5326.

Veness, Joel, Kee Siong Ng, Marcus Hutter, William Uther, and David Silver. 2011. “A Monte-Carlo AIXI approximation.” Journal of Artificial Intelligence Research 40: 95–142.

¹ Also known as general reinforcement learning (Lattimore and Hutter 2011b; Leike 2016) and universal artificial intelligence (Hutter, 2005; Everitt and Hutter, 2018).