Hierarchical reinforcement learning in large-scale domains

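Basic information

Grant number:
2120604
Principal investigator:
Amount:
--
Host institution:
University of Bath
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2018
Funding country:
United Kingdom
Start and end dates:
2018 to (no data)
Project status:
Completed

Source:
https://gtr.ukri.org/projects?ref=studentship-2120604
Keywords:
Hierarchical reinforcement learning large scale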

Project abstract

Temporal abstraction is valuable for an intelligent system. Representing knowledge over multiple timescales provides a means of partitioning state space, which can accelerate learning and allow behaviour to be transferred to different tasks. Humans constantly plan and behave using temporally extended actions, breaking any particular task down into a sequence of salient waypoints, or subgoals. Hierarchical reinforcement learning rests upon a set of theoretically sound approaches for learning and planning using temporally extended actions [1], [2], [3]. Although we have richly expressive frameworks for utilising a given hierarchy of actions, a major remaining problem is how to autonomously discover the hierarchical structure of a given domain. This problem is known as skill discovery. There exist many good approaches to skill discovery, some based on graph theory, some on mining the trajectories of a reinforcement learning agent's experience, and others on gradient-based, end-to-end optimisation. However, the current methods are not immediately compatible with all types of problems, and have not been demonstrated to scale well.

Rubik's cube is an iconic puzzle with a reputation for being difficult to solve without prior knowledge. There are currently no solutions that use reinforcement learning starting from an arbitrary scrambled state. An obvious element of the problem is the need to simultaneously satisfy competing objectives, i.e. correctly placing the different pieces of the cube. Another key source of difficulty is the property of non-serialisable subgoals: when using a sequence of subgoals to arrive at the solution, some previous subgoals must be temporarily violated before reaching further ones. Whilst it is known that Rubik's cube can be solved in 20 moves or fewer from any of its 43 quintillion states, 'cubists' who can solve the cube typically use many more moves by employing a variety of macro operators [4].

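The figure of 43 quintillion states quoted above can be checked with a short calculation. The sketch below is not part of the project record; it uses the standard counting argument for the 3x3x3 cube (corner and edge permutations and orientations, halved by a parity constraint), and the function name is a hypothetical one chosen for illustration.

```python
from math import factorial


def rubiks_cube_state_count() -> int:
    """Count the reachable configurations of a standard 3x3x3 Rubik's cube."""
    corner_positions = factorial(8)    # 8 corner pieces can be permuted
    corner_orientations = 3 ** 7       # the last corner's twist is forced by the other 7
    edge_positions = factorial(12)     # 12 edge pieces can be permuted
    edge_orientations = 2 ** 11        # the last edge's flip is forced by the other 11
    # Corner and edge permutations must share the same parity, which halves the total.
    return (corner_positions * corner_orientations
            * edge_positions * edge_orientations) // 2


if __name__ == "__main__":
    # Prints the count with thousands separators: roughly 4.3 x 10^19.
    print(f"{rubiks_cube_state_count():,}")
```

A state space of this size is far too large to enumerate in a lookup table, which is one reason the abstract later stresses function approximation.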
The macro operators used by cubists leave part of a state invariant to their effects, which allows cubists to manipulate only certain parts of Rubik's cube during each stage of their solve. This research will focus on the question of how a reinforcement learning agent may learn a hierarchical policy for Rubik's cube. Preliminary work has identified a key property of the state space. Possible future directions include discovering macro operators from direct experience, developing ways to restrict initiation sets, and exploiting symmetries of the problem. Careful consideration will be needed to design effective methods of function approximation, both at the top level of control and for the temporally extended actions. Beyond Rubik's cube, there are many permutation puzzles that could also be solved using the methods this research will create. More generally, combinatorial optimisation problems are widespread throughout science and engineering, and are increasingly being addressed using reinforcement learning [5]. The aim is to incorporate methods arising from this research into this wider body of work.

[1] Parr, R., and Russell, S. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems: Proceedings of the 10th Conference, Denver. Cambridge, MA: MIT Press.
[2] Sutton, R. S., Precup, D., and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, pp. 181-211.
[3] Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, pp. 227-303.
[4] Korf, R. 1985. Macro-operators: A weak method for learning. Artificial Intelligence, 26, pp. 35-77.
[5] Yanjun, L., Hengtong, K., Ketian, Y., Shuyu, Y., and Xiaolin, L. 2018. FoldingZero: Protein Folding from Scratch in Hydrophobic-Polar Model. In Advances in Neural Information Processing Systems.
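The invariance property of macro operators described in the abstract can be illustrated on a much smaller permutation puzzle. The following is a minimal sketch under stated assumptions, not the project's method: the seven-position puzzle, the two primitive moves `A` and `B`, and all function names are hypothetical. Each primitive move cycles four positions, but the commutator macro (A, B, A inverse, B inverse) disturbs only three positions and leaves the other four untouched.

```python
from typing import Sequence, Set, Tuple

State = Tuple[object, ...]
Move = Tuple[int, ...]  # move[i] is the position whose token lands at position i


def cycle_move(n: int, cycle: Sequence[int]) -> Move:
    """Build a move on n positions that shifts tokens one step along `cycle`."""
    sources = list(range(n))
    for k, src in enumerate(cycle):
        dst = cycle[(k + 1) % len(cycle)]
        sources[dst] = src
    return tuple(sources)


def inverse(move: Move) -> Move:
    """Return the move that undoes `move`."""
    inv = [0] * len(move)
    for i, src in enumerate(move):
        inv[src] = i
    return tuple(inv)


def apply(state: State, move: Move) -> State:
    """Apply a single primitive move to a state."""
    return tuple(state[src] for src in move)


def run_macro(state: State, macro: Sequence[Move]) -> State:
    """Apply a fixed sequence of primitive moves (a macro operator)."""
    for move in macro:
        state = apply(state, move)
    return state


def invariant_positions(macro: Sequence[Move], n: int) -> Set[int]:
    """Positions whose contents are unchanged by the macro, for any state."""
    start = tuple(range(n))  # distinct tokens expose the net permutation
    end = run_macro(start, macro)
    return {i for i in range(n) if end[i] == start[i]}


# Hypothetical toy puzzle: two overlapping 4-cycles sharing position 3.
N = 7
A = cycle_move(N, (0, 1, 2, 3))
B = cycle_move(N, (3, 4, 5, 6))
MACRO = (A, B, inverse(A), inverse(B))  # commutator of A and B

if __name__ == "__main__":
    print("positions left invariant by A:    ", sorted(invariant_positions((A,), N)))
    print("positions left invariant by macro:", sorted(invariant_positions(MACRO, N)))
```

In this toy setting the macro is longer than any primitive move yet has a more local effect, which mirrors why cubists accept longer solutions in exchange for operators that preserve already-solved parts of the cube.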
