Permutation based task transfer for genetic programming

Basic information

Grant number:
RGPIN-2015-06117
Principal investigator:
Heywood, Malcolm
Amount:
$13,100 (approx.)
Host institution:
Dalhousie University
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2019
Funding country:
Canada
Period:
2019-01-01 to 2020-12-31
Status:
Completed

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=676926
Keywords:
Permutation based task transfer genetic

Project abstract

The general context for this research proposal is genetic programming (GP) applied to learning decision-making policies for agents operating in environments with delayed payoff (that is, reinforcement learning). The specific focus of the proposal is a framework for systematically scaling GP to more difficult versions of delayed-payoff tasks than have previously been considered. In particular, we are interested in scenarios in which solutions to a simpler initial 'source' task are 'transferred' to a more difficult but related 'target' task, that is, a form of transfer learning. The insight of this work is to exploit the capability of GP to identify solutions that use only subsets of the state variables. This provides the basis for redeploying solutions discovered under the source task so that more difficult tasks can be solved. The potential benefits of this approach are: 1) policies need not be continually rediscovered from scratch; 2) increased success in, or better solutions to, the ultimate target task; and 3) lower computational overhead than solving each task independently.

Two specific target domains will be used to illustrate the approach: 1) learning policies to play soccer in the continuous-valued simulated 2D world of RoboCup keepaway; and 2) learning general policies for solving the 3 by 3 Rubik's cube. The keepaway soccer task is a benchmark for multi-agent learning, so it has a history of previous results and poses tasks of incrementally increasing difficulty. The Rubik's cube task has little history as a benchmark for learning algorithms of any form; instead, solutions have relied on some form of exhaustive search. Both tasks are examples of complex task domains with very large state spaces (the potential number of legal states) that nonetheless possess underlying regularities, specific instances of which GP should be able to discover. The basic hypothesis of this research is that once an instance of a context-dependent strategy has been identified for solving some subset of an initial task, it can serve as the basis for generalizing to many more instances of the task through a deterministic process of variation in the policy's references to the state variables. Naturally, for the process to scale, we need to avoid artificially introducing pathologies into the search. In short, a later policy needs to exceed the mere sum of the initial policies from which it is constructed.

Success in these objectives would provide a general framework for scaling GP to a wide range of tasks with delayed payoff. Such tasks are of widespread interest to the GP community because they are among the most expensive task domains, if not the most expensive, to which GP is applied. Moreover, the two task domains are widely acknowledged to be particularly challenging.
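The idea of varying a policy's references to the state variables can be illustrated with a toy sketch. This is not the project's algorithm, which the abstract does not specify. It assumes a hypothetical two-variable threshold policy (`make_policy`), a hand-made set of target cases, and exhaustive enumeration of index permutations as the "deterministic process of variation". All names and data below are illustrative assumptions.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Policy = Callable[[State], int]


def make_policy(refs: Tuple[int, int]) -> Policy:
    """Toy stand-in for an evolved program that reads only two state variables."""
    a, b = refs

    def policy(state: State) -> int:
        # Action 1 when the first referenced variable exceeds the second.
        return 1 if state[a] > state[b] else 0

    return policy


def remapped_references(n_refs: int, n_target_vars: int) -> List[Tuple[int, ...]]:
    """Deterministically list every ordered assignment of n_refs references
    to distinct target state variables (k-permutations)."""
    return list(permutations(range(n_target_vars), n_refs))


def accuracy(policy: Policy, cases: Sequence[Tuple[State, int]]) -> float:
    """Fraction of cases where the policy picks the desired action."""
    return sum(policy(s) == want for s, want in cases) / len(cases)


# Hypothetical target task with four state variables; the desired action
# depends only on variables 2 and 3 (unknown to the transfer procedure).
target_cases = [
    ((0.0, 1.0, 5.0, 2.0), 1),
    ((1.0, 0.0, 1.0, 4.0), 0),
    ((3.5, 3.0, 2.0, 1.0), 1),
    ((2.0, 1.0, 0.5, 0.7), 0),
]

source_refs = (0, 1)  # references used by the policy found on the source task
candidates = remapped_references(len(source_refs), n_target_vars=4)
# Keep the remapping whose policy scores best on the target cases.
best_refs = max(candidates, key=lambda r: accuracy(make_policy(r), target_cases))
```

In this sketch the policy logic is reused unchanged and only its variable references are searched, so the search covers 12 candidate remappings instead of the space of all programs. Exhaustive enumeration is practical only for a small number of references; a realistic system would need a more selective scheme.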