Abstract
This paper surveys apprenticeship learning based on reward-function learning, reviewing both the historical development of the field and a broad selection of current work. Methods are discussed under linear and nonlinear assumptions on the reward function. Under the linear assumption, two families of methods are compared: apprenticeship learning based on inverse reinforcement learning (IRL) and on maximum margin planning (MMP). The former admits an efficient approximate algorithm but makes a strong assumption about the optimality of the demonstrations; the latter takes a form that is easier to extend but is computationally expensive. Finally, open problems and directions for future research are identified, such as applying apprenticeship learning in partially observable Markov decision process (POMDP) environments and in continuous, high-dimensional spaces, using approximate algorithms such as point-based value iteration (PBVI), or extracting learned features with dimensionality-reduction methods such as principal component analysis (PCA), to alleviate the heavy computation caused by the curse of dimensionality.
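To make the comparison concrete, the two objectives can be sketched as follows under the linear assumption. The notation here (φ for state features, μ for expected feature counts, F_i and ℓ_i for the feature matrix and margin-scaling loss vector of the i-th demonstration) is a simplified gloss following common conventions in the IRL and MMP literature, not this paper's own formulation.

% Shared linear assumption: reward is a weighted sum of state features.
R(s) = w^{\top}\phi(s), \qquad \|w\| \le 1

% IRL-based apprenticeship learning: match the expert's discounted
% feature expectations \mu_E estimated from the demonstrations.
\mu(\pi) = \mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_t)\,\middle|\,\pi\right],
\qquad \text{find } \tilde{\pi} \text{ with } \|\mu(\tilde{\pi})-\mu_{E}\| \le \varepsilon

% MMP (simplified; per-example weights and loss exponents omitted):
% the expert path \mu_i must beat every alternative path in the
% planning graph \mathcal{G}_i by a loss-scaled margin.
\min_{w}\;\frac{\lambda}{2}\|w\|^{2}
 + \sum_{i}\Bigl(\max_{\mu\in\mathcal{G}_{i}}\bigl(w^{\top}F_{i}+\ell_{i}^{\top}\bigr)\mu
 - w^{\top}F_{i}\,\mu_{i}\Bigr)

The ε-guarantee of the first objective holds only if the demonstrations are near-optimal for some such w, which is the strong assumption noted above; the inner maximization of the second objective requires solving a full planning problem at every update, which is the source of its computational cost.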
Source
《智能系统学报》 (CAAI Transactions on Intelligent Systems)
2009, No. 3, pp. 208-212 (5 pages)
Funding
National Natural Science Foundation of China (90820306)
Major Project of the Science and Technology Department of Zhejiang Province (006c13096)
Keywords
apprenticeship learning
reward function
inverse reinforcement learning
maximum margin planning
About the authors
金卓军, male, born in 1984, is a Ph.D. candidate. His main research interest is machine learning.
Corresponding author: 钱徽 (Qian Hui). E-mail: qianhui@zju.edu.cn. 钱徽, male, born in 1974, is an associate professor and a member of the Intelligent Robotics Committee of the Chinese Association for Artificial Intelligence. His main research interests are artificial intelligence and computer vision.
陈沈轶, male, born in 1980, is a Ph.D. candidate. His main research interest is machine learning.