
Survey of apprenticeship learning based on reward function approximation (Cited by: 2)
Abstract: This paper reviews the development of apprenticeship learning based on reward function approximation, surveys the main current work, and summarizes the general methodology of apprenticeship learning. Reward function estimation is discussed under both linear and nonlinear assumptions, and two well-known families of approximation methods, inverse reinforcement learning (IRL) and maximum margin planning (MMP), are compared. IRL-based apprenticeship learning is an iterative process that approximates the true reward function with a linear combination of basis reward functions, while MMP can be viewed as a family of gradient-based optimization methods. Combining techniques such as filtering and probabilistic policy representations can relax the requirement that the expert demonstrations be optimal. Finally, open problems in the field are identified and future research directions are proposed, such as apprenticeship learning under the partially observable Markov decision process (POMDP) framework and learning from nondeterministic policies.
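The IRL-based procedure described in the abstract, iteratively approximating the true reward with a linear combination of basis (feature) rewards until the learner matches the expert, can be sketched as follows. This is a toy illustration in the spirit of the projection variant of Abbeel and Ng's algorithm; the candidate-policy setup and all names are assumptions, not code from the surveyed papers:

```python
# Toy sketch of apprenticeship learning via IRL: the unknown reward is assumed
# linear in features, r(s) = w . phi(s), and the weight vector w is re-estimated
# each round from the gap between expert and learner feature expectations.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def apprenticeship_irl(mu_expert, candidate_mus, eps=1e-6, max_iter=100):
    """Match the expert's feature expectations mu_expert (projection variant).

    candidate_mus stands in for the RL inner loop: each entry is the
    feature-expectation vector of one available policy, and the "optimal
    policy under reward weights w" is the candidate maximizing w . mu.
    """
    mu_bar = list(candidate_mus[0])         # feature expectations matched so far
    for _ in range(max_iter):
        w = sub(mu_expert, mu_bar)          # reward weights = unmatched features
        if dot(w, w) ** 0.5 <= eps:         # expert matched within tolerance
            break
        mu = max(candidate_mus, key=lambda m: dot(w, m))  # best-response policy
        d = sub(mu, mu_bar)
        if dot(d, d) == 0:                  # no further progress possible
            break
        # Project mu_expert onto the segment [mu_bar, mu] and move there,
        # i.e. mix the new policy into the current policy mixture.
        alpha = max(0.0, min(1.0, dot(d, w) / dot(d, d)))
        mu_bar = [b + alpha * di for b, di in zip(mu_bar, d)]
    return mu_bar, sub(mu_expert, mu_bar)

# A 2-feature toy: an even mixture of the two pure policies reproduces
# the expert, so the residual gap shrinks to (near) zero.
mu_bar, residual = apprenticeship_irl([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```

The loop terminates when the residual `mu_expert - mu_bar` is small, at which point any reward that is linear in these features assigns the learner's policy mixture nearly the same value as the expert's.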
Source: Journal of Huazhong University of Science and Technology (Natural Science Edition), 2008, No. S1: 288-290, 294 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: Major Project of the Zhejiang Provincial Department of Science and Technology (2006c13096)
Keywords: apprenticeship learning; reward function; survey; inverse reinforcement learning; maximum margin planning

References (10)

  • 1 Ratliff N D, Bagnell J A, Zinkevich M A. Maximum margin planning. Proceedings of the 23rd International Conference on Machine Learning. 2006.
  • 2 Ng A Y, Russell S J. Algorithms for inverse reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning. 2000.
  • 3 Abbeel P, Ng A Y. Apprenticeship learning via inverse reinforcement learning. Proceedings of the Twenty-first International Conference on Machine Learning. 2004.
  • 4 Kolter J Z, Abbeel P, Ng A Y. Hierarchical apprenticeship learning with application to quadruped locomotion. Advances in Neural Information Processing Systems. 2008.
  • 5 Taskar B, Lacoste-Julien S, Jordan M. Structured prediction via the extragradient method. Proceedings of Neural Information Processing Systems. 2005.
  • 6 Abbeel P, Ng A Y. Exploration and apprenticeship learning in reinforcement learning. Proceedings of the 22nd International Conference on Machine Learning. 2005.
  • 7 Kolter J Z, Rodgers M P, Ng A Y. A complete control architecture for quadruped locomotion over rough terrain. Proceedings of the International Conference on Robotics and Automation. 2008.
  • 8 Rebula J R, Neuhaus P D, Bonnlander B V, et al. A controller for the LittleDog quadruped walking on rough terrain. IEEE International Conference on Robotics and Automation. 2007.
  • 9 Ratliff N, Bagnell J A, Srinivasa S. Imitation learning for locomotion and manipulation. CMU-RI-TR-07-45. 2007.
  • 10 Kaelbling L P, Littman M L, Moore A W. Reinforcement learning: a survey. Journal of Artificial Intelligence Research. 1996.


Citing papers: 2

Secondary citing papers: 3
