Dynamic Merge Reinforcement Learning Algorithm for Solving POMDP (求解POMDP的动态合并激励学习算法)

Cited by: 1

Abstract: Modeling reinforcement learning problems as a POMDP is well suited to, and effective for, problems with large state spaces. However, solving a POMDP is far harder than solving an ordinary Markov decision process (MDP), so many problems remain open. Against this background, this paper proposes a method for solving a POMDP under certain special constraints: the dynamic merge reinforcement learning algorithm. The method adopts the concept of regions and builds a regional system on the state space of the environment. The agent searches for its optimal sub-goal on each region of the regional system independently and in parallel, which speeds up the computation. The optimal value functions of the individual regions are then merged in a certain way to obtain a global optimal solution of the POMDP.
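The region idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not give the constraints, the region construction, or the merge rule. As a simplifying assumption, the example uses a fully observable toy chain MDP instead of a POMDP over belief states, and the merge rule is a plain block iteration (solve every region with the other regions' values held fixed, merge, repeat). All names (`step`, `solve_region`, `merged_solution`, the chain layout, the discount 0.9) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

GAMMA = 0.9           # discount factor (assumed for illustration)
N_STATES = 6          # states 0..5 on a line; state 5 is the goal
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    if state == N_STATES - 1:          # the goal is absorbing, no more reward
        return state, 0.0
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, 1.0 if nxt == N_STATES - 1 else 0.0


def backup(state, values):
    """One Bellman optimality backup for a single state."""
    outcomes = (step(state, a) for a in ACTIONS)
    return max(r + GAMMA * values[s2] for s2, r in outcomes)


def solve_region(region, values, tol=1e-10):
    """Value iteration on one region; values outside the region stay fixed."""
    local = dict(values)
    while True:
        delta = 0.0
        for s in region:
            new = backup(s, local)
            delta = max(delta, abs(new - local[s]))
            local[s] = new
        if delta < tol:
            return {s: local[s] for s in region}


def merged_solution(regions, tol=1e-10):
    """Solve all regions independently, merge, repeat until nothing changes."""
    values = {s: 0.0 for s in range(N_STATES)}
    while True:
        # Threads only show that the per-region solves are independent.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(lambda reg: solve_region(reg, values), regions))
        merged = dict(values)
        for part in parts:
            merged.update(part)
        if max(abs(merged[s] - values[s]) for s in merged) < tol:
            return merged
        values = merged


def global_solution(tol=1e-10):
    """Reference: ordinary value iteration over the whole state space."""
    start = {s: 0.0 for s in range(N_STATES)}
    return solve_region(range(N_STATES), start, tol)


if __name__ == "__main__":
    regions = [[0, 1, 2], [3, 4, 5]]
    print(merged_solution(regions))
    print(global_solution())
```

On this toy problem the merged region values agree with ordinary value iteration over the whole state space. The repeated merge step is needed because one region's optimal values depend on the values at the boundary of its neighbor.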

Authors: Yin Changming (殷苌茗), Wang Hanxing (王汉兴), Chen Huanwen (陈焕文), Xie Lijuan (谢丽娟)

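Affiliations: School of Science, Shanghai University; School of Computer and Communication Engineering, Changsha University of Science and Technology

Source: Computer Engineering (《计算机工程》), 2005, No. 22, pp. 4-6, 148 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (Grant No. 60075019)

Keywords: partially observable Markov decision process; reinforcement learning; dynamic merge; belief state

Classification: TP182 [Automation and Computer Technology: Control Theory and Control Engineering]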
