A Policy- and Value-Iteration Algorithm for POMDP

Cited by: 7

Abstract: A partially observable Markov decision process (POMDP) converts a non-Markovian problem into a Markovian one by introducing a belief state space. Its ability to describe real-world conditions has made it an important branch of research on stochastic decision processes. The traditional algorithms for solving POMDPs are value iteration and policy iteration. However, exact solution algorithms for POMDPs are typically computationally intractable for all but the smallest problems. The authors first describe the principles and decision process of POMDPs, and then present a policy- and value-iteration algorithm (PVIA) for POMDPs. The algorithm combines the advantages of both methods, using policy iteration in the programming stage and value iteration in the computing stage. Drawing on linear programming and dynamic programming, it addresses the curse of dimensionality that arises when the belief state space is large, and obtains an approximately optimal solution. A key contribution of the paper is to show how the basic operations of both algorithms can be performed efficiently together. The algorithm was implemented for the SZPT_Roc team, which took 2nd place in the simulation league of the RoboCup 2006 Chinese Open Championship. Finally, a comparison with some typical algorithms shows experimentally that the algorithm is practical and feasible.
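The belief-state idea mentioned in the abstract can be illustrated with the standard Bayesian belief update, b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s). The sketch below is not the paper's PVIA algorithm, which this record does not spell out; the function name `belief_update`, the nested-list layout of `T` and `O`, and the two-state toy numbers are all assumptions made for illustration.

```python
from typing import List

Matrix3 = List[List[List[float]]]


def belief_update(belief: List[float], action: int, observation: int,
                  T: Matrix3, O: Matrix3) -> List[float]:
    """Return the posterior belief after taking `action` and seeing `observation`.

    Assumed (hypothetical) layout:
      T[a][s][s2] = probability of moving from state s to s2 under action a
      O[a][s2][o] = probability of observing o after landing in s2 under action a
    """
    n = len(belief)
    # Prediction step: push the current belief through the transition model.
    predicted = [sum(belief[s] * T[action][s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correction step: weight each successor state by the observation likelihood.
    unnormalized = [O[action][s2][observation] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


# Toy model (illustrative numbers only): one action that leaves the hidden
# state unchanged and reports it correctly with probability 0.85.
T_TOY = [[[1.0, 0.0], [0.0, 1.0]]]
O_TOY = [[[0.85, 0.15], [0.15, 0.85]]]

if __name__ == "__main__":
    print(belief_update([0.5, 0.5], 0, 0, T_TOY, O_TOY))
```

Because the updated belief depends only on the previous belief, the action, and the observation, the process over beliefs is Markovian even though the process over raw observations is not. The cost is that the belief space is continuous, which is where the curse of dimensionality noted in the abstract comes from.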

Authors: Sun Yong (孙湧), Wu Bo (仵博), Feng Yanpeng (冯延蓬)

Affiliation: School of Electronics and Information Engineering, Shenzhen Polytechnic

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2008, No. 10, pp. 1763-1768 (6 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

Keywords: POMDP (partially observable Markov decision process); decision algorithm; agent; value iteration; policy iteration

Classification code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
