Semi-Markov adaptive critic heuristics with application to airline revenue management (cited by: 1)

Abstract: The adaptive critic heuristic has been a popular algorithm in reinforcement learning (RL) and approximate dynamic programming (ADP) alike. It is one of the first RL and ADP algorithms. RL and ADP algorithms are particularly useful for solving Markov decision processes (MDPs) that suffer from the curses of dimensionality and modeling. Many real-world problems, however, tend to be semi-Markov decision processes (SMDPs), in which the time spent in each transition of the underlying Markov chains is itself a random variable. Unfortunately, for the average reward case, unlike the discounted reward case, the MDP does not have an easy extension to the SMDP. Examples of SMDPs can be found in supply chain management, maintenance management, and airline revenue management. In this paper, we propose an adaptive critic heuristic for the SMDP under the long-run average reward criterion. We present a convergence analysis of the algorithm, which shows that under certain mild conditions, which can be ensured within a simulator, the algorithm converges to an optimal solution with probability 1. We test the algorithm extensively on an airline revenue management problem in which the manager has to set prices for airline tickets over the booking horizon. The problem is large in scale and suffers from the curse of dimensionality, and hence it is difficult to solve via classical methods of dynamic programming. Our numerical results are encouraging and show that the algorithm outperforms an existing heuristic used widely in the airline industry.
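To illustrate why random transition times matter under the long-run average reward criterion, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a hypothetical two-state SMDP under a fixed policy, with made-up transition probabilities, rewards, and exponentially distributed sojourn times. Under those assumptions the average reward is the ratio of expected reward per transition to expected time per transition (weighted by the stationary distribution of the embedded chain), and the sketch checks that ratio against a simulation.

```python
import random

# Hypothetical two-state SMDP under a fixed policy (all numbers are illustrative).
P = [[0.3, 0.7],          # embedded-chain transition probabilities
     [0.6, 0.4]]
REWARD = [5.0, 1.0]       # expected reward earned on each visit to a state
MEAN_TIME = [2.0, 0.5]    # expected sojourn time on each visit to a state


def stationary_two_state(p):
    """Stationary distribution of a two-state embedded Markov chain."""
    a, b = p[0][1], p[1][0]   # probabilities of leaving state 0 and state 1
    return [b / (a + b), a / (a + b)]


def exact_average_reward(p, reward, mean_time):
    """Ratio of expected reward per transition to expected time per transition."""
    pi = stationary_two_state(p)
    num = sum(pi[i] * reward[i] for i in range(2))
    den = sum(pi[i] * mean_time[i] for i in range(2))
    return num / den


def simulated_average_reward(p, reward, mean_time, steps=200_000, seed=0):
    """Estimate the average reward per unit time by simulating the SMDP."""
    rng = random.Random(seed)
    state, total_reward, total_time = 0, 0.0, 0.0
    for _ in range(steps):
        total_reward += reward[state]
        # Sojourn time is random; an exponential law is assumed here.
        total_time += rng.expovariate(1.0 / mean_time[state])
        state = 0 if rng.random() < p[state][0] else 1
    return total_reward / total_time
```

Calling `exact_average_reward(P, REWARD, MEAN_TIME)` and `simulated_average_reward(P, REWARD, MEAN_TIME)` should give closely matching values. Dividing total reward by the number of transitions instead of by elapsed time would generally give a different number, which is one way to see why the average reward MDP formulation does not carry over directly to the SMDP.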

Authors: Ketaki KULKARNI, Abhijit GOSAVI, Susan MURRAY, Katie GRANTHAM

Affiliation: Department of Engineering Management and Systems Engineering

Source: 控制理论与应用（英文版） (Journal of Control Theory and Applications), EI-indexed, 2011, No. 3, pp. 421-430 (10 pages)

Funding: Supported by the National Science Foundation (No. ECS0841055)

Keywords: Adaptive critics; Actor critics; Semi-Markov; Approximate dynamic programming; Reinforcement learning

Classification: O211.62 [Science: Probability Theory and Mathematical Statistics]
