Algorithm of matching law based on optimal policy search model (优化策略模型下的匹配律算法)

Abstract: Using a policy search model based on the partially observable Markov decision process (POMDP), a policy search algorithm that yields optimal behavior is proposed, and a policy algorithm that satisfies the matching law is then derived from it. The subject adjusts a policy parameter to maximize the expected value of a target value function, and updates that parameter according to its past experience. Under the assumption that the subject's environment is Markovian, the policy search algorithm for optimal behavior is obtained by computing the gradient of the expected value of the value function. Theoretical analysis and simulation results show that if the policy parameter and the expected value of the value function are affected only by current experience, that is, if past choice behavior is assumed not to affect them, then a policy algorithm that conforms to the matching law can be derived from the policy algorithm for optimal behavior. The results reveal the relationship between matching behavior and the optimal policy search algorithm, and indicate that decision behavior satisfying the matching law is a class of suboptimal decision behavior.
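The gradient-following idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a generic REINFORCE-style update, assuming a two-option logistic policy with a single parameter `theta` and a stationary Bernoulli reward environment. The names `choice_prob`, `reinforce_step`, `simulate`, and the reward probabilities are all hypothetical choices made for illustration.

```python
import math
import random


def choice_prob(theta):
    """Probability of choosing option 0 under a logistic policy (assumed form)."""
    return 1.0 / (1.0 + math.exp(-theta))


def reinforce_step(theta, action, reward, baseline=0.0, lr=0.1):
    """One gradient-following update of the policy parameter.

    theta <- theta + lr * (reward - baseline) * d/dtheta log pi(action)
    For the logistic policy: d/dtheta log pi(0) = 1 - p, d/dtheta log pi(1) = -p.
    """
    p = choice_prob(theta)
    grad_log = (1.0 - p) if action == 0 else -p
    return theta + lr * (reward - baseline) * grad_log


def simulate(reward_probs=(0.7, 0.3), steps=5000, lr=0.05, seed=0):
    """Run the update on a stationary two-option task (illustrative assumption).

    Each update uses only the current choice and reward, so past choices
    do not enter the gradient estimate.
    """
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        action = 0 if rng.random() < choice_prob(theta) else 1
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        theta = reinforce_step(theta, action, reward, lr=lr)
    return choice_prob(theta)


if __name__ == "__main__":
    print("final probability of choosing option 0:", simulate())
```

In this stationary setting the update drifts toward the option with the higher reward probability. Studying matching behavior itself would need an environment in which reward availability depends on past choices, which this sketch deliberately leaves out.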

Authors: Cheng Zhenbo (程振波), Deng Zhidong (邓志东)

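Affiliations: State Key Laboratory of Intelligent Technology and Systems, Tsinghua University; Tsinghua National Laboratory for Information Science and Technology; Department of Computer Science and Technology, Tsinghua University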

Source: Journal of Southeast University (Natural Science Edition) (《东南大学学报（自然科学版）》), 2009, Issue S1, pp. 146-151 (6 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (grants 60621062, 60775040)

Keywords: partially observable Markov decision process; reinforcement learning; optimal policy search; matching law

Classification number: TP301.6 [Automation and Computer Technology: Computer System Architecture]


