Abstract
When solving sequential decision-making problems in continuous spaces, traditional actor-critic (AC) methods often suffer from slow convergence and poor convergence quality. To overcome this weakness, an AC algorithm framework based on symmetric perturbation sampling is proposed, which uses a Gaussian distribution as the policy distribution. At each time step, the framework generates two actions by symmetrically perturbing the current action mean and takes both of them to interact with the environment in parallel. It then selects the agent's behavior action and updates the value-function parameters based on the maximum temporal-difference (TD) error of the two actions, and updates the policy parameters based on their average regular gradient or average incremental natural gradient. Theoretical analysis and simulation results show that the proposed framework achieves both good convergence performance and high computational efficiency.
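As a rough illustration of the procedure the abstract describes, the following is a minimal Python sketch of one time step, assuming linear function approximation, a scalar action, and two synchronized environment copies exposing a hypothetical step(action) -> (next_state, reward) interface. All names (phi, sp_ac_step, env_plus, env_minus) are illustrative assumptions, not the paper's implementation; "maximum TD error" is read here as the larger absolute TD error, and the actor update shown is the average regular-gradient variant (the incremental natural-gradient variant is omitted).

```python
import numpy as np

def phi(s):
    """Illustrative feature map: raw state vector plus a bias term."""
    return np.append(np.asarray(s, dtype=float), 1.0)

def sp_ac_step(env_plus, env_minus, s, theta, w, sigma,
               alpha_theta, alpha_w, gamma):
    """One time step of the symmetric-perturbation AC framework (sketch).

    Assumed linear approximation: critic V(s) = w @ phi(s); policy mean
    mu(s) = theta @ phi(s) for a Gaussian behavior policy N(mu, sigma^2).
    env_plus / env_minus stand for two synchronized environment copies.
    """
    f = phi(s)
    mu = float(theta @ f)          # scalar action mean for simplicity

    # Symmetric perturbation of the current action mean -> two actions.
    delta = sigma * abs(np.random.randn())
    a_plus, a_minus = mu + delta, mu - delta

    # The two actions interact with the environment in parallel.
    s_plus, r_plus = env_plus.step(a_plus)
    s_minus, r_minus = env_minus.step(a_minus)

    # TD errors of the two resulting transitions.
    v = float(w @ f)
    td_plus = r_plus + gamma * float(w @ phi(s_plus)) - v
    td_minus = r_minus + gamma * float(w @ phi(s_minus)) - v

    # Behavior action: the one with the maximal TD error (read here as the
    # larger absolute TD error); the critic is updated with that error.
    if abs(td_plus) >= abs(td_minus):
        td_max, behavior_action, s_next = td_plus, a_plus, s_plus
    else:
        td_max, behavior_action, s_next = td_minus, a_minus, s_minus
    w += alpha_w * td_max * f

    # Actor: average of the two regular policy gradients; for a Gaussian
    # policy, grad_theta log pi(a|s) = ((a - mu) / sigma**2) * phi(s).
    g_plus = td_plus * ((a_plus - mu) / sigma**2) * f
    g_minus = td_minus * ((a_minus - mu) / sigma**2) * f
    theta += alpha_theta * 0.5 * (g_plus + g_minus)

    return behavior_action, s_next
```

Under these assumptions, the symmetric pair costs one extra environment interaction per step but gives the critic the more informative of the two TD errors and gives the actor a two-sample gradient estimate, which is one plausible reading of where the claimed convergence gains come from.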
Source
Control and Decision (《控制与决策》), 2015, No. 12, pp. 2161-2167 (7 pages)
Indexed in EI, CSCD, and the Peking University Core Journals list
Funding
National Natural Science Foundation of China (61100118, 60671033); Natural Science Foundation of Hainan Province (613153)