Abstract
With traditional probabilistic planning techniques, planning rules can be obtained only when the state transition function is already known. When it is unknown, reinforcement learning can learn the policy knowledge of a dynamic environment online by means of trial and error and immediate rewards. An adaptive algorithm for extracting probabilistic planning rules is therefore proposed. Starting from the optimal state-action value function obtained by reinforcement learning, the algorithm first derives by iteration two reward-free value functions, one with and one without a discount factor. Sub-plan pruning then removes every sub-plan longer than the specified number of planning steps, yielding a revised state-action value function. Finally, a beam search extracts from this value function the planning rules that satisfy the probabilistic planning conditions, so that the rules can still be obtained when the state transition function is unknown or the planning model changes dynamically. Experiments demonstrate the validity and convergence of the algorithm.
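To make the pipeline described in the abstract concrete, the following Python sketch outlines one possible reading of it; it is an illustration under assumptions, not the authors' implementation. The transition table P, the set goal_states, and the parameters gamma, n_iter, beam_width, max_steps and prob_threshold are hypothetical names introduced here: reward_free_values performs a reward-free value iteration (gamma < 1 gives the discounted variant, gamma = 1 the undiscounted one), and beam_extract combines the sub-plan pruning (plans longer than max_steps are dropped) with a beam search that keeps only plans whose success probability meets the probabilistic-planning threshold.

from collections import defaultdict

def reward_free_values(P, goal_states, gamma=1.0, n_iter=100):
    # Reward-free value iteration: V(s) approximates the (optionally
    # discounted, via gamma < 1) probability of reaching a goal state from s.
    V = defaultdict(float)
    for g in goal_states:
        V[g] = 1.0
    for _ in range(n_iter):
        for s, actions in P.items():
            if s in goal_states:
                continue
            V[s] = max(gamma * sum(p * V[t] for t, p in succ.items())
                       for succ in actions.values())
    return V

def beam_extract(P, V, start, goal_states, beam_width=3,
                 max_steps=8, prob_threshold=0.5):
    # Beam search guided by the reward-free values.  Partial plans longer
    # than max_steps are discarded (the sub-plan pruning), and a finished
    # plan is kept only if its success probability satisfies the threshold.
    beam = [([], start, 1.0)]          # (actions so far, current state, probability)
    rules = []
    for _ in range(max_steps):
        candidates = []
        for plan, s, prob in beam:
            if s in goal_states:
                if prob >= prob_threshold:
                    rules.append((plan, prob))
                continue
            for a, succ in P.get(s, {}).items():
                # follow the most valuable successor of action a
                t = max(succ, key=lambda x: V[x])
                candidates.append((plan + [(s, a)], t, prob * succ[t]))
        # keep only the beam_width most probable partial plans
        beam = sorted(candidates, key=lambda c: -c[2])[:beam_width]
        if not beam:
            break
    # harvest goal-reaching plans left in the final beam
    for plan, s, prob in beam:
        if s in goal_states and prob >= prob_threshold:
            rules.append((plan, prob))
    return rules

# Toy two-state example with hypothetical data, kept deliberately small:
P = {"s0": {"go": {"s1": 0.8, "s0": 0.2}},
     "s1": {"go": {"goal": 0.9, "s0": 0.1}}}
V = reward_free_values(P, goal_states={"goal"})        # undiscounted variant
rules = beam_extract(P, V, "s0", goal_states={"goal"})
print(rules)   # one extracted rule: reach the goal in two steps, probability ~0.72

In this reading, the reward-free value function plays the role of a reachability estimate that ranks actions during extraction, while the beam width and step bound keep the rule set small; the paper's exact pruning criterion and extraction conditions may differ.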
Source
《南京大学学报(自然科学版)》 (Journal of Nanjing University (Natural Science)), 2003, No. 2, pp. 145-152 (8 pages)
Indexed in CAS, CSCD, and the Peking University Core Journals list (北大核心)
Funding
National Natural Science Foundation of China (69905001, 60103012)