Abstract
A reinforcement-learning agent solves its decision problems by learning an optimal policy that maps states to actions. In reinforcement learning, the agent improves its behavior through trial-and-error interaction with its environment. The Markov Decision Process (MDP) model is the general framework for solving reinforcement-learning problems, and Dynamic Programming (DP) is the method by which the agent learns policy-dependent value functions in a Markov environment. However, the agent must store all of the value functions during learning, and this memory requirement grows very large as the state space increases. This paper proposes a forgetting algorithm for dynamic-programming-based reinforcement learning: by introducing basic principles of forgetting from the psychology of memory into value-function learning, it derives an improved dynamic-programming method for solving reinforcement-learning problems, the Forget-DP algorithm.
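The abstract does not give the details of Forget-DP, but the idea it describes, DP policy evaluation in which stored values decay like fading memories, can be sketched minimally as follows. The tiny MDP, the fixed policy, and the retention rate `phi` are all illustrative assumptions, not the paper's algorithm.

```python
# Illustrative sketch (NOT the paper's Forget-DP): dynamic-programming policy
# evaluation where stored value estimates decay each sweep, mimicking memory
# forgetting. `phi` is a hypothetical retention rate chosen for illustration.

# A tiny 2-state MDP: transitions[s][a] = (next_state, reward).
transitions = {
    0: {"stay": (0, 0.0), "go": (1, 1.0)},
    1: {"stay": (1, 0.5), "go": (0, 0.0)},
}
policy = {0: "go", 1: "stay"}  # fixed policy being evaluated
gamma = 0.9                    # discount factor
phi = 0.99                     # hypothetical forgetting (retention) rate

V = {s: 0.0 for s in transitions}
for sweep in range(200):
    # Forgetting step: every stored value decays slightly toward zero,
    # as if unrehearsed memories fade between sweeps.
    for s in V:
        V[s] *= phi
    # Standard DP backup under the fixed policy.
    for s in transitions:
        ns, r = transitions[s][policy[s]]
        V[s] = r + gamma * V[ns]

print({s: round(v, 3) for s, v in V.items()})
```

Because the decayed values feed into each backup, the iteration converges to values slightly below the ordinary policy-evaluation fixed point; the trade-off is that stale entries shrink on their own, which is the memory-saving effect the abstract motivates.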
Source
《计算机工程与应用》
CSCD
Peking University Core Journal (北大核心)
2004, No. 16, pp. 75-78, 81 (5 pages)
Computer Engineering and Applications
Funding
Supported by the National Natural Science Foundation of China (No. 60075019)
Keywords
reinforcement learning
Markov decision process
dynamic programming
value function
memory
forgetting algorithm