Abstract
Reinforcement learning has begun to develop towards relational reinforcement learning, and many new algorithms have been produced. Most of these approaches upgrade propositional representations to relational or computational-logic representations. They have exhibited many good properties, but the corresponding theoretical analysis is still lacking, namely, why does relational reinforcement learning have these good properties? We therefore propose measure-space structures for the underlying (ground) Markov decision process and for the logical Markov decision process, and use the theories of conditional expectation and regular conditional probability from modern probability theory to establish a deep connection between the two kinds of Markov decision processes, thereby confirming that an optimal policy of the logical Markov decision process is, in a certain average sense, an optimal policy of the corresponding ground Markov decision process. Finally, a logical Markov decision programming method is obtained through example analysis. Building the measure-space structure of logical Markov decision processes can provide a mathematical framework for relational reinforcement learning.
Because of the very large state spaces arising in the real world, reinforcement learning has been developing towards relational reinforcement learning, and many approaches have been presented, such as logical Markov decision processes. Many of these approaches upgrade propositional representations towards relational or computational-logic representations. These approaches have already shown many good qualities. However, a corresponding theory is missing, that is, why do these relational reinforcement learning approaches have such good qualities? So we construct a ground measure space for the underlying Markov decision processes and a logical measure space for logical Markov decision processes, and then use two profound concepts from modern probability theory, conditional expectation and regular conditional probability, to combine the two spaces. In this way, we establish a deep relationship between the underlying Markov decision process (MDP) and the logical Markov decision process. Within this mathematical framework we prove that an optimal policy found at the abstraction level is always optimal at the ground level of the underlying Markov decision process in some average sense. Many relational reinforcement learning techniques have this property, but do not give such a proof. Moreover, we give a definite semantics for the probability and the reward function in an abstract transition of logical Markov decision processes. The Markov decision process built on both the ground measure space and the logical measure space also reflects an important characteristic of the human mind: when facing various problems, especially complex ones, people always tackle them first from an abstract or principled perspective; having obtained a whole plan, they then consider the details; finally, the plan is actually carried out. In this paper, we think of problems at two different levels. The logical MDP corresponds to an abstract level that avoids a tremendous number of explicit states, thus making problems simpler. When we find an optimal policy in the logical MDP, we have obtained an optimal solution in the ground MDP in the average sense. Many techniques of relational reinforcement learning share this characteristic, and within the framework constructed in this paper the proof of this characteristic is very clear. In summary, we believe this framework will not only bring a stochastic and intelligent style to reinforcement learning, but also provide a sound basis for verifying the validity of logical Markov decision process theory. Moreover, it can provide a new pattern for studying the characteristics of the human mind. Just as people could manufacture aircraft only after deeply understanding aerodynamics, we can deepen the study of AI only when we have deeply understood the essence of the human mind. The two-space view of Markov decision processes is just an initial effort to deepen that understanding.
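To make the claimed connection concrete, the following is a minimal formal sketch in LaTeX, assuming a standard measure-theoretic setup; the notation (a ground state space S with probability measure μ, abstract logical states Z as measurable blocks of S, ground transition kernel P, ground reward r, and value function V^π) is chosen here for illustration and is not taken verbatim from the paper.

% Sketch under assumed notation: (S, \mathcal{F}, \mu) is the ground state
% space; Z, Z' \subseteq S are abstract (logical) states; a is an action.
% Abstract transition probability: a regular conditional probability,
% i.e. the \mu-average of the ground transition kernel over the block Z.
\[
  p(Z' \mid Z, a) \;=\; \frac{1}{\mu(Z)} \int_{Z} P(s, a, Z')\, \mu(\mathrm{d}s)
\]
% Abstract reward: the conditional expectation of the ground reward given Z.
\[
  R(Z, a) \;=\; \mathbb{E}\left[\, r(s, a) \mid s \in Z \,\right]
  \;=\; \frac{1}{\mu(Z)} \int_{Z} r(s, a)\, \mu(\mathrm{d}s)
\]
% Average-sense optimality: a policy \pi^{*} optimal for the logical MDP
% maximizes the \mu-average ground value on every abstract state Z.
\[
  \frac{1}{\mu(Z)} \int_{Z} V^{\pi^{*}}(s)\, \mu(\mathrm{d}s)
  \;\ge\;
  \frac{1}{\mu(Z)} \int_{Z} V^{\pi}(s)\, \mu(\mathrm{d}s)
  \qquad \text{for every policy } \pi \text{ and abstract state } Z.
\]

Read this way, "optimal in some average sense" means the abstract policy cannot be beaten in μ-average value on any abstract state, even though individual ground states inside a block Z may still admit better state-specific actions.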
Source: 《南京大学学报(自然科学版)》 (Journal of Nanjing University (Natural Science)), 2013, No. 4, pp. 439-447 (9 pages)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
Fund: Scientific Research Foundation of Jinling Institute of Technology (jit-b-201207)