Funding: This work is supported by the National Key Research and Development Program of China (2018YFA0701603) and the Natural Science Foundation of Anhui Province (2008085MF213).
Abstract: Reinforcement learning can be modeled mathematically as a Markov decision process. Consequently, the interaction samples and the connection relations between them are the two main types of information available for learning. However, most recent work on deep reinforcement learning treats samples independently, both within and across episodes. In this paper, to exploit more of the sample information, we propose an additional learning system based on a directed associative graph (DAG). The DAG is built from all trajectories in real time and captures the full connection relations among samples across all episodes. By planning along the directed edges of the DAG, we obtain another perspective for estimating state-action values, especially for pairs that are unknown to the deep neural network (DNN) and to episodic memory (EM). A mixed loss function is constructed from the three learning systems (DNN, EM, and DAG) to improve the efficiency of parameter updates in the proposed algorithm. We show that our algorithm significantly outperforms state-of-the-art algorithms in performance and sample efficiency on the test environments. Furthermore, the convergence of our algorithm is proved in the appendix, and its long-term performance as well as the effects of the DAG are verified.
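The abstract describes two mechanisms: a graph that links transitions across episodes and a loss that mixes the DNN, EM, and DAG value estimates. The sketch below illustrates one plausible reading of these ideas; the class and function names (`DirectedAssociativeGraph`, `mixed_loss`), the value-iteration-style planning, and the loss weights are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
# A minimal sketch, assuming hashable state keys and a PyTorch-based agent:
# (1) a directed graph storing every observed transition across episodes,
# (2) value backups along its directed edges ("planning"), and
# (3) a mixed loss combining the DNN, EM, and DAG estimates of a Q-value.
from collections import defaultdict

import torch.nn.functional as F


class DirectedAssociativeGraph:
    """Stores all transitions as directed edges s --a--> s', shared across episodes."""

    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.edges = defaultdict(dict)   # state -> {action: (reward, next_state)}
        self.value = defaultdict(float)  # state -> planned state value

    def add_transition(self, state, action, reward, next_state):
        # Identical states from different episodes map to the same node,
        # so the graph records the connection relations among all samples.
        self.edges[state][action] = (reward, next_state)

    def plan(self, sweeps=10):
        # Repeated backups along directed edges (value-iteration style).
        for _ in range(sweeps):
            for state, actions in self.edges.items():
                self.value[state] = max(
                    r + self.gamma * self.value[s_next]
                    for r, s_next in actions.values()
                )

    def q_value(self, state, action):
        # Returns None for state-action pairs the graph has not observed.
        if action not in self.edges.get(state, {}):
            return None
        r, s_next = self.edges[state][action]
        return r + self.gamma * self.value[s_next]


def mixed_loss(q_pred, td_target, em_target, dag_target,
               w_td=1.0, w_em=0.1, w_dag=0.1):
    """Combine the three estimates (DNN bootstrap, EM return, DAG plan) into one loss.

    The weights are placeholders, not values taken from the paper.
    """
    loss = w_td * F.mse_loss(q_pred, td_target)
    if em_target is not None:
        loss = loss + w_em * F.mse_loss(q_pred, em_target)
    if dag_target is not None:
        loss = loss + w_dag * F.mse_loss(q_pred, dag_target)
    return loss
```

In this reading, the DAG term only contributes for state-action pairs the graph has already connected, which matches the abstract's point that the graph offers value estimates complementary to the DNN and EM rather than replacing them.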