Funding: Supported by the National Natural Science Foundation of China (Grant No. 12072090).
Abstract: This work proposes a recorded recurrent twin delayed deep deterministic (RRTD3) policy gradient algorithm to solve the challenge of constructing guidance laws for intercepting endoatmospheric maneuvering missiles with uncertainties and observation noise. The attack-defense engagement scenario is modeled as a partially observable Markov decision process (POMDP). Given the benefits of recurrent neural networks (RNNs) in processing sequence information, an RNN layer is incorporated into the agent's policy network to alleviate the bottleneck of traditional deep reinforcement learning methods when dealing with POMDPs. Because the detection frequency of an interceptor is usually higher than its guidance frequency, the measurements from the interceptor's seeker during each guidance cycle are combined into one sequence as the input to the policy network. During training, the hidden states of the RNN layer in the policy network are recorded to overcome the partial observability that this RNN layer introduces inside the agent. The training curves show that the proposed RRTD3 successfully enhances data efficiency, training speed, and training stability. The test results confirm the advantages of the RRTD3-based guidance laws over several conventional guidance laws.
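The abstract does not include implementation details, but the core "recorded" idea can be sketched in a few lines: store the RNN hidden state that preceded each observation sequence in the replay buffer, so that training can later re-run the recurrence from the recorded state rather than from scratch. The toy single-unit recurrent cell, its weights, and the noise levels below are all invented for illustration and are not the paper's network:

```python
import math
import random
from collections import deque

class TinyRNNPolicy:
    """Toy one-unit recurrent policy: h' = tanh(w_h*h + w_x*x), action = tanh(w_a*h')."""
    def __init__(self, w_h=0.5, w_x=1.0, w_a=1.0):
        self.w_h, self.w_x, self.w_a = w_h, w_x, w_a

    def step(self, h, obs_seq):
        # Consume one guidance cycle's sequence of seeker measurements
        # (several detections per guidance step, as described in the abstract).
        for x in obs_seq:
            h = math.tanh(self.w_h * h + self.w_x * x)
        return math.tanh(self.w_a * h), h  # (action, new hidden state)

random.seed(0)
buffer = deque(maxlen=1000)   # replay buffer that also records hidden states
policy, h = TinyRNNPolicy(), 0.0
for t in range(5):
    obs_seq = [random.gauss(0.0, 0.1) for _ in range(4)]  # noisy measurements
    h_in = h                                  # record the state *before* the update
    action, h = policy.step(h, obs_seq)
    buffer.append((h_in, obs_seq, action))    # training can re-run the RNN from h_in

assert len(buffer) == 5
assert all(-1.0 <= a <= 1.0 for (_, _, a) in buffer)  # tanh-bounded actions
```

Without the recorded hidden state, a sampled transition would be partially observable to the learner itself; storing it alongside each sequence restores the information the recurrence depends on.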
Funding: Supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155885, Artificial Intelligence Convergence Innovation Human Resources Development (Hanyang University ERICA)); by the National Natural Science Foundation of China under Grant No. 61971264; and by the National Natural Science Foundation of China/Research Grants Council Collaborative Research Scheme under Grant No. 62261160390.
Abstract: Due to the fading characteristics of wireless channels and the burstiness of data traffic, effectively handling congestion in ad-hoc networks remains an open and challenging problem. In this paper, we focus on enabling congestion control to minimize network transmission delays through flexible power control. To effectively solve the congestion problem, we propose a distributed cross-layer scheduling algorithm empowered by graph-based multi-agent deep reinforcement learning. The transmit power is adaptively adjusted in real time by our algorithm based only on local information (i.e., channel state information and queue length) and local communication (i.e., information exchanged with neighbors). Moreover, the training complexity of the algorithm is low thanks to regional cooperation based on a graph attention network. In the evaluation, we show that our algorithm can reduce the transmission delay of data flows under severe signal interference and drastically changing channel states, and we demonstrate its adaptability and stability across different topologies. The method is general and can be extended to various types of topologies.
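As a rough illustration of the regional cooperation described above (not the paper's actual network), a graph-attention-style aggregation over neighbor features can be sketched with a dot-product score and a softmax; the feature values are invented for illustration:

```python
import math

def attention_aggregate(own, neighbors):
    """Score each neighbor against our own features (dot product), softmax the
    scores into attention weights, and return the weighted feature aggregate."""
    scores = [sum(o * n for o, n in zip(own, nb)) for nb in neighbors]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    agg = [sum(w * nb[i] for w, nb in zip(weights, neighbors))
           for i in range(len(own))]
    return weights, agg

# Local information only: own (channel state, queue length) and the same pair
# exchanged with one-hop neighbors -- the values here are toy numbers.
own = [0.8, 0.2]
neighbors = [[0.9, 0.1], [0.1, 0.9]]
weights, agg = attention_aggregate(own, neighbors)

assert abs(sum(weights) - 1.0) < 1e-9   # softmax weights form a distribution
assert weights[0] > weights[1]          # the more similar neighbor gets more attention
```

A per-node power-control policy would then condition on `own` plus `agg`, which is what keeps training complexity local rather than global.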
Funding: Supported in part by the National Natural Science Foundation of China under Grant 61701038.
Abstract: Intelligent edge computing uses edge devices of the Internet of Things (IoT) for data collection, computation, and intelligent analysis, so that data can be analyzed nearby and feedback given in a timely manner. Because of the mobility of mobile equipment (ME), if an ME moves out of the coverage of the small cell networks (SCNs), the offloaded tasks cannot be returned to it successfully; as a result, migration incurs additional costs. In this paper, joint task offloading and migration schemes based on reinforcement learning (RL) are proposed for a mobility-aware mobile edge computing (MEC) network to maximize system revenue. Firstly, the joint optimization problem of maximizing the total revenue of the MEs is formulated, taking the mobility of the MEs into account. Secondly, considering time-varying computation tasks and resource conditions, the mixed-integer non-linear programming (MINLP) problem is described as a Markov decision process (MDP). Then a novel reinforcement-learning-based optimization framework is proposed to solve the problem in place of traditional methods. Finally, simulation results show that the proposed schemes can markedly raise the total revenue of the MEs.
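A heavily simplified, hypothetical version of the offload-or-migrate decision can be written as tabular Q-learning over a toy state space; the states, actions, and reward values below are invented and merely stand in for the paper's MINLP/MDP formulation:

```python
import random

# Toy states: ME position relative to its serving small cell; toy actions.
states = ["in_cell", "leaving", "out_of_cell"]
actions = ["local", "offload", "offload_and_migrate"]

def reward(s, a):
    # Invented revenues: offloading pays off only if the result can still be
    # delivered; migration recovers delivery at an extra cost.
    table = {
        ("in_cell", "local"): 1.0, ("in_cell", "offload"): 3.0,
        ("in_cell", "offload_and_migrate"): 2.0,
        ("leaving", "local"): 1.0, ("leaving", "offload"): 0.5,
        ("leaving", "offload_and_migrate"): 2.0,
        ("out_of_cell", "local"): 1.0, ("out_of_cell", "offload"): -2.0,
        ("out_of_cell", "offload_and_migrate"): 1.5,
    }
    return table[(s, a)]

random.seed(1)
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, eps = 0.3, 0.2
for _ in range(2000):
    s = random.choice(states)
    a = (random.choice(actions) if random.random() < eps
         else max(actions, key=lambda a: Q[(s, a)]))
    Q[(s, a)] += alpha * (reward(s, a) - Q[(s, a)])  # one-step value update

best = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
assert best["in_cell"] == "offload"                  # offload while coverage holds
assert best["out_of_cell"] == "offload_and_migrate"  # migration rescues the result
```

The learned policy reproduces the qualitative trade-off the abstract describes: offloading is best inside coverage, while mobility makes migration worth its cost.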
Funding: Supported by the National Defense Science and Technology Innovation (18-163-15-LZ-001-004-13).
Abstract: This paper investigates a guidance method based on reinforcement learning (RL) for coplanar orbital interception in a continuous low-thrust scenario. The problem is formulated as a Markov decision process (MDP) model; then a well-designed RL algorithm, experience-based deep deterministic policy gradient (EBDDPG), is proposed to solve it. By taking advantage of prior information generated through the optimal control model, the proposed algorithm not only resolves the convergence problem of common RL algorithms but also successfully trains an efficient deep neural network (DNN) controller for the chaser spacecraft to generate the control sequence. Numerical simulation results show that the proposed algorithm is feasible and that the trained DNN controller improves efficiency over traditional optimization methods by roughly two orders of magnitude.
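The abstract says the algorithm exploits prior information from the optimal control model; one plausible reading (an assumption for illustration, not the paper's confirmed mechanism) is to seed the replay buffer with optimal-control trajectories before learning starts, so that early critic updates already see useful data. All dynamics and gains below are placeholders:

```python
import random
from collections import deque

random.seed(0)

def optimal_control_trajectory(n=20):
    """Stand-in for a trajectory produced by the optimal control model."""
    traj, x = [], 1.0
    for _ in range(n):
        u = -0.5 * x                 # placeholder stabilizing control law
        x_next = x + 0.1 * u         # placeholder one-dimensional dynamics
        traj.append((x, u, -abs(x_next), x_next))  # (state, action, reward, next state)
        x = x_next
    return traj

replay = deque(maxlen=10000)
# Experience-based initialization: fill the buffer with prior trajectories
# before any environment interaction.
for _ in range(10):
    replay.extend(optimal_control_trajectory())

assert len(replay) == 200
batch = random.sample(list(replay), 32)   # minibatch for the first actor-critic update
assert len(batch) == 32
```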
Funding: Supported by the National Natural Science Foundation of China (No. U1633115) and the Science and Technology Foundation of Beijing Municipal Commission of Education (No. KM201810005027).
Abstract: With the rapid development of air transportation in recent years, airport operations have attracted much attention, and the airport gate assignment problem (AGAP) has become a research hotspot. However, real-time AGAP algorithms remain an open issue. In this study, a deep reinforcement learning based AGAP (DRL-AGAP) algorithm is proposed. The optimization objective is to maximize the rate of flights assigned to fixed gates. The real-time AGAP is modeled as a Markov decision process (MDP), with the state space, action space, value function, and rewards defined. The DRL-AGAP algorithm is evaluated via simulation and compared with flight pre-assignment results from the Gurobi optimization solver and a greedy algorithm. Simulation results show that the performance of the proposed DRL-AGAP algorithm is close to that of the pre-assignment obtained by the Gurobi optimization solver. Meanwhile, real-time assignment is ensured by the proposed DRL-AGAP algorithm thanks to its dynamic modeling and lower complexity.
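As a concrete toy instance of the assignment objective (flight times and gate count invented for illustration), a greedy baseline over the MDP's per-flight decisions looks like this:

```python
# Toy AGAP instance: assign arriving flights to fixed gates one by one; each
# accepted flight earns a +1 reward, and the objective is the rate of flights
# assigned to fixed gates (unassigned flights implicitly go to remote stands).
flights = [(0, 3), (1, 4), (2, 5), (4, 6), (5, 8)]  # (arrival, departure) slots
n_gates = 2
gate_free_at = [0] * n_gates       # earliest slot at which each gate is free

assigned = 0
for arr, dep in flights:           # one decision step per flight, as in an MDP
    candidates = [g for g in range(n_gates) if gate_free_at[g] <= arr]
    if candidates:                 # greedy action: earliest-free compatible gate
        g = min(candidates, key=lambda g: gate_free_at[g])
        gate_free_at[g] = dep
        assigned += 1

rate = assigned / len(flights)
assert rate == 0.8                 # 4 of the 5 toy flights fit a fixed gate
```

A DRL agent replaces the greedy rule with a learned value over the same state (gate occupancy plus the incoming flight), which is what allows it to approach the solver's pre-assignment quality while remaining real-time.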
Funding: Supported by the National Natural Science Foundation of China (60474035), the National Research Foundation for the Doctoral Program of Higher Education of China (20050359004), and the Natural Science Foundation of Anhui Province (070412035).
Funding: Supported by the Guangxi Key Laboratory Fund of Embedded Technology and Intelligent System under Grant No. 2018B-1; the Natural Science Foundation for Distinguished Young Scholars of Jiangsu Province under Grant No. BK20160034; the National Natural Science Foundation of China under Grant Nos. 61771488, 61671473 and 61631020; and in part by the Open Research Foundation of Science and Technology on Communication Networks Laboratory.
Abstract: High-frequency (HF) communication is one of the essential communication methods for military and emergency applications. However, selecting a communication frequency channel is always difficult because of the crowded spectrum, time-varying channels, and malicious intelligent jamming. Existing frequency hopping, automatic link establishment, and some new anti-jamming technologies cannot completely solve these problems. In this article, we adopt deep reinforcement learning to address this intractable challenge. First, the combination of the spectrum state and the channel gain state is defined as the complex environmental state, and the Markov property of the defined state is analyzed and proved. Then, considering that the spectrum state and channel gain state are heterogeneous information, a new deep Q network (DQN) framework is designed, which contains multiple sub-networks to process different kinds of information. Finally, to improve learning speed and efficiency, the optimization targets of the corresponding sub-networks are carefully designed, and a heterogeneous information fusion deep reinforcement learning (HIF-DRL) algorithm is devised for frequency selection. Simulation results show that the proposed algorithm performs well in channel prediction, jamming avoidance, and frequency channel selection.
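A structural sketch of the multi-sub-network idea (random toy weights, not a trained DQN): each heterogeneous input type gets its own small sub-network, and the concatenated features are fused into per-channel Q-values. Layer sizes and inputs are invented for illustration:

```python
import math
import random

def subnet(features, weights):
    """One sub-network: a tiny linear layer with tanh, for one information type."""
    return [math.tanh(sum(w * f for w, f in zip(row, features))) for row in weights]

random.seed(0)
n_channels = 4
spectrum_state = [random.random() for _ in range(n_channels)]  # occupancy / jamming
gain_state = [random.random() for _ in range(n_channels)]      # channel gains

# Heterogeneous information: each kind of state has its own sub-network...
w_spec = [[random.uniform(-1, 1) for _ in range(n_channels)] for _ in range(3)]
w_gain = [[random.uniform(-1, 1) for _ in range(n_channels)] for _ in range(3)]
h = subnet(spectrum_state, w_spec) + subnet(gain_state, w_gain)  # ...then fuse

# Fusion head maps the concatenated features to per-channel Q-values.
w_head = [[random.uniform(-1, 1) for _ in range(len(h))] for _ in range(n_channels)]
q_values = [sum(w * x for w, x in zip(row, h)) for row in w_head]
best_channel = max(range(n_channels), key=lambda c: q_values[c])

assert len(q_values) == n_channels
assert 0 <= best_channel < n_channels
```

Separating the sub-networks lets each optimization target be tailored to its information type (as the abstract describes), instead of forcing one network to model two statistically different inputs.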
Funding: Supported in part by the National Natural Science Foundation of China (61922076, 61725304, 61873252, 61991403, 61991400) and in part by the Australian Research Council Discovery Program (DP200101199).
Abstract: This paper studies price-based residential demand response management (PB-RDRM) in smart grids, in which both non-dispatchable and dispatchable loads (including general loads and plug-in electric vehicles (PEVs)) are involved. The PB-RDRM is composed of a bi-level optimization problem, in which the upper-level dynamic retail pricing problem aims to maximize the profit of a utility company (UC) by selecting optimal retail prices (RPs), while the lower-level demand response (DR) problem seeks to minimize the comprehensive cost of loads by coordinating their energy consumption behavior. The challenges here are mainly two-fold: 1) the uncertainty of energy consumption and RPs; 2) the flexible PEVs' temporally coupled constraints, which make it impossible to directly develop a model-based optimization algorithm to solve the PB-RDRM. To address these challenges, we first model the dynamic retail pricing problem as a Markov decision process (MDP), and then employ a model-free reinforcement learning (RL) algorithm to learn the optimal dynamic RPs of the UC according to the loads' responses. Our proposed RL-based DR algorithm is benchmarked against two model-based optimization approaches (i.e., the distributed dual decomposition-based (DDB) method and the distributed primal-dual interior (PDI)-based method), which require exact load and electricity price models. The comparison results show that, compared with the benchmark solutions, our proposed algorithm can not only adaptively decide the RPs through on-line learning but also achieve larger social welfare within an unknown electricity market environment.
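The upper-level pricing MDP can be illustrated with a minimal model-free learner; the price grid, wholesale cost, and demand curve below are invented for illustration and play the role of the unknown load response:

```python
import random

random.seed(0)
prices = [0.1, 0.2, 0.3]     # candidate retail prices, a toy discretization
cost = 0.12                  # UC's wholesale cost per unit

def demand(p):
    # Load response unknown to the agent: higher price, lower dispatchable demand.
    return max(0.0, 10.0 - 25.0 * p) + random.gauss(0.0, 0.2)

Q = {p: 0.0 for p in prices}
alpha, eps = 0.1, 0.2
for _ in range(5000):
    # Epsilon-greedy price selection: explore sometimes, otherwise exploit.
    p = random.choice(prices) if random.random() < eps else max(prices, key=Q.get)
    profit = (p - cost) * demand(p)      # UC profit observed this pricing period
    Q[p] += alpha * (profit - Q[p])      # model-free value update from feedback

# Mean profits: 0.1 -> -0.15, 0.2 -> 0.40, 0.3 -> 0.45 (from the toy demand curve)
assert max(prices, key=Q.get) == 0.3
```

The point of the sketch is the feedback loop: the learner recovers the profit-maximizing price purely from observed responses, which is what lets the paper's RL approach dispense with the exact load and price models the DDB/PDI benchmarks require.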
Abstract: Reinforcement learning provides a cognitive-science perspective on behavior and sequential decision making, in that reinforcement learning algorithms introduce a computational concept of agency to the learning problem. It thus addresses an abstract class of problems that can be characterized as follows: an algorithm confronted with information from an unknown environment is supposed to find, stepwise, an optimal way to behave based only on sparse, delayed, or noisy feedback from an environment that changes according to the algorithm's behavior. Reinforcement learning therefore offers an abstraction of the problem of goal-directed learning from interaction. The paper offers an opinionated introduction to the advantages and drawbacks of several algorithmic approaches in order to lay out algorithmic design options.
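The agent-environment abstraction described above can be made concrete in a few lines, assuming a toy five-state chain whose state changes with the agent's actions and whose feedback is sparse (reward only at the goal state); all numbers are invented for illustration:

```python
import random

random.seed(42)

class Environment:
    """Unknown environment: its state changes according to the algorithm's
    behavior, and feedback is sparse (reward only at the goal state)."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        return self.state, (1.0 if self.state == 4 else 0.0)

actions = (1, -1)
Q = {(s, a): 0.0 for s in range(5) for a in actions}
env, s, total = Environment(), 0, 0.0
for _ in range(500):        # goal-directed learning purely from interaction
    a = (random.choice(actions) if random.random() < 0.1
         else max(actions, key=lambda a: Q[(s, a)]))
    s2, r = env.step(a)
    # One-step Q-learning update from the sparse, delayed feedback.
    Q[(s, a)] += 0.2 * (r + 0.9 * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
    s, total = s2, total + r

assert total > 100          # the agent found and kept reaching the rewarding state
assert Q[(3, 1)] > 0.0      # value propagated to the transition into the goal
```

The loop contains exactly the ingredients the characterization names: an unknown environment, stepwise behavior, and improvement driven only by the feedback signal.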
Abstract: Deep reinforcement learning (deep RL) has the potential to replace classic robotic controllers. State-of-the-art deep RL algorithms such as Proximal Policy Optimization, Twin Delayed Deep Deterministic Policy Gradient, and Soft Actor-Critic, to mention a few, have been investigated for training robots to walk. However, conflicting performance results for these algorithms have been reported in the literature. In this work, we present a performance analysis of the above three state-of-the-art deep RL algorithms for a constant-velocity walking task on a quadruped. The performance is analyzed by simulating the walking task of a quadruped equipped with the range of sensors present on a physical quadruped robot. Simulations of the three algorithms across a range of sensor inputs and with domain randomization are performed. The strengths and weaknesses of each algorithm for the given task are discussed. We also identify the set of sensors that contributes to the best performance of each algorithm.
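Domain randomization, mentioned above, can be sketched as sampling simulator parameters afresh each episode so the learned policy does not overfit one configuration; the parameter names and ranges below are invented for illustration, not the paper's actual settings:

```python
import random

random.seed(0)

def randomized_sim_params():
    """Sample toy physical parameters per episode (domain randomization)."""
    return {
        "mass": random.uniform(8.0, 12.0),          # body mass, kg (toy range)
        "friction": random.uniform(0.6, 1.2),       # foot-ground friction coefficient
        "sensor_noise": random.uniform(0.0, 0.05),  # std of added sensor noise
    }

def noisy_observation(true_obs, noise_std):
    # Perturb each sensor reading, mimicking real-robot measurement noise.
    return [x + random.gauss(0.0, noise_std) for x in true_obs]

episodes = [randomized_sim_params() for _ in range(100)]
assert all(8.0 <= e["mass"] <= 12.0 for e in episodes)

obs = noisy_observation([0.0, 1.0, 2.0], episodes[0]["sensor_noise"])
assert len(obs) == 3
```

Each training episode would then build its simulator from one sampled parameter set, which is the mechanism that narrows the sim-to-real gap the comparison above probes.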