A stochastic resource allocation model, based on the principles of Markov decision processes (MDPs), is proposed in this paper. In particular, a general-purpose framework is developed that takes into account resource requests for both instant and future needs. The framework can handle two types of reservations (i.e., specified and unspecified time-interval reservation requests) and implements an overbooking business strategy to further increase revenues. The resulting dynamic pricing problems can be regarded as sequential decision-making problems under uncertainty, which are solved by means of stochastic dynamic programming (DP) based algorithms. In this regard, Bellman's backward principle of optimality is exploited to provide the implementation mechanisms for the proposed reservation pricing algorithm. The curse of dimensionality, an inevitable issue of DP, occurs both for instant resource requests and future resource reservations. An approximate dynamic programming (ADP) technique based on linear function approximations is therefore applied to resolve such scalability issues. Several examples are provided to show the effectiveness of the proposed approach.
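Bellman's backward principle of optimality, as invoked above, can be sketched for a generic finite-horizon MDP as a short backward-induction routine. This is a minimal illustrative sketch, not the paper's pricing algorithm; the transition matrices `P`, reward vectors `R`, and horizon are stand-ins.

```python
import numpy as np

def backward_dp(P, R, horizon):
    """Finite-horizon backward induction for a small MDP.

    P[a] is the |S|x|S| transition matrix under action a,
    R[a] is the |S| immediate-reward vector under action a.
    Returns the optimal value function at t=0 and a greedy
    decision rule for each stage.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)            # terminal value V_H = 0
    policies = []
    for _ in range(horizon):          # sweep backwards in time
        Q = np.array([R[a] + P[a] @ V for a in range(n_actions)])
        policies.append(Q.argmax(axis=0))
        V = Q.max(axis=0)
    policies.reverse()                # policies[t] is the stage-t decision rule
    return V, policies
```

Ties between equally good actions are broken toward the lower action index by `argmax`; a production implementation would also need the request-arrival dynamics that this sketch abstracts away.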
An alpha-uniformized Markov chain is defined through the concept of the equivalent infinitesimal generator for a semi-Markov decision process (SMDP) with both average and discounted criteria. According to the relations between their performance measures and performance potentials, the optimization of an SMDP can be realized by simulating the chain. For the critic model of neuro-dynamic programming (NDP), a neuro-policy iteration (NPI) algorithm is presented, and the performance error bound is derived for the case where there are approximation and improvement errors in each iteration step. The obtained results may be extended to Markov systems and have wide applicability. Finally, a numerical example is provided.
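The uniformization step underlying an alpha-uniformized chain can be illustrated as follows: given an infinitesimal generator Q and a rate alpha no smaller than the largest exit rate, the discrete-time chain P = I + Q/alpha preserves the stationary behavior of the continuous-time process. A minimal sketch under these assumptions; the SMDP and NPI machinery of the paper are not reproduced.

```python
import numpy as np

def uniformize(Q_gen, alpha=None):
    """Build the alpha-uniformized transition matrix P = I + Q/alpha
    from an infinitesimal generator Q (rows sum to zero).
    alpha must be at least the largest exit rate max_i |Q[i, i]|.
    """
    Q_gen = np.asarray(Q_gen, dtype=float)
    if alpha is None:
        alpha = np.max(-np.diag(Q_gen))   # smallest admissible rate
    assert alpha >= np.max(-np.diag(Q_gen)) > 0
    return np.eye(Q_gen.shape[0]) + Q_gen / alpha
```

For the generator [[-2, 2], [1, -1]] and alpha = 2 this yields P = [[0, 1], [0.5, 0.5]], whose stationary distribution also satisfies pi Q = 0 for the original generator.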
This work proposes a recorded recurrent twin delayed deep deterministic (RRTD3) policy gradient algorithm to solve the challenge of constructing guidance laws for intercepting endoatmospheric maneuvering missiles with uncertainties and observation noise. The attack-defense engagement scenario is modeled as a partially observable Markov decision process (POMDP). Given the benefits of recurrent neural networks (RNNs) in processing sequence information, an RNN layer is incorporated into the agent's policy network to alleviate the bottleneck of traditional deep reinforcement learning methods when dealing with POMDPs. The measurements from the interceptor's seeker during each guidance cycle are combined into one sequence as the input to the policy network, since the detection frequency of an interceptor is usually higher than its guidance frequency. During training, the hidden states of the RNN layer in the policy network are recorded to overcome the partial observability that this RNN layer causes inside the agent. The training curves show that the proposed RRTD3 successfully enhances data efficiency, training speed, and training stability. The test results confirm the advantages of the RRTD3-based guidance laws over some conventional guidance laws.
Edge caching has emerged as a promising application paradigm in 5G networks; by building edge networks to cache content, it can alleviate the traffic load brought about by the rapid growth of Internet of Things (IoT) services and applications. Given the limitations of Edge Servers (ESs) and the large number of user demands, how to make caching decisions and utilize ES resources is significant. In this paper, we aim to minimize the total system energy consumption in a heterogeneous network and formulate the content caching optimization problem as a Mixed Integer Non-Linear Program (MINLP). To address the optimization problem, a Deep Q-Network (DQN)-based method is proposed to improve the overall performance of the system and reduce the backhaul traffic load. In addition, the DQN-based method can effectively overcome the limitations of traditional reinforcement learning (RL) in complex scenarios. Simulation results show that the proposed DQN-based method greatly outperforms other benchmark methods, significantly improving the cache hit rate and reducing the total system energy consumption in different scenarios.
To address the rapid growth of network services, which leads to long service request processing times and high deployment costs for network function virtualization service function chains (SFCs) in 5G networks, this paper proposes a multi-agent deep deterministic policy gradient optimization algorithm for SFC deployment (MADDPG-SD). Initially, an optimization model is devised to enhance the request acceptance rate while minimizing the latency and deployment cost of SFCs in the network resource-constrained case. Subsequently, we model the dynamic problem as a Markov decision process (MDP), facilitating adaptation to the evolving states of network resources. Finally, by allocating SFCs to different agents and adopting a collaborative deployment strategy, each agent aims to maximize the request acceptance rate or minimize latency and costs. These agents learn strategies from historical data of virtual network functions in SFCs to guide server node selection, and achieve approximately optimal SFC deployment strategies through a cooperative framework of centralized training and distributed execution. Experimental simulation results indicate that the proposed method, while simultaneously meeting performance requirements and resource capacity constraints, effectively increases the request acceptance rate compared to the comparative algorithms, reducing the end-to-end latency by 4.942% and the deployment cost by 8.045%.
The Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of the MDP, namely the quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and for finding optimal policies for qMDPs in the finite-horizon case. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.
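The classical counterpart of finite-horizon policy evaluation, which the qMDP work extends to quantum systems, is a one-line backward recursion V_t = R[a_t] + P[a_t] V_{t+1} with V_H = 0. A minimal sketch of the classical case only; the quantum extension is the paper's contribution and is not reproduced here.

```python
import numpy as np

def evaluate_policy(P, R, policy_seq):
    """Finite-horizon policy evaluation by backward recursion.

    P[a] is the |S|x|S| transition matrix and R[a] the |S| reward
    vector for action a; policy_seq[t][s] is the action taken in
    state s at stage t. Returns the stage-0 value of the policy.
    """
    n_states = P[0].shape[0]
    V = np.zeros(n_states)                    # V_H = 0
    for dec in reversed(policy_seq):          # stages H-1 ... 0
        V = np.array([R[dec[s]][s] + P[dec[s]][s] @ V
                      for s in range(n_states)])
    return V
```

Unlike optimal backward induction, no maximization over actions occurs; the policy is fixed and only its value is computed.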
Markov decision processes (MDPs) and their variants are widely studied in the theory of controls for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. Some literature on MDPs does take risks into account; much of it addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean rewards; the variability above mean rewards is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risks can lead to poor solutions in which the risk is zero or near zero but the average rewards are also rather low. Hence, in this paper, we study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing such risks outright, we only reduce this risk at the cost of slightly lowered average rewards. A solution where the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. To be more specific, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite-horizon MDP, the finite-horizon MDP, and the infinite-horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon. The algorithms are tested in numerical studies and show encouraging performance.
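The risk-adjusted objective described above, expected reward minus a scalar times the downside risk, can be estimated from simulated episode revenues. A minimal Monte Carlo sketch; the sample revenues, target, and penalty scalar below are illustrative stand-ins, not the paper's data.

```python
import numpy as np

def downside_risk_objective(revenues, target, lam):
    """Estimate E[R] - lam * P(R < target) from sampled episode revenues.

    This mirrors the risk-adjusted objective in the abstract: expected
    reward penalized by the probability of falling below a target.
    """
    revenues = np.asarray(revenues, dtype=float)
    mean_reward = revenues.mean()
    downside = (revenues < target).mean()   # empirical P(R < target)
    return mean_reward - lam * downside
```

Comparing two policies under this objective trades a small loss in mean revenue for a large drop in the probability of missing the target, which is exactly the compromise the abstract advocates.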
This paper is concerned with continuous-time Markov decision processes (MDPs) having weak and strong interactions. Using a hierarchical approach, the state space of the underlying Markov chain can be decomposed into several groups of recurrent states and a group of transient states, resulting in a singularly perturbed MDP formulation. Instead of solving the original problem directly, a limit problem that is much simpler to handle is derived. On the basis of the optimal solution of the limit problem, nearly optimal decisions are constructed for the original problem. The asymptotic optimality of the constructed control is obtained, and the rate of convergence is ascertained.
Due to the fading characteristics of wireless channels and the burstiness of data traffic, how to deal with congestion in ad-hoc networks with effective algorithms remains open and challenging. In this paper, we focus on enabling congestion control to minimize network transmission delays through flexible power control. To effectively solve the congestion problem, we propose a distributed cross-layer scheduling algorithm, which is empowered by graph-based multi-agent deep reinforcement learning. The transmit power is adaptively adjusted in real time by our algorithm based only on local information (i.e., channel state information and queue length) and local communication (i.e., information exchanged with neighbors). Moreover, the training complexity of the algorithm is low due to the regional cooperation based on the graph attention network. In the evaluation, we show that our algorithm can reduce the transmission delay of data flows under severe signal interference and drastically changing channel states, and we demonstrate its adaptability and stability in different topologies. The method is general and can be extended to various types of topologies.
In this paper we discuss policy iteration methods for the approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach, the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
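Hard aggregation of the kind discussed here can be sketched in a few lines: states sharing a feature value are merged into one aggregate state, and the aggregate dynamics are obtained by averaging within each class. Uniform disaggregation weights are an assumption of this sketch, which omits the policy-iteration and neural feature-construction machinery of the paper.

```python
import numpy as np

def aggregate_mdp(P, R, phi, n_agg):
    """Hard aggregation with uniform disaggregation weights.

    phi[s] is the aggregate (feature) class of original state s.
    Returns aggregate transition matrices and reward vectors obtained
    by averaging over the member states of each class.
    """
    n_actions = len(P)
    members = [np.flatnonzero(phi == x) for x in range(n_agg)]
    P_agg = np.zeros((n_actions, n_agg, n_agg))
    R_agg = np.zeros((n_actions, n_agg))
    for a in range(n_actions):
        for x, grp in enumerate(members):
            R_agg[a, x] = R[a][grp].mean()
            for y, grp_y in enumerate(members):
                # average probability of jumping from class x into class y
                P_agg[a, x, y] = P[a][np.ix_(grp, grp_y)].sum(axis=1).mean()
    return P_agg, R_agg
```

The smaller aggregate problem can then be solved exactly, and its value function lifted back to the original states through the same feature map.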
Intelligent edge computing deploys edge devices of the Internet of Things (IoT) for data collection, computation, and intelligent analysis, so that data can be analyzed nearby and feedback given in a timely manner. Because of the mobility of mobile equipment (ME), if an ME moves out of the reach of the small cell networks (SCNs), the offloaded tasks cannot be returned to it successfully; as a result, migration incurs additional costs. In this paper, joint task offloading and migration schemes based on Reinforcement Learning (RL) are proposed for a mobility-aware Mobile Edge Computing (MEC) network to obtain the maximum system revenue. Firstly, the joint optimization problem of maximizing the total revenue of MEs is formulated in view of the mobility-aware MEs. Secondly, considering time-varying computation tasks and resource conditions, the mixed integer non-linear programming (MINLP) problem is described as a Markov Decision Process (MDP). Then we propose a novel reinforcement learning-based optimization framework to solve the problem instead of traditional methods. Finally, simulation results show that the proposed schemes can markedly raise the total revenue of MEs.
High-frequency (HF) communication is one of the essential communication methods for military and emergency applications. However, the selection of the communication frequency channel is always a difficult problem because of the crowded spectrum, time-varying channels, and malicious intelligent jamming. The existing frequency hopping, automatic link establishment, and some new anti-jamming technologies cannot completely solve the above problems. In this article, we adopt deep reinforcement learning to solve this intractable challenge. First, the combination of the spectrum state and the channel gain state is defined as the complex environmental state, and the Markov property of the defined state is analyzed and proved. Then, considering that the spectrum state and channel gain state are heterogeneous information, a new deep Q-network (DQN) framework is designed, which contains multiple sub-networks to process different kinds of information. Finally, aiming to improve the learning speed and efficiency, the optimization targets of the corresponding sub-networks are reasonably designed, and a heterogeneous information fusion deep reinforcement learning (HIF-DRL) algorithm is designed for the specific frequency selection. Simulation results show that the proposed algorithm performs well in channel prediction, jamming avoidance, and frequency channel selection.
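The channel-selection problem can be illustrated with a drastically simplified stand-in for the HIF-DRL algorithm: a tabular Q-learner facing a deterministic sweep jammer. The environment, learning rates, and reward below are assumptions for this sketch, not the article's model.

```python
import numpy as np

def train_channel_selector(n_channels=4, steps=5000, eps=0.2,
                           alpha=0.3, gamma=0.9, seed=0):
    """Tabular Q-learning against a cyclic sweep jammer.

    State = channel jammed at the previous step; reward = 1 for an
    un-jammed transmission. A toy stand-in for the DQN in the abstract.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_channels, n_channels))
    state = 0
    for _ in range(steps):
        jammed = (state + 1) % n_channels        # jammer sweeps cyclically
        if rng.random() < eps:                   # epsilon-greedy exploration
            action = int(rng.integers(n_channels))
        else:
            action = int(Q[state].argmax())
        reward = 0.0 if action == jammed else 1.0
        Q[state, action] += alpha * (reward + gamma * Q[jammed].max()
                                     - Q[state, action])
        state = jammed
    return Q
```

After training, the greedy policy avoids the channel the jammer is about to sweep from every state, which is the anti-jamming behavior the article pursues with far richer state information.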
This paper investigates a guidance method based on reinforcement learning (RL) for coplanar orbital interception in a continuous low-thrust scenario. The problem is formulated as a Markov decision process (MDP) model; then a well-designed RL algorithm, experience-based deep deterministic policy gradient (EBDDPG), is proposed to solve it. By taking advantage of prior information generated through the optimal control model, the proposed algorithm not only resolves the convergence problem of common RL algorithms but also successfully trains an efficient deep neural network (DNN) controller for the chaser spacecraft to generate the control sequence. Numerical simulation results show that the proposed algorithm is feasible and that the trained DNN controller improves the efficiency over traditional optimization methods by roughly two orders of magnitude.
With the rapid development of air transportation in recent years, airport operations have attracted a lot of attention, and among them the airport gate assignment problem (AGAP) has become a research hotspot. However, real-time AGAP algorithms remain an open issue. In this study, a deep reinforcement learning based AGAP (DRL-AGAP) is proposed. The optimization objective is to maximize the rate of flights assigned to fixed gates. The real-time AGAP is modeled as a Markov decision process (MDP), and the state space, action space, values, and rewards are defined. The DRL-AGAP algorithm is evaluated via simulation and compared with the flight pre-assignment results of the optimization software Gurobi and a greedy algorithm. Simulation results show that the performance of the proposed DRL-AGAP algorithm is close to that of the pre-assignment obtained by the Gurobi optimization solver. Meanwhile, real-time assignment is ensured by the proposed DRL-AGAP algorithm thanks to its dynamic modeling and lower complexity.
Reinforcement learning provides a cognitive science perspective on behavior and sequential decision making, since reinforcement learning algorithms introduce a computational concept of agency to the learning problem. Hence it addresses an abstract class of problems that can be characterized as follows: an algorithm confronted with information from an unknown environment is supposed to find, stepwise, an optimal way to behave based only on some sparse, delayed, or noisy feedback from an environment that changes according to the algorithm's behavior. Reinforcement learning thus offers an abstraction of the problem of goal-directed learning from interaction. The paper offers an opinionated introduction to the algorithmic advantages and drawbacks of several algorithmic approaches, so as to provide algorithmic design options.
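The interaction abstraction described here, an agent acting in an unknown environment under sparse and delayed feedback, is commonly written as a reset/step loop. A minimal sketch with a hypothetical toy environment (the chain world and its reward are illustrative, not from the paper):

```python
class ChainWorld:
    """Tiny environment with sparse, delayed feedback: reward arrives
    only on reaching the right end of a 1-D chain."""
    def __init__(self, length=5):
        self.length = length
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):               # action: 1 = right, 0 = left
        self.pos = max(0, min(self.length, self.pos + (1 if action else -1)))
        done = self.pos == self.length
        return self.pos, (1.0 if done else 0.0), done

def run_episode(env, policy, max_steps=100):
    """Goal-directed learning from interaction, reduced to its loop."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, r, done = env.step(policy(obs))
        total += r
        if done:
            break
    return total
```

Any of the algorithmic approaches the paper surveys plugs into this loop by replacing the `policy` callable with something that also learns from the `(obs, reward)` stream.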
Deep reinforcement learning (deep RL) has the potential to replace classic robotic controllers. State-of-the-art deep RL algorithms such as Proximal Policy Optimization, Twin Delayed Deep Deterministic Policy Gradient, and Soft Actor-Critic, to mention a few, have been investigated for training robots to walk. However, conflicting performance results for these algorithms have been reported in the literature. In this work, we present a performance analysis of the above three state-of-the-art deep RL algorithms for a constant-velocity walking task on a quadruped. The performance is analyzed by simulating the walking task of a quadruped equipped with a range of sensors present on a physical quadruped robot. Simulations of the three algorithms across a range of sensor inputs and with domain randomization are performed. The strengths and weaknesses of each algorithm for the given task are discussed. We also identify the set of sensors that contributes to the best performance of each deep RL algorithm.
Aim: To find a more efficient learning method based on temporal difference learning for delayed reinforcement learning tasks. Methods: A Q-learning algorithm based on truncated TD(λ), with adaptive schemes for selecting the λ value, addressed to absorbing Markov decision processes, was presented and implemented on computers. Results and Conclusion: Simulations on shortest-path searching problems show that using an adaptive λ in Q-learning based on TTD(λ) can speed up its convergence.
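The TTD(λ)-based Q-learning in this abstract can be sketched as tabular Q-learning with eligibility traces on a shortest-path chain. The adaptive λ-selection scheme, which is the paper's contribution, is omitted; a fixed λ and all other parameters here are illustrative.

```python
import numpy as np

def q_lambda_shortest_path(n=6, lam=0.7, alpha=0.2, gamma=0.95,
                           episodes=200, eps=0.1, seed=1):
    """Watkins-style Q(lambda) with accumulating eligibility traces
    on a 1-D shortest-path task: start at state 0, goal at state n-1,
    cost -1 per step. Traces are cut after exploratory actions."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, 2))                     # actions: 0 = left, 1 = right
    for _ in range(episodes):
        e = np.zeros_like(Q)                 # eligibility traces
        s = 0
        for _ in range(200):                 # step cap keeps episodes bounded
            greedy = rng.random() >= eps
            a = int(Q[s].argmax()) if greedy else int(rng.integers(2))
            s2 = min(n - 1, max(0, s + (1 if a == 1 else -1)))
            target = -1.0 + gamma * (0.0 if s2 == n - 1 else Q[s2].max())
            e[s, a] += 1.0
            Q += alpha * (target - Q[s, a]) * e
            e *= gamma * lam if greedy else 0.0   # Watkins's trace cut
            s = s2
            if s == n - 1:
                break
    return Q
```

With λ > 0 the one-step temporal-difference error is propagated along the recently visited state-action pairs in a single update, which is the mechanism behind the speed-up the abstract reports.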
Funding for the RRTD3 guidance-law paper: supported by the National Natural Science Foundation of China (Grant No. 12072090).
Funding for the edge caching paper: supported in part by the National Natural Science Foundation of China under Grant 62172255, in part by the Outstanding Youth Program of Hubei Natural Science Foundation under Grant 2022CFA080, and by the Wuhan AI Innovation Program (2022010702040056).
基金The financial support fromthe Major Science and Technology Programs inHenan Province(Grant No.241100210100)National Natural Science Foundation of China(Grant No.62102372)+3 种基金Henan Provincial Department of Science and Technology Research Project(Grant No.242102211068)Henan Provincial Department of Science and Technology Research Project(Grant No.232102210078)the Stabilization Support Program of The Shenzhen Science and Technology Innovation Commission(Grant No.20231130110921001)the Key Scientific Research Project of Higher Education Institutions of Henan Province(Grant No.24A520042)is acknowledged.
文摘Aiming at the rapid growth of network services,which leads to the problems of long service request processing time and high deployment cost in the deployment of network function virtualization service function chain(SFC)under 5G networks,this paper proposes a multi-agent deep deterministic policy gradient optimization algorithm for SFC deployment(MADDPG-SD).Initially,an optimization model is devised to enhance the request acceptance rate,minimizing the latency and deploying the cost SFC is constructed for the network resource-constrained case.Subsequently,we model the dynamic problem as a Markov decision process(MDP),facilitating adaptation to the evolving states of network resources.Finally,by allocating SFCs to different agents and adopting a collaborative deployment strategy,each agent aims to maximize the request acceptance rate or minimize latency and costs.These agents learn strategies from historical data of virtual network functions in SFCs to guide server node selection,and achieve approximately optimal SFC deployment strategies through a cooperative framework of centralized training and distributed execution.Experimental simulation results indicate that the proposed method,while simultaneously meeting performance requirements and resource capacity constraints,has effectively increased the acceptance rate of requests compared to the comparative algorithms,reducing the end-to-end latency by 4.942%and the deployment cost by 8.045%.
Funding for the quantum MDP (qMDP) paper: partly supported by the National Key R&D Program of China (No. 2018YFA0306701), the Australian Research Council (Nos. DP160101652 and DP180100691), the National Natural Science Foundation of China (No. 61832015), and the Key Research Program of Frontier Sciences, Chinese Academy of Sciences.
Funding for the singularly perturbed MDP paper: the research of this author is supported in part by the Office of Naval Research Grant N00014-96-1-0263.
Funding: Supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155885, Artificial Intelligence Convergence Innovation Human Resources Development (Hanyang University ERICA)); by the National Natural Science Foundation of China under Grant No. 61971264; and by the National Natural Science Foundation of China/Research Grants Council Collaborative Research Scheme under Grant No. 62261160390.
Abstract: Due to the fading characteristics of wireless channels and the burstiness of data traffic, how to handle congestion in ad-hoc networks with effective algorithms remains open and challenging. In this paper, we focus on enabling congestion control to minimize network transmission delay through flexible power control. To solve the congestion problem effectively, we propose a distributed cross-layer scheduling algorithm empowered by graph-based multi-agent deep reinforcement learning. The transmit power is adaptively adjusted in real time by our algorithm based only on local information (i.e., channel state information and queue length) and local communication (i.e., information exchanged with neighbors). Moreover, the training complexity of the algorithm is low thanks to regional cooperation based on a graph attention network. In the evaluation, we show that our algorithm can reduce the transmission delay of data flows under severe signal interference and drastically changing channel states, and we demonstrate its adaptability and stability across different topologies. The method is general and can be extended to various types of topologies.
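The regional-cooperation ingredient rests on attention weights over a node's neighbors. As a hedged, minimal illustration of that mechanism (a single attention head with toy raw scores, nothing like the paper's full network), the weights are simply a softmax over per-neighbor scores:

```python
import math

def attention_weights(scores):
    """Softmax over raw per-neighbor attention scores (numerically stabilized)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical raw scores a node assigns to its three neighbors,
# e.g. derived from their queue lengths and channel states.
w = attention_weights([2.0, 1.0, 0.1])
```

A neighbor with a higher score (here the first one) receives proportionally more weight when the node aggregates neighborhood information.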
Abstract: In this paper we discuss policy iteration methods for the approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
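A minimal sketch of hard, feature-based aggregation for policy evaluation, assuming a toy four-state chain, one feature that pools the states into two groups, and uniform disaggregation within each group (all numbers are illustrative, not from the paper):

```python
import numpy as np

n, gamma = 4, 0.9
# Toy transition matrix and rewards for a fixed policy (illustrative).
P = np.array([[0.5, 0.4, 0.1, 0.0],
              [0.4, 0.5, 0.0, 0.1],
              [0.1, 0.0, 0.5, 0.4],
              [0.0, 0.1, 0.4, 0.5]])
r = np.array([1.0, 1.0, 4.0, 4.0])
phi = np.array([0, 0, 1, 1])            # feature (aggregate index) per state

m = phi.max() + 1
D = np.zeros((m, n))                    # disaggregation: uniform within each group
for g in range(m):
    members = np.flatnonzero(phi == g)
    D[g, members] = 1.0 / len(members)
Phi = np.zeros((n, m))                  # aggregation: state -> its group
Phi[np.arange(n), phi] = 1.0

# Aggregate evaluation problem and its exact solution, lifted back to states.
P_agg, r_agg = D @ P @ Phi, D @ r
V_agg = np.linalg.solve(np.eye(m) - gamma * P_agg, r_agg)
V = Phi @ V_agg
```

States sharing a feature value receive the same approximate cost, which is the sense in which the approximation is piecewise constant in the features.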
Funding: Supported in part by the National Natural Science Foundation of China under Grant 61701038.
Abstract: Intelligent edge computing deploys edge devices of the Internet of Things (IoT) for data collection, computation, and intelligent analysis, so that data can be analyzed nearby and feedback provided in a timely manner. Because of the mobility of mobile equipments (MEs), if an ME moves out of the coverage of the small cell networks (SCNs), the offloaded tasks cannot be returned to it successfully, and the resulting migration incurs additional costs. In this paper, joint task offloading and migration schemes based on reinforcement learning (RL) are proposed for a mobility-aware mobile edge computing (MEC) network to obtain the maximum system revenue. Firstly, the joint optimization problem of maximizing the total revenue of the MEs is formulated in view of the mobility-aware MEs. Secondly, considering time-varying computation tasks and resource conditions, the mixed integer non-linear programming (MINLP) problem is described as a Markov decision process (MDP). We then propose a novel reinforcement learning-based optimization framework to solve the problem instead of traditional methods. Finally, simulation results show that the proposed schemes can noticeably raise the total revenue of the MEs.
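As a toy, hedged illustration of casting an offloading decision as an MDP and solving it with RL (tabular Q-learning here, far simpler than the paper's MINLP-derived framework): a mobile equipment observes a good or bad channel and chooses between local computation and offloading. All rewards and dynamics are invented for the example.

```python
import random

random.seed(0)
STATES, ACTIONS = [0, 1], [0, 1]        # state: 0 bad / 1 good channel; action: 0 local / 1 offload

def step(s, a):
    # Offloading pays off only under a good channel (hypothetical rewards).
    reward = 1.0 if a == 0 else (3.0 if s == 1 else -1.0)
    s_next = random.choice(STATES)      # channel evolves independently in this toy
    return reward, s_next

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps, s = 0.1, 0.9, 0.2, 0
for _ in range(20000):
    if random.random() < eps:           # epsilon-greedy exploration
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda x: Q[(s, x)])
    r, s2 = step(s, a)
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
    s = s2

policy = {st: max(ACTIONS, key=lambda a: Q[(st, a)]) for st in STATES}
```

The learned policy computes locally under a bad channel and offloads under a good one, which is the qualitative behavior one would expect from the revenue model sketched above.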
Funding: Supported by the Guangxi Key Laboratory Fund of Embedded Technology and Intelligent System under Grant No. 2018B-1; the Natural Science Foundation for Distinguished Young Scholars of Jiangsu Province under Grant No. BK20160034; the National Natural Science Foundation of China under Grant No. 61771488, No. 61671473 and No. 61631020; and in part by the Open Research Foundation of Science and Technology on Communication Networks Laboratory.
Abstract: High-frequency (HF) communication is one of the essential communication methods for military and emergency applications. However, the selection of the communication frequency channel has always been a difficult problem due to the crowded spectrum, time-varying channels, and malicious intelligent jamming. Existing frequency hopping, automatic link establishment, and newer anti-jamming technologies cannot completely solve these problems. In this article, we adopt deep reinforcement learning to address this intractable challenge. First, the combination of the spectrum state and the channel gain state is defined as the composite environmental state, and the Markov property of the defined state is analyzed and proved. Then, considering that the spectrum state and channel gain state are heterogeneous information, a new deep Q-network (DQN) framework is designed, which contains multiple sub-networks to process the different kinds of information. Finally, to improve the learning speed and efficiency, the optimization targets of the corresponding sub-networks are carefully designed, and a heterogeneous information fusion deep reinforcement learning (HIF-DRL) algorithm is devised for frequency selection. Simulation results show that the proposed algorithm performs well in channel prediction, jamming avoidance, and frequency channel selection.
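The multi-branch idea can be sketched in a heavily simplified form: spectrum occupancy and channel gains pass through separate branches before being fused into per-channel Q-values. The random toy weights below stand in for trained sub-networks and are purely illustrative, not the paper's DQN.

```python
import numpy as np

rng = np.random.default_rng(0)
spectrum = rng.random(8)                # occupancy of 8 channels (toy input)
gains = rng.random(8)                   # channel gain estimates (toy input)

# Separate branches for the two heterogeneous inputs (random toy weights).
W_s = rng.standard_normal((4, 8))
W_g = rng.standard_normal((4, 8))
h = np.concatenate([np.tanh(W_s @ spectrum), np.tanh(W_g @ gains)])

# Fusion head producing one Q-value per candidate frequency channel.
W_out = rng.standard_normal((8, 8))
q_values = W_out @ h
best_channel = int(np.argmax(q_values))
```

A trained version of the two branches would learn representations suited to each information type, which is the motivation the abstract gives for the heterogeneous-input design.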
Funding: Supported by the National Defense Science and Technology Innovation (18-163-15-LZ-001-004-13).
Abstract: This paper investigates a guidance method based on reinforcement learning (RL) for coplanar orbital interception in a continuous low-thrust scenario. The problem is formulated as a Markov decision process (MDP) model, and a well-designed RL algorithm, experience-based deep deterministic policy gradient (EBDDPG), is proposed to solve it. By taking advantage of prior information generated through an optimal control model, the proposed algorithm not only resolves the convergence problem of the common RL algorithm, but also successfully trains an efficient deep neural network (DNN) controller for the chaser spacecraft to generate the control sequence. Numerical simulation results show that the proposed algorithm is feasible and that the trained DNN controller improves efficiency over traditional optimization methods by roughly two orders of magnitude.
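The "experience-based" ingredient, i.e. warm-starting learning with trajectories from a prior optimal-control solution, can be sketched in a heavily simplified, hypothetical form as seeding a DDPG-style replay buffer before training begins:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal fixed-capacity replay buffer of (s, a, r, s') transitions."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
    def add(self, transition):
        self.buf.append(transition)
    def sample(self, k):
        return random.sample(list(self.buf), k)

# Hypothetical transitions from a prior optimal-control solution
# (states/actions are placeholder 1-tuples, not real orbital data).
prior_trajectory = [((0.0,), (0.1,), -1.0, (0.1,)),
                    ((0.1,), (0.2,), -1.0, (0.3,))]

buffer = ReplayBuffer(capacity=1000)
for tr in prior_trajectory:             # seed before any RL interaction
    buffer.add(tr)
```

Training then draws minibatches from a buffer that already contains near-optimal behavior, which is one common way to sidestep the cold-start convergence problem the abstract mentions.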
Funding: Supported by the National Natural Science Foundation of China (No. U1633115) and the Science and Technology Foundation of Beijing Municipal Commission of Education (No. KM201810005027).
Abstract: With the rapid development of air transportation in recent years, airport operations have attracted much attention, and among them the airport gate assignment problem (AGAP) has become a research hotspot. However, real-time AGAP algorithms remain an open issue. In this study, a deep reinforcement learning based AGAP (DRL-AGAP) method is proposed. The optimization objective is to maximize the rate of flights assigned to fixed gates. The real-time AGAP is modeled as a Markov decision process (MDP), with the state space, action space, value, and rewards defined accordingly. The DRL-AGAP algorithm is evaluated via simulation and compared with the flight pre-assignment results of the Gurobi optimization solver and a greedy baseline. Simulation results show that the performance of the proposed DRL-AGAP algorithm is close to that of the pre-assignment obtained by the Gurobi optimization solver. Meanwhile, real-time assignment capability is ensured by the proposed DRL-AGAP algorithm owing to its dynamic modeling and lower complexity.
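The greedy baseline the abstract compares against can be sketched as follows, with the objective stated there (the fraction of flights assigned to fixed gates). The flight intervals and gate count are illustrative; the real problem carries many more constraints (gate sizes, buffer times, airline preferences).

```python
def greedy_assign(flights, n_gates):
    """Assign each flight, in arrival order, to any gate free for its stay;
    return the fraction of flights that received a gate."""
    busy_until = [0] * n_gates          # earliest free time per gate
    assigned = 0
    for arr, dep in sorted(flights):
        for g in range(n_gates):
            if busy_until[g] <= arr:    # gate is free when the flight arrives
                busy_until[g] = dep
                assigned += 1
                break
    return assigned / len(flights)

# Hypothetical (arrival, departure) occupancy intervals for five flights.
flights = [(0, 3), (1, 4), (2, 5), (4, 7), (5, 8)]
rate = greedy_assign(flights, n_gates=2)
```

With two gates, one of the five toy flights finds no free gate, so the assignment rate is 0.8; the DRL agent's reward pushes it to beat such myopic choices.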
Funding: Supported by the National Natural Science Foundation of China (60474035), the National Research Foundation for the Doctoral Program of Higher Education of China (20050359004), and the Natural Science Foundation of Anhui Province (070412035).
Abstract: Reinforcement learning provides a cognitive-science perspective on behavior and sequential decision making, in that reinforcement learning algorithms introduce a computational concept of agency into the learning problem. It hence addresses an abstract class of problems that can be characterized as follows: an algorithm confronted with information from an unknown environment is supposed to find, step by step, an optimal way to behave based only on sparse, delayed, or noisy feedback from an environment that changes according to the algorithm's behavior. Reinforcement learning thus offers an abstraction of the problem of goal-directed learning from interaction. The paper offers an opinionated introduction to the algorithmic advantages and drawbacks of several algorithmic approaches, so as to lay out algorithmic design options.
Abstract: Deep reinforcement learning (deep RL) has the potential to replace classic robotic controllers. State-of-the-art deep RL algorithms such as Proximal Policy Optimization, Twin Delayed Deep Deterministic Policy Gradient, and Soft Actor-Critic, to mention a few, have been investigated for training robots to walk. However, conflicting performance results for these algorithms have been reported in the literature. In this work, we present a performance analysis of the above three state-of-the-art deep RL algorithms on a constant-velocity walking task for a quadruped. The performance is analyzed by simulating the walking task of a quadruped equipped with the range of sensors present on a physical quadruped robot. Simulations of the three algorithms across a range of sensor inputs and with domain randomization are performed. The strengths and weaknesses of each algorithm for the given task are discussed. We also identify the set of sensors that contributes to the best performance of each deep RL algorithm.
Abstract: Aim: To find a more efficient learning method based on temporal difference learning for delayed reinforcement learning tasks. Methods: A Q-learning algorithm based on truncated TD(λ), with an adaptive scheme for selecting the λ value, addressed to absorbing Markov decision processes, was presented and implemented on computers. Results and Conclusion: Simulations on shortest-path searching problems show that using an adaptive λ in Q-learning based on TTD(λ) can speed up its convergence.
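A minimal sketch of Watkins-style Q-learning with truncated eligibility traces on a toy shortest-path chain. The adaptive λ-selection scheme of the abstract is replaced here by a fixed λ (truncation simply drops traces once they decay below a threshold), and all parameters are illustrative.

```python
import random

random.seed(1)
N = 5                                   # states 0..4, absorbing goal at state 4
ACTIONS = [-1, +1]                      # move left / move right
alpha, gamma, lam, eps = 0.2, 0.95, 0.7, 0.1
Q = [[0.0, 0.0] for _ in range(N)]

for _ in range(500):
    s, trace = 0, {}
    while s != N - 1:
        if random.random() < eps:       # epsilon-greedy action selection
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda i: Q[s][i])
        greedy_taken = Q[s][a] >= Q[s][1 - a]
        s2 = min(max(s + ACTIONS[a], 0), N - 1)
        r = 0.0 if s2 == N - 1 else -1.0
        delta = r + gamma * max(Q[s2]) - Q[s][a]
        trace[(s, a)] = trace.get((s, a), 0.0) + 1.0
        for key in list(trace):         # replay delta along all active traces
            si, ai = key
            Q[si][ai] += alpha * delta * trace[key]
            trace[key] *= gamma * lam
            if trace[key] < 1e-3:       # truncate negligible traces
                del trace[key]
        if not greedy_taken:            # Watkins: cut traces after exploration
            trace.clear()
        s = s2

policy = [max((0, 1), key=lambda a: Q[st][a]) for st in range(N - 1)]
```

The step cost of -1 makes the greedy policy move right toward the absorbing goal in every non-goal state; the traces propagate each temporal-difference error back over the recent path, which is the source of the speed-up the abstract reports.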