Solving constrained multi-objective optimization problems with evolutionary algorithms has attracted considerable attention. Various constrained multi-objective optimization evolutionary algorithms (CMOEAs) have been developed with the use of different algorithmic strategies, evolutionary operators, and constraint-handling techniques. The performance of CMOEAs may be heavily dependent on the operators used; however, it is usually difficult to select suitable operators for the problem at hand. Hence, improving operator selection is promising and necessary for CMOEAs. This work proposes an online operator selection framework assisted by Deep Reinforcement Learning. The dynamics of the population, including convergence, diversity, and feasibility, are regarded as the state; the candidate operators are considered as actions; and the improvement of the population state is treated as the reward. By using a Q-network to learn a policy that estimates the Q-values of all actions, the proposed approach can adaptively select the operator that maximizes the improvement of the population according to the current state, thereby improving algorithmic performance. The framework is embedded into four popular CMOEAs and assessed on 42 benchmark problems. The experimental results reveal that the proposed Deep Reinforcement Learning-assisted operator selection significantly improves the performance of these CMOEAs, and the resulting algorithm obtains better versatility compared to nine state-of-the-art CMOEAs.
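As an illustration of the state-action-reward mapping described above, the following is a minimal sketch of Q-value-based operator selection. The three-dimensional state, the single linear layer standing in for the Q-network, and all constants are simplifying assumptions for illustration, not the paper's actual architecture.

import numpy as np

rng = np.random.default_rng(0)
N_OPS = 3                                    # candidate operators (actions)
W = rng.normal(scale=0.1, size=(N_OPS, 3))   # toy linear "Q-network" weights

def q_values(state):
    return W @ state                         # Q(s, a) for all operators at once

def select_operator(state, epsilon=0.1):
    if rng.random() < epsilon:               # occasional exploration
        return int(rng.integers(N_OPS))
    return int(np.argmax(q_values(state)))   # exploit the learned policy

def td_update(state, action, reward, next_state, alpha=0.01, gamma=0.9):
    target = reward + gamma * np.max(q_values(next_state))
    td_error = target - q_values(state)[action]
    W[action] += alpha * td_error * state    # gradient step for the linear Q

# one interaction: measure the population state (convergence, diversity,
# feasibility), pick an operator, apply it in the CMOEA for a generation,
# then use the improvement of the state as the reward for td_update
state = np.array([0.4, 0.7, 0.2])
op = select_operator(state)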
Dynamic path planning is crucial for mobile robots to navigate successfully in unstructured environments. To achieve a globally optimal path and real-time dynamic obstacle avoidance during movement, a dynamic path planning algorithm incorporating an improved IB-RRT∗ and deep reinforcement learning (DRL) is proposed. First, an improved IB-RRT∗ algorithm is proposed for global path planning by combining double elliptic subset sampling and probabilistic central circle target bias. Then, to tackle the slow response to dynamic obstacles and inadequate obstacle avoidance of traditional local path planning algorithms, deep reinforcement learning is utilized to predict the movement trend of dynamic obstacles, leading to a dynamic fusion path planning. Finally, simulation and experiment results demonstrate that the proposed improved IB-RRT∗ algorithm has higher convergence speed and search efficiency compared with the traditional Bi-RRT∗, Informed-RRT∗, and IB-RRT∗ algorithms. Furthermore, the proposed fusion algorithm can effectively perform real-time obstacle avoidance and navigation tasks for mobile robots in unstructured environments.
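For context, the ellipsoidal informed sampling that IB-RRT∗-style planners build on can be sketched as follows (2-D case, assuming the current best path length c_best is at least the start-goal distance; the paper's double elliptic subset sampling and probabilistic central circle target bias are refinements not reproduced here).

import numpy as np

rng = np.random.default_rng(0)

def informed_sample(start, goal, c_best):
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    c_min = np.linalg.norm(goal - start)          # focal distance of the ellipse
    centre = (start + goal) / 2.0
    dx, dy = (goal - start) / c_min               # rotate x-axis onto start->goal
    C = np.array([[dx, -dy], [dy, dx]])
    r = np.array([c_best / 2.0,                   # semi-axes of the ellipse
                  np.sqrt(c_best**2 - c_min**2) / 2.0])
    theta = rng.uniform(0.0, 2.0 * np.pi)         # uniform sample in unit disc,
    rho = np.sqrt(rng.uniform())                  # then stretch/rotate/translate
    unit = rho * np.array([np.cos(theta), np.sin(theta)])
    return C @ (r * unit) + centre

sample = informed_sample([0, 0], [10, 0], c_best=12.0)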
In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem, whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
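To make the aggregation approximation concrete, the following uses the notation commonly adopted for feature-based aggregation (the symbols are assumptions chosen for illustration, not lifted from this abstract). With aggregate states x, aggregation probabilities \phi_{ix}, and r^{*} the solution of the aggregate problem's Bellman equation, the cost of an original state i is approximated as

\tilde{J}(i) \;=\; \sum_{x} \phi_{ix}\, r_{x}^{*}.

For hard aggregation, where each state i maps to a single feature value F(i), this reduces to \tilde{J}(i) = r^{*}_{F(i)}, a nonlinear (piecewise constant) function of the features, in contrast to the linear-in-features approximations mentioned above.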
With the rapid development of air transportation in recent years, airport operations have attracted a lot of attention. Among them, the airport gate assignment problem (AGAP) has become a research hotspot. However, a real-time AGAP algorithm is still an open issue. In this study, a deep reinforcement learning based AGAP (DRL-AGAP) is proposed. The optimization objective is to maximize the rate of flights assigned to fixed gates. The real-time AGAP is modeled as a Markov decision process (MDP), for which the state space, action space, values, and rewards have been defined. The DRL-AGAP algorithm is evaluated via simulation and compared with the flight pre-assignment results of the optimization software Gurobi and a greedy method. Simulation results show that the performance of the proposed DRL-AGAP algorithm is close to that of the pre-assignment obtained by the Gurobi optimization solver. Meanwhile, real-time assignment ability is ensured by the proposed DRL-AGAP algorithm owing to its dynamic modeling and lower complexity.
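A toy sketch of the MDP formulation described above, under assumed shapes: the state tracks when each fixed gate becomes free, the action is a gate index (or the apron), and the reward is 1 when a flight is placed at a fixed gate. The real DRL-AGAP state, transition, and reward design is richer than this.

import numpy as np

N_GATES = 5

def step(gate_free_until, flight, action):
    """gate_free_until: per-gate free times; flight: (arrival, departure)."""
    arrival, departure = flight
    if action < N_GATES and gate_free_until[action] <= arrival:
        gate_free_until = gate_free_until.copy()
        gate_free_until[action] = departure
        return gate_free_until, 1.0      # assigned to a fixed gate: reward
    return gate_free_until, 0.0          # sent to the apron: no reward

gates = np.zeros(N_GATES)
gates, reward = step(gates, flight=(10.0, 55.0), action=2)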
The active control of flow past an elliptical cylinder using the deep reinforcement learning (DRL) method is conducted. The axis ratio of the elliptical cylinder Γ varies from 1.2 to 2.0, and four angles of attack α = 0°, 15°, 30°, and 45° are taken into consideration for a fixed Reynolds number Re = 100. The mass flow rates of two synthetic jets imposed at different positions of the cylinder, θ1 and θ2, are trained to control the flow. The optimal jet placement that achieves the highest drag reduction is determined for each case. For a low axis ratio ellipse, i.e., Γ = 1.2, the controlled results at α = 0° are similar to those for a circular cylinder with control jets applied at θ1 = 90° and θ2 = 270°. It is found that either applying the jets asymmetrically or increasing the angle of attack can achieve a higher drag reduction rate, which, however, is accompanied by increased fluctuation. The control jets elongate the vortex shedding and reduce the pressure drop. Meanwhile, the flow topology is modified at a high angle of attack. For an ellipse with a relatively higher axis ratio, i.e., Γ ≥ 1.6, drag reduction is achieved for all the angles of attack studied: the larger the angle of attack, the higher the drag reduction ratio. Increased fluctuation in the drag coefficient under control is encountered regardless of the position of the control jets. The control jets modify the flow topology by inducing an external vortex near the wall, causing the drag reduction. The results suggest that DRL can learn an active control strategy for the present configuration.
A large number of logistics operations are needed to transport fabric rolls and dye barrels to different positions in printing and dyeing plants, and increasing labor cost is making it difficult for plants to recruit workers to complete manual operations. Artificial intelligence and robotics, which are rapidly evolving, offer potential solutions to this problem. In this paper, a navigation method dedicated to solving the issues of the inability to pass smoothly at corners in practice and local obstacle avoidance is presented. In the system, a Gaussian fitting smoothing rapid exploration random tree star-smart (GFS RRT*-Smart) algorithm is proposed for global path planning; it enhances the performance when the robot makes a sharp turn around corners. In local obstacle avoidance, a deep reinforcement learning determiner, the mixed actor-critic (MAC) algorithm, is used for obstacle avoidance decisions. The navigation system is implemented in a scaled-down simulation factory.
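One plausible reading of the Gaussian-smoothing step is a coordinate-wise low-pass filter over the raw RRT*-Smart waypoint polyline, so the robot does not face discontinuous headings at corners. The sketch below is that interpretation only; the paper's actual Gaussian fitting procedure may differ, and sigma is an assumed tuning parameter.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_path(waypoints, sigma=2.0):
    wp = np.asarray(waypoints, dtype=float)
    # filter x and y sequences along the path with a Gaussian kernel
    sm = gaussian_filter1d(wp, sigma=sigma, axis=0, mode="nearest")
    sm[0], sm[-1] = wp[0], wp[-1]        # keep start and goal pinned
    return sm

path = smooth_path([[0, 0], [1, 0], [1, 1], [2, 1], [3, 3]])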
The unmanned aerial vehicle (UAV) swarm technology is one of the research hotspots of recent years. With the continuous improvement of the autonomous intelligence of UAVs, swarm technology will become one of the main trends of UAV development in the future. This paper studies the behavior decision-making process of the UAV swarm rendezvous task based on the double deep Q network (DDQN) algorithm. We design a guided reward function to effectively solve the convergence problem caused by sparse returns in deep reinforcement learning (DRL) for long-period tasks. We also propose the concept of a temporary storage area, optimizing the memory playback unit of the traditional DDQN algorithm, improving the convergence speed of the algorithm, and speeding up the training process. Different from the traditional task environment, this paper establishes a continuous state-space task environment model to improve the authentication process of the UAV task environment. Based on the DDQN algorithm, the collaborative tasks of the UAV swarm in different task scenarios are trained. The experimental results validate that the DDQN algorithm is efficient in training the UAV swarm to complete the given collaborative tasks while meeting the requirements of the UAV swarm for centralization and autonomy, and improves the intelligence of UAV swarm collaborative task execution. The simulation results show that, after training, the proposed UAV swarm can carry out the rendezvous task well, and the success rate of the mission reaches 90%.
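For reference, the double-DQN target the agents regress toward has the standard form below: the online network selects the next action and the target network evaluates it, which reduces the overestimation bias of vanilla DQN. The guided_reward shape is an illustrative assumption in the spirit of the dense shaping the paper motivates, not its exact function.

import numpy as np

def ddqn_target(q_online, q_target, reward, next_state, gamma=0.99, done=False):
    """q_online / q_target: callables returning Q-values for all actions."""
    if done:
        return reward
    a_star = int(np.argmax(q_online(next_state)))       # select with online net
    return reward + gamma * q_target(next_state)[a_star]  # evaluate with target net

def guided_reward(dist_to_rendezvous, reached, k=0.01):
    # dense shaping of the sparse rendezvous return (assumed form)
    return 1.0 if reached else -k * dist_to_rendezvous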
Mobile distributed caching (MDC) as an emerging technology has drawn attention for its ability to shorten the distance between users and data in the wireless network. However, the DC network state in the existing work is always assumed to be either static or updated in real time. To be more realistic, a periodically updated wireless network using maximum distance separable (MDS)-coded DC is studied, in each period of which devices may arrive and leave. For efficient optimization of the system at large scale, this work proposes a blockchain-based cooperative deep reinforcement learning (DRL) approach, which enhances the efficiency of learning through cooperation and guarantees the security of that cooperation through a practical Byzantine fault tolerance (PBFT)-based blockchain mechanism. Numerical results are presented, illustrating that the proposed scheme can dramatically reduce the total file download delay in the DC network under the guarantee of security and efficiency.
The main function of the power communication business is to monitor, control, and manage the power communication network to ensure its normal and stable operation. Communication services related to dispatching data networks and the transmission of fault information or feeder automation have high requirements for delay; if processing time is prolonged, a power business cascade reaction may be triggered. In order to solve the above problems, this paper establishes an edge object-linked agent business deployment model for the power communication network to unify the management of data collection, resource allocation, and task scheduling within the system. It realizes the virtualization of object-linked agent computing resources through Docker container technology, designs a target model of network latency and energy consumption, and introduces the A3C algorithm from deep reinforcement learning, improving it according to scene characteristics and setting corresponding optimization strategies to minimize network delay and energy consumption. At the same time, to ensure that sensitive power business is handled in time, this paper designs a business dispatch model and a task migration model, and solves the problem of server failure. Finally, the corresponding simulation program is designed to verify the feasibility and validity of this method, and to compare it with other existing mechanisms.
To address the problem of manoeuvring decision-making in UAV air combat, this study establishes a one-to-one air combat model, defines missile attack areas, and uses the non-deterministic policy Soft Actor-Critic (SAC) algorithm from deep reinforcement learning to construct a decision model that realizes the manoeuvring process. At the same time, the complexity of the proposed algorithm is calculated, and the stability of the closed-loop air combat decision-making system controlled by the neural network is analysed with a Lyapunov function. This study defines the UAV air combat process as a gaming process and proposes a Parallel Self-Play training SAC algorithm (PSP-SAC) to improve the generalisation performance of UAV control decisions. Simulation results have shown that the proposed algorithm can realize sample sharing and policy sharing in multiple combat environments and can significantly improve the generalisation ability of the model compared to independent training.
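For reference, the entropy-regularized objective that distinguishes SAC from deterministic-policy methods can be written in its standard form (not specific to this paper) as

J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big],

where the temperature \alpha trades expected return against policy entropy \mathcal{H}. The stochastic (non-deterministic) policy this objective induces is what the parallel self-play scheme above trains across multiple combat environments.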
Production optimization has gained increasing attention from the smart oilfield community because it can increase economic benefits and oil recovery substantially. While existing methods can produce high-optimality results, they cannot be applied to real-time optimization for large-scale reservoirs due to high computational demands. In addition, most methods generally assume that the reservoir model is deterministic and ignore the uncertainty of the subsurface environment, making the obtained scheme unreliable for practical deployment. In this work, an efficient and robust method, namely evolutionary-assisted reinforcement learning (EARL), is proposed to achieve real-time production optimization under uncertainty. Specifically, the production optimization problem is modeled as a Markov decision process in which a reinforcement learning agent interacts with the reservoir simulator to train a control policy that maximizes the specified goals. To deal with the brittle convergence properties and the lack of efficient exploration strategies of reinforcement learning approaches, a population-based evolutionary algorithm is introduced to assist the training of agents, which provides diverse exploration experiences and promotes stability and robustness due to its inherent redundancy. Compared with prior methods that only optimize a solution for a particular scenario, the proposed approach trains a policy that can adapt to uncertain environments and make real-time decisions to cope with unknown changes. The trained policy, represented by a deep convolutional neural network, can adaptively adjust the well controls based on different reservoir states. Simulation results on two reservoir models show that the proposed approach not only outperforms the RL and EA methods in terms of optimization efficiency but also has strong robustness and real-time decision capacity.
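A minimal sketch of an evolutionary-assisted training loop of the kind described above, under strong simplifications: policies are plain parameter vectors, evaluate() is a stand-in for running the reservoir simulator, and the RL gradient step is abstracted to a pull toward elite experiences. All names and constants are illustrative, not EARL's actual design.

import numpy as np

rng = np.random.default_rng(0)

def evaluate(theta):                     # placeholder for simulator rollouts
    return -np.sum((theta - 1.0) ** 2)

def earl(pop_size=8, dim=4, iters=50, sigma=0.1):
    population = [rng.normal(size=dim) for _ in range(pop_size)]
    rl_policy = rng.normal(size=dim)
    replay = []                                      # shared exploration experiences
    for _ in range(iters):
        scored = sorted(population, key=evaluate, reverse=True)
        replay.extend(scored[:2])                    # elites feed the RL agent
        # stand-in for a gradient step: pull the RL policy toward recent elites
        rl_policy += 0.1 * (np.mean(replay[-2:], axis=0) - rl_policy)
        # evolve: keep elites, mutate them to refill the population
        population = scored[: pop_size // 2] + [
            p + sigma * rng.normal(size=dim) for p in scored[: pop_size // 2]
        ]
        population[-1] = rl_policy.copy()            # inject the RL policy back
    return rl_policy

policy = earl()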
The guidance strategy is an extremely critical factor in determining the striking effect of a missile operation. A novel guidance law is presented by exploiting deep reinforcement learning (DRL) with a hierarchical deep deterministic policy gradient (DDPG) algorithm. The reward functions are constructed to minimize the line-of-sight (LOS) angle rate and avoid the threat caused by the opposed obstacles. To attenuate the chattering of the acceleration, a hierarchical reinforcement learning structure and an improved reward function with an action penalty are put forward. The simulation results validate that a missile under the proposed method can hit the target successfully and keep away from the threatened areas effectively.
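A sketch of a reward with the shape the abstract describes: penalize the LOS angle rate, penalize proximity to threat zones, and add an action penalty that damps acceleration chattering. The weights, the threat model, and the function name are placeholders, not the paper's actual design.

def guidance_reward(los_rate, dist_to_threats, accel, prev_accel,
                    k_los=1.0, k_threat=5.0, k_act=0.1, safe_r=1.0):
    r = -k_los * abs(los_rate)                        # drive the LOS rate to zero
    r -= k_threat * sum(max(0.0, safe_r - d)          # penalty grows inside the
                        for d in dist_to_threats)     # assumed safety radius
    r -= k_act * abs(accel - prev_accel)              # action penalty vs. chattering
    return r

r = guidance_reward(0.02, [3.0, 0.6], accel=2.0, prev_accel=1.5)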
We show the practicality of two existing meta-learning algorithms, Model-Agnostic Meta-Learning and Fast Context Adaptation Via Meta-learning, using an evolutionary strategy for parameter optimization, and propose two novel quantum adaptations of those algorithms using continuous quantum neural networks, for learning to trade portfolios of stocks on the stock market. The goal of meta-learning is to train a model on a variety of tasks, such that it can solve new learning tasks using only a small number of training samples. In our classical approach, we trained our meta-learning models on a variety of portfolios that contained 5 randomly sampled Consumer Cyclical stocks from a pool of 60. In our quantum approach, we trained our quantum meta-learning models on a simulated quantum computer with portfolios containing 2 randomly sampled Consumer Cyclical stocks. Our findings suggest that both classical models could learn a new portfolio with 0.01% of the number of training samples needed to learn the original portfolios and can achieve a comparable performance within 0.1% Return on Investment of the Buy and Hold strategy. We also show that our much smaller quantum meta-learned models, with only 60 model parameters and 25 training epochs, have a similar learning pattern to our much larger classical meta-learned models that have over 250,000 model parameters and 2500 training epochs. Given these findings, we also discuss the benefits of scaling up our experiments from a simulated quantum computer to a real quantum computer. To the best of our knowledge, we are the first to apply the ideas of both classical meta-learning as well as quantum meta-learning to enhance stock trading.
To address the problems of low solution accuracy and high decision pressure when a single agent faces large-scale dynamic task allocation (DTA) and a high-dimensional decision space, this paper combines deep reinforcement learning (DRL) theory with a multi-agent architecture and proposes an improved Multi-Agent Deep Deterministic Policy Gradient algorithm (MADDPG-D2) with a dual experience replay pool and dual noise to improve the efficiency of DTA. The algorithm is based on the traditional Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm and introduces a double noise mechanism to increase the action exploration space in the early stage of training, as well as a double experience pool to improve the data utilization rate. At the same time, in order to accelerate the training speed and efficiency of the agents and to solve the cold-start problem of training, a priori knowledge technology is applied to the training of the algorithm. Finally, the MADDPG-D2 algorithm is compared and analyzed on a digital battlefield of ground and air confrontation. The experimental results show that the agents trained by the MADDPG-D2 algorithm achieve higher win rates and average rewards, utilize resources more reasonably, and better solve the difficulty that traditional single-agent algorithms face in high-dimensional decision spaces. The MADDPG-D2 algorithm based on the multi-agent architecture proposed in this paper has certain superiority and rationality in DTA.
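A sketch of the double-noise idea: superimpose two noise processes on the deterministic actor output early in training, here i.i.d. Gaussian noise plus temporally correlated Ornstein-Uhlenbeck noise with an annealing schedule. The combination and its schedule are assumptions for illustration, not the exact MADDPG-D2 mechanism.

import numpy as np

rng = np.random.default_rng(0)

class OUNoise:
    def __init__(self, dim, theta=0.15, sigma=0.2):
        self.x = np.zeros(dim)
        self.theta, self.sigma = theta, sigma
    def sample(self):
        # mean-reverting step plus Gaussian increment (temporally correlated)
        self.x += self.theta * (-self.x) + self.sigma * rng.normal(size=self.x.shape)
        return self.x

ou = OUNoise(dim=2)

def explore(actor_action, step, decay=1e-4, low=-1.0, high=1.0):
    scale = np.exp(-decay * step)                    # anneal noise over training
    noisy = actor_action + scale * (0.1 * rng.normal(size=actor_action.shape)
                                    + ou.sample())   # Gaussian + OU (double noise)
    return np.clip(noisy, low, high)

a = explore(np.array([0.3, -0.5]), step=100)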
The UAV pursuit-evasion problem focuses on the efficient tracking and capture of evading targets using unmanned aerial vehicles (UAVs), which is pivotal in public safety applications, particularly in scenarios involving intrusion monitoring and interception. To address the challenges of data acquisition, real-world deployment, and the limited intelligence of existing algorithms in UAV pursuit-evasion tasks, we propose an innovative swarm intelligence-based UAV pursuit-evasion control framework, namely the "Boids Model-based DRL Approach for Pursuit and Escape" (Boids-PE), which synergizes the strengths of swarm intelligence from bio-inspired algorithms and deep reinforcement learning (DRL). The Boids model, which simulates collective behavior through three fundamental rules (separation, alignment, and cohesion), is adopted in our work. By integrating the Boids model with the Apollonian Circles algorithm, significant improvements are achieved in capturing UAVs that follow simple evasion strategies. To further enhance decision-making precision, we incorporate a DRL algorithm to facilitate more accurate strategic planning. We also leverage self-play training to continuously optimize the performance of the pursuit UAVs. During the experimental evaluation, we meticulously designed both one-on-one and multi-to-one pursuit-evasion scenarios, customizing the state space, action space, and reward function models for each scenario. Extensive simulations, supported by the PyBullet physics engine, validate the effectiveness of the proposed method. The overall results demonstrate that Boids-PE significantly enhances the efficiency and reliability of UAV pursuit-evasion tasks, providing a practical and robust solution for real-world UAV pursuit-evasion missions.
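The three Boids rules the framework adopts, in their textbook form. Gains and the neighbourhood radius are illustrative; the coupling with Apollonius-circle geometry and the DRL policy is not reproduced here.

import numpy as np

def boids_accel(pos, vel, i, radius=5.0, k_sep=1.5, k_ali=1.0, k_coh=1.0):
    d = np.linalg.norm(pos - pos[i], axis=1)
    nbr = (d > 0) & (d < radius)                     # neighbours of UAV i
    if not np.any(nbr):
        return np.zeros(pos.shape[1])
    sep = np.sum((pos[i] - pos[nbr]) / d[nbr, None] ** 2, axis=0)  # repel close UAVs
    ali = np.mean(vel[nbr], axis=0) - vel[i]                       # match velocities
    coh = np.mean(pos[nbr], axis=0) - pos[i]                       # steer to centroid
    return k_sep * sep + k_ali * ali + k_coh * coh

pos = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
vel = np.array([[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]])
a0 = boids_accel(pos, vel, i=0)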
The overall performance of multi-robot collaborative systems is significantly affected by multi-robot task allocation. To improve the effectiveness, robustness, and safety of multi-robot collaborative systems, a multimodal multi-objective evolutionary algorithm based on deep reinforcement learning is proposed in this paper. The improved multimodal multi-objective evolutionary algorithm is used to solve multi-robot task allocation problems. Moreover, a deep reinforcement learning strategy is used in the last generation to provide a high-quality path for each assigned robot in an end-to-end manner. Comparisons with three popular multimodal multi-objective evolutionary algorithms on three different scenarios of multi-robot task allocation problems are carried out to verify the performance of the proposed algorithm. The experimental test results show that the proposed algorithm can generate sufficient equivalent schemes to improve the availability and robustness of multi-robot collaborative systems in uncertain environments, and also produces the best scheme to improve the overall task execution efficiency of multi-robot collaborative systems.
Deep reinforcement learning (DRL) has demonstrated significant potential in industrial manufacturing domains such as workshop scheduling and energy system management. However, due to the model's inherent uncertainty, rigorous validation is requisite for its application in real-world tasks. Specific tests may reveal inadequacies in the performance of pre-trained DRL models, while the "black-box" nature of DRL poses a challenge for testing model behavior. We propose a novel performance improvement framework based on probabilistic automata, which aims to proactively identify and correct critical vulnerabilities of DRL systems, so that the performance of DRL models in real tasks can be improved with minimal model modifications. First, a probabilistic automaton is constructed from the historical trajectories of the DRL system by abstracting the state to generate probabilistic decision-making units (PDMUs), and a reverse breadth-first search (BFS) method is used to identify the key PDMU-action pairs that have the greatest impact on adverse outcomes. This process relies only on the state-action sequence and final result of each trajectory. Then, under each key PDMU, we search for the new action that has the greatest impact on favorable results. Finally, the key PDMU, the undesirable action, and the new action are encapsulated as monitors to guide the DRL system toward more favorable results through real-time monitoring and correction mechanisms. Evaluations in two standard reinforcement learning environments and three actual job scheduling scenarios confirmed the effectiveness of the method, providing certain guarantees for the deployment of DRL models in real-world applications.
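A sketch of the trajectory-abstraction step only: states are mapped to discrete units by an abstraction function, and for each (unit, action) pair the fraction of containing trajectories that end badly is estimated; the highest-rate pairs are the candidates a monitor would target. The abstraction, the ranking by raw adverse rate, and all names are placeholder assumptions; the paper's reverse-BFS ranking over the automaton is not shown.

from collections import defaultdict

def key_pairs(trajectories, abstract, top_k=3):
    """trajectories: list of (state_action_seq, failed) tuples."""
    seen = defaultdict(int)          # trajectories containing the pair
    bad = defaultdict(int)           # ... of which ended adversely
    for seq, failed in trajectories:
        pairs = {(abstract(s), a) for s, a in seq}   # PDMU-action pairs
        for p in pairs:
            seen[p] += 1
            bad[p] += int(failed)
    rate = {p: bad[p] / seen[p] for p in seen}
    return sorted(rate, key=rate.get, reverse=True)[:top_k]

# toy usage: 1-D states bucketed into units of width 0.5
trajs = [([(0.2, "a"), (0.7, "b")], True), ([(0.2, "a"), (1.4, "b")], False)]
worst = key_pairs(trajs, abstract=lambda s: round(s / 0.5))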
By leveraging the high maneuverability of the unmanned aerial vehicle (UAV) and the expansive coverage of the intelligent reflecting surface (IRS), a multi-IRS-assisted UAV communication system aimed at maximizing the sum data rate of all users is investigated in this paper. This is achieved through the joint optimization of the trajectory and transmit beamforming of the UAV, as well as the passive phase shift of the IRS. Nevertheless, the initial problem exhibits a high degree of non-convexity, posing challenges for conventional mathematical optimization techniques in delivering solutions that are both quick and efficient while maintaining low complexity. To address this issue, a novel scheme called the deep reinforcement learning (DRL)-based enhanced cooperative reflection network (DCRN) is proposed. This scheme effectively identifies optimal strategies, with a long short-term memory (LSTM) network improving algorithm convergence by extracting hidden state information. Simulation results demonstrate that the proposed scheme outperforms the baseline scheme, manifesting substantial enhancements in sum rate and superior performance.