We use the advanced proximal policy optimization (PPO) reinforcement learning algorithm to optimize a stochastic control strategy for speed control of a "model-free" quadrotor. The quadrotor is controlled by four learned neural networks, which directly map the system states to control commands in an end-to-end style. By introducing an integral compensator into the actor-critic framework, the speed-tracking accuracy and robustness are greatly enhanced. In addition, a two-phase learning scheme comprising both offline and online learning is developed for practical use. A model with strong generalization ability is learned in the offline phase; the flight policy is then continuously optimized in the online learning phase. Finally, the performance of the proposed algorithm is compared with that of a traditional PID controller.
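To make the integral-compensator idea concrete, here is a minimal sketch in Python: the observed state fed to the actor-critic networks is augmented with a clipped integral of the velocity-tracking error. The gain, clip limit and dimensions are illustrative assumptions, not values from the paper.

```python
import numpy as np

class IntegralCompensator:
    """Augments the raw state with an accumulated velocity-tracking error.

    A sketch only: `gain` and `limit` (anti-windup clip) are assumed
    values, not taken from the paper.
    """

    def __init__(self, gain: float = 0.1, limit: float = 5.0):
        self.gain = gain
        self.limit = limit
        self.integral = np.zeros(3)  # accumulated error per velocity axis

    def reset(self) -> None:
        self.integral[:] = 0.0

    def augment(self, state: np.ndarray, v_ref: np.ndarray,
                v_meas: np.ndarray, dt: float) -> np.ndarray:
        # Accumulate the tracking error, clipping to avoid windup.
        self.integral = np.clip(self.integral + (v_ref - v_meas) * dt,
                                -self.limit, self.limit)
        # Actor and critic both see [state, scaled integral error].
        return np.concatenate([state, self.gain * self.integral])

comp = IntegralCompensator()
obs = comp.augment(np.zeros(9), np.array([1.0, 0.0, 0.0]), np.zeros(3), dt=0.02)
print(obs.shape)  # (12,)
```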
The scale of ground-to-air confrontation task assignment is large, and many concurrent task assignments and random events must be handled. When existing task assignment methods are applied to ground-to-air confrontation, they are inefficient on complex tasks and suffer from interaction conflicts in multiagent systems. This study proposes a multiagent architecture based on one general agent with multiple narrow agents (OGMN) to reduce task assignment conflicts. Considering the slow speed of traditional dynamic task assignment algorithms, this paper proposes the proximal policy optimization for task assignment of general and narrow agents (PPO-TAGNA) algorithm. Based on the idea of the optimal assignment strategy and combined with the training framework of deep reinforcement learning (DRL), the algorithm adds a multi-head attention mechanism and a stage reward mechanism to the bilateral band-clipping PPO algorithm to address low training efficiency. Finally, simulation experiments are carried out on a digital battlefield. The OGMN architecture combined with the PPO-TAGNA algorithm obtains higher rewards faster and has a higher win ratio. An analysis of agent behavior verifies the efficiency, superiority and rationality of the method's resource utilization.
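As a sketch of how a multi-head attention mechanism can sit in front of a PPO policy head over a variable set of candidate tasks (all dimensions are assumptions; PyTorch is used for illustration):

```python
import torch
import torch.nn as nn

class TaskAttentionEncoder(nn.Module):
    """Encodes a variable-length set of task features with multi-head
    self-attention before the PPO policy head. Dimensions are illustrative."""

    def __init__(self, task_dim: int = 16, embed_dim: int = 64, heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(task_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

    def forward(self, tasks: torch.Tensor) -> torch.Tensor:
        # tasks: (batch, n_tasks, task_dim)
        x = self.embed(tasks)
        attended, _ = self.attn(x, x, x)   # self-attention over tasks
        return attended.mean(dim=1)        # pooled summary for the policy

# Example: a batch of 2 scenarios, each with 5 candidate tasks.
enc = TaskAttentionEncoder()
summary = enc(torch.randn(2, 5, 16))
print(summary.shape)  # torch.Size([2, 64])
```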
To guarantee the heterogeneous delay requirements of diverse vehicular services, it is necessary to design a fully cooperative policy for both Vehicle-to-Infrastructure (V2I) and Vehicle-to-Vehicle (V2V) links. This paper investigates the reduction of the delay in edge information sharing for V2V links while satisfying the delay requirements of the V2I links. Specifically, a mean delay minimization problem and a maximum individual delay minimization problem are formulated to improve the global network performance and ensure the fairness of a single user, respectively. A multi-agent reinforcement learning framework is designed to solve these two problems, in which a new reward function is proposed to evaluate the utilities of the two optimization objectives in a unified framework. Thereafter, a proximal policy optimization approach is proposed to enable each V2V user to learn its policy using the shared global network reward. The effectiveness of the proposed approach is validated by comparing the obtained results with those of other baseline approaches through extensive simulation experiments.
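A hedged sketch of a reward that unifies the two objectives (mean V2V delay for global performance, worst-case V2V delay for fairness) while penalizing V2I deadline violations; the weights and penalty are assumed, not taken from the paper:

```python
import numpy as np

def shared_global_reward(v2v_delays, v2i_delays, v2i_deadline,
                         weight=0.5, penalty=10.0):
    """Illustrative unified reward: trades off the mean V2V delay against
    the worst-case V2V delay, and penalizes V2I deadline violations.
    Every V2V agent receives the same scalar."""
    v2v = np.asarray(v2v_delays, dtype=float)
    mean_term = v2v.mean()                 # global performance objective
    worst_term = v2v.max()                 # fairness objective
    violations = np.sum(np.asarray(v2i_delays) > v2i_deadline)
    # Lower delay -> higher reward.
    return -(weight * mean_term + (1.0 - weight) * worst_term) - penalty * violations

print(shared_global_reward([3.0, 5.5, 4.2], [1.0, 2.1], v2i_deadline=2.0))
```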
To solve the problem of multi-target hunting by an unmanned surface vehicle (USV) fleet, a hunting algorithm based on multi-agent reinforcement learning is proposed. First, the hunting environment and a kinematic model without boundary constraints are built, and the criteria for successful target capture are given. Then, the cooperative hunting problem of a USV fleet is modeled as a decentralized partially observable Markov decision process (Dec-POMDP), and a distributed partially observable multi-target hunting proximal policy optimization (DPOMH-PPO) algorithm applicable to USVs is proposed. In addition, an observation model, a reward function and an action space applicable to multi-target hunting tasks are designed. To deal with the dynamically changing dimension of the observation features produced by partially observable systems, a feature embedding block is proposed: by combining the two feature compression methods of column-wise max pooling (CMP) and column-wise average pooling (CAP), an observation feature encoding is established. Finally, the centralized-training, decentralized-execution framework is adopted to train the hunting strategy; each USV in the fleet shares the same policy and performs actions independently. Simulation experiments verify the effectiveness of the DPOMH-PPO algorithm in test scenarios with different numbers of USVs. Moreover, the advantages of the proposed model are comprehensively analyzed in terms of algorithm performance, transfer across task scenarios and self-organization capability after damage, verifying the potential of DPOMH-PPO for deployment and application in real environments.
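The feature embedding idea is straightforward to sketch: however many entities are currently observed, column-wise max pooling (CMP) and column-wise average pooling (CAP) compress them to a fixed-length encoding. The feature layout below is an assumption:

```python
import numpy as np

def encode_observations(entity_features: np.ndarray) -> np.ndarray:
    """Compresses a variable number of observed entities (rows) into a
    fixed-length vector by concatenating column-wise max pooling (CMP)
    and column-wise average pooling (CAP)."""
    cmp_vec = entity_features.max(axis=0)   # strongest signal per feature
    cap_vec = entity_features.mean(axis=0)  # aggregate signal per feature
    return np.concatenate([cmp_vec, cap_vec])

# 3 observed targets now, 5 next step: both encode to the same length.
print(encode_observations(np.random.rand(3, 8)).shape)  # (16,)
print(encode_observations(np.random.rand(5, 8)).shape)  # (16,)
```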
Federated learning enables data owners in the Internet of Things (IoT) to collaborate in training models without sharing private data, creating new business opportunities for building a data market. However, some problems remain in practical federated learning applications. Blockchain has the characteristics of decentralization, distribution and security; blockchain-enabled federated learning further improves the security and performance of model training while expanding the application scope of federated learning. Blockchain also has natural financial attributes that help establish a federated learning data market. However, the data for federated learning tasks may be distributed across a large number of resource-constrained IoT devices with different computing, communication and storage resources, and the data quality of each device may also vary. Therefore, how to effectively select clients with the data required for a federated learning task is a research hotspot. In this paper, a two-stage client selection scheme for blockchain-enabled federated learning is proposed. It first selects clients that satisfy the federated learning task's requirements through attribute-based encryption, protecting the attribute privacy of clients; blockchain nodes then select some of these clients for local model aggregation using a proximal policy optimization algorithm. Experiments show that the model performance of the two-stage client selection scheme is higher than that of other client selection algorithms when some clients are offline and data quality is poor.
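A minimal sketch of the second selection stage, with a random linear scorer standing in for the trained proximal policy optimization policy; the per-client features and top-k rule are illustrative assumptions:

```python
import numpy as np

def select_clients(client_features: np.ndarray, k: int) -> np.ndarray:
    """Given per-client features (e.g. estimated data quality,
    availability, compute), keep the top-k clients ranked by a learned
    score. A random linear scorer stands in for the PPO policy output."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=client_features.shape[1])
    scores = client_features @ w            # placeholder for policy output
    return np.argsort(scores)[::-1][:k]     # indices of chosen clients

features = np.random.rand(10, 3)  # 10 candidate clients, 3 features each
print(select_clients(features, k=4))
```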
Recent years have seen a significant increase in the adoption of electric vehicles and in investments in electric vehicle charging infrastructure and rooftop photovoltaic installations. The ability to delay electric vehicle charging provides inherent flexibility that can be used to compensate for the intermittency of photovoltaic generation and to optimize against fluctuating electricity prices. Exploiting this flexibility, however, requires smart control algorithms capable of handling uncertainties in photovoltaic generation, electric vehicle energy demand and user behaviour. This paper proposes a control framework combining the advantages of reinforcement learning and rule-based control to coordinate the charging of a fleet of electric vehicles in an office building. The control objective is to maximize self-consumption of locally generated electricity and, consequently, minimize the electricity cost of electric vehicle charging. The performance of the proposed framework is evaluated on a real-world data set from EnergyVille, a Belgian research institute. Simulation results show that the proposed control framework achieves a 62.5% electricity cost reduction compared to a business-as-usual, passive charging strategy. In addition, only a 5% performance gap remains in comparison to a theoretical near-optimal strategy that assumes perfect knowledge of the required energy and user behaviour of each electric vehicle.
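The combination of learning and rules can be sketched as a rule-based backstop wrapped around the learned action: the rule layer overrides the reinforcement learning proposal whenever it would risk an unfinished charge by departure. Field names and values are illustrative, not from the paper's implementation:

```python
def safe_charging_action(rl_power_kw: float, ev: dict) -> float:
    """Clamp an RL-proposed charging power so the EV still finishes on
    time and the charger rating is respected. A sketch of the
    rule-based layer, with assumed field names."""
    hours_left = ev["departure_h"] - ev["now_h"]
    min_power = ev["energy_needed_kwh"] / max(hours_left, 1e-6)
    # Never charge below the rate needed to finish by departure,
    # and never above the charger's rating.
    return min(max(rl_power_kw, min_power), ev["max_power_kw"])

ev = {"now_h": 9.0, "departure_h": 17.0,
      "energy_needed_kwh": 32.0, "max_power_kw": 11.0}
print(safe_charging_action(rl_power_kw=2.0, ev=ev))  # raised to 4.0 kW
```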
In modern Beyond-Visual-Range (BVR) aerial combat, unmanned loyal wingmen are pivotal, yet their autonomous capabilities are limited. Our study introduces an advanced control algorithm based on hierarchical reinforcement learning to enhance these capabilities for critical missions like target search, positioning, and relay guidance. Structured as a dual-layer model, the algorithm's lower layer manages basic aircraft maneuvers for optimal flight, while the upper layer processes battlefield dynamics and issues precise navigational commands. This approach enables accurate navigation and effective reconnaissance for the lead aircraft. Notably, our Hierarchical Prior-augmented Proximal Policy Optimization (HPE-PPO) algorithm employs prior-based training with prior-free execution, accelerating target-positioning training and ensuring robust target reacquisition. The paper also improves missile relay guidance, promoting effective guidance. By integrating this system with a human-piloted lead aircraft, the paper proposes a potent solution for cooperative aerial warfare. Rigorous experiments demonstrate enhanced survivability and efficiency of the loyal wingmen, marking a significant contribution to Unmanned Aerial Vehicle (UAV) formation control research. This advancement is poised to drive substantial interest and progress in related technological fields.
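A loose sketch of the dual-layer split, with a geometric stand-in for the learned upper-layer policy and a proportional law standing in for the lower-layer maneuver controller (neither is the paper's implementation):

```python
import numpy as np

def upper_layer_command(wingman_pos: np.ndarray, target_est: np.ndarray) -> float:
    """Upper layer: map battlefield state to a navigation goal, here
    simply the bearing to the estimated target position. This is a
    stand-in for the learned HPE-PPO policy."""
    diff = target_est - wingman_pos
    return np.arctan2(diff[1], diff[0])

def lower_layer_maneuver(heading: float, goal_heading: float,
                         gain: float = 0.5) -> float:
    """Lower layer: track the commanded heading with a proportional
    law on the wrapped heading error (illustrative only)."""
    err = (goal_heading - heading + np.pi) % (2 * np.pi) - np.pi
    return gain * err  # commanded turn rate

goal = upper_layer_command(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
print(lower_layer_maneuver(heading=0.0, goal_heading=goal))
```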
SATech-01 is an experimental satellite for space science exploration and on-orbit demonstration of advanced technologies. The satellite is equipped with 16 experimental payloads and supports multiple working modes to meet the observation requirements of the various payloads. Due to the limitations of the platform's power supply and data storage systems, proposing reasonable mission planning schemes to improve the scientific return of the payloads is a critical issue. In this article, we formulate the integrated task scheduling of SATech-01 as a multi-objective optimization problem and propose a novel Fair Integrated Scheduling with Proximal Policy Optimization (FIS-PPO) algorithm to solve it. We use multiple decision heads to generate decisions for each task and design an action mask to ensure that the schedule meets the platform constraints. Experimental results show that FIS-PPO pushes the capability of the platform to its limit, improving overall observation efficiency by 31.5% compared to the rule-based plans currently used. Moreover, fairness is considered in the reward design, and our method achieves much better performance in terms of equal task opportunities. Because of its low computational complexity, the task scheduling algorithm has the potential to be deployed directly on board for real-time task scheduling in future space projects.
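The action-mask step is easy to illustrate: tasks that would violate platform power or storage constraints receive -inf logits, so the policy can never select them. A generic sketch, not SATech-01's actual constraint model:

```python
import torch

def masked_task_logits(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Apply an action mask before sampling: infeasible tasks get -inf
    logits and therefore zero probability under softmax."""
    return logits.masked_fill(~mask, float("-inf"))

logits = torch.tensor([1.2, 0.3, -0.5, 2.0])
feasible = torch.tensor([True, False, True, False])  # from a constraint check
probs = torch.softmax(masked_task_logits(logits, feasible), dim=-1)
print(probs)  # zero probability on the masked tasks
```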
With the increasing penetration of renewable energy, power grid operators are observing fast and large fluctuations in power and voltage profiles on a daily basis. Fast and accurate control actions derived in real time are vital to ensure system security and economics. To this end, solving alternating current (AC) optimal power flow (OPF) with operational constraints remains an important yet challenging optimization problem for secure and economic operation of the power grid. This paper adopts a novel method to derive fast OPF solutions using a state-of-the-art deep reinforcement learning (DRL) algorithm, which can greatly assist power grid operators in making rapid and effective decisions. The presented method adopts imitation learning to generate initial weights for the neural network (NN), and a proximal policy optimization algorithm to train and test stable and robust artificial intelligence (AI) agents. Training and testing procedures are conducted on the IEEE 14-bus and the Illinois 200-bus systems. The results show the effectiveness of the method, with significant potential for assisting power grid operators in real-time operations.
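The imitation-learning warm start can be sketched as behavior cloning: regress expert OPF set-points (e.g. from a conventional solver) onto grid states before PPO fine-tuning begins. Dimensions and data below are synthetic placeholders:

```python
import torch
import torch.nn as nn

# Behavior cloning to initialize the policy before PPO fine-tuning.
policy = nn.Sequential(nn.Linear(28, 64), nn.ReLU(), nn.Linear(64, 10))
states = torch.randn(256, 28)          # stand-in for historical grid states
expert_actions = torch.randn(256, 10)  # stand-in for solver set-points

opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(200):
    loss = nn.functional.mse_loss(policy(states), expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
# The resulting weights seed the PPO actor instead of a random init.
```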
Modern power systems are experiencing larger fluctuations and more uncertainties caused by the increased penetration of renewable energy sources (RESs) and power electronics equipment. Therefore, fast and accurate corrective control actions in real time are needed to ensure system security and economics. This paper presents a novel method to derive real-time alternating current (AC) optimal power flow (OPF) solutions considering uncertainties, including varying renewable energy and topology changes, by using a state-of-the-art deep reinforcement learning (DRL) algorithm, which can effectively assist grid operators in making rapid and effective real-time decisions. The presented DRL-based approach first adopts a supervised learning method from deep learning to generate good initial weights for the neural networks, and then the proximal policy optimization (PPO) algorithm is applied to train and test the artificial intelligence (AI) agents for stable and robust performance. An ancillary classifier is designed to identify the feasibility of the AC OPF problem. Case studies conducted on the Illinois 200-bus system with wind generation variation and N-1 topology changes validate the effectiveness of the proposed method and demonstrate its great potential in promoting sustainable energy integration into the power system.
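For reference, the clipped surrogate objective at the core of PPO, which both this and the preceding paper build on; eps = 0.2 is the common default, not necessarily the value used in these works:

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantage: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate: the probability ratio is clipped
    to [1-eps, 1+eps] so a single update cannot move the policy too far."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

loss = ppo_clip_loss(torch.tensor([-1.1, -0.2]),
                     torch.tensor([-1.0, -0.4]),
                     torch.tensor([0.5, -1.0]))
print(loss)
```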
With the booming of electric vehicles (EVs) across the world, their increasing charging demands pose challenges to urban distribution networks. In particular, due to the implementation of time-of-use prices, the charging behaviors of household EVs are concentrated in low-cost periods, generating new load peaks and affecting the secure operation of medium- and low-voltage grids. This problem is particularly acute in many old communities with relatively poor electricity infrastructure. In this paper, a novel two-stage charging scheduling scheme based on deep reinforcement learning is proposed to improve power quality and simultaneously achieve optimal charging scheduling of household EVs in an active distribution network (ADN) during the valley period. In the first stage, the optimal charging profiles of charging stations are determined by solving the optimal power flow with the objective of eliminating peak-valley load differences. In the second stage, an intelligent agent based on the proximal policy optimization algorithm is developed to dispatch the household EVs sequentially within the low-cost period, considering the discrete nature of their arrivals. Through the powerful approximation of neural networks, the challenge of imperfect knowledge is tackled effectively during the charging scheduling process. Finally, numerical results demonstrate that the proposed scheme greatly relieves peak-valley differences and improves voltage quality in the ADN.
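The second stage can be sketched at hour resolution: when an EV arrives, pick the start time inside the valley window that best fills the remaining gap to the stage-one target profile. A greedy rule stands in for the learned proximal policy optimization agent, and all numbers are illustrative:

```python
import numpy as np

def dispatch_arrival(target_profile: np.ndarray, scheduled: np.ndarray,
                     ev_kw: float, duration_h: int):
    """Choose the start hour whose charging window covers the largest
    remaining gap to the stage-one target profile, then book the EV."""
    gap = target_profile - scheduled
    best = max(range(len(gap) - duration_h + 1),
               key=lambda t: gap[t:t + duration_h].sum())
    scheduled[best:best + duration_h] += ev_kw
    return best, scheduled

target = np.array([30.0, 40.0, 40.0, 30.0, 20.0])  # valley hours 0-4
sched = np.zeros(5)
print(dispatch_arrival(target, sched, ev_kw=7.0, duration_h=2))
```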
The optimal dispatch methods of integrated energy systems (IESs) currently struggle to address the uncertainties resulting from renewable energy generation and energy demand. Moreover, the increasing intensity of the greenhouse effect renders the reduction of IES carbon emissions a priority. To address these issues, a deep reinforcement learning (DRL)-based method is proposed to optimize the low-carbon economic dispatch model of an electricity-heat-gas IES. In the DRL framework, the optimal dispatch model of the IES is formulated as a Markov decision process (MDP). A reward function based on the reward-penalty ladder-type carbon trading mechanism (RPLT-CTM) is introduced to enable the DRL agents to learn more effective dispatch strategies. Moreover, a distributed proximal policy optimization (DPPO) algorithm, a novel policy-based DRL algorithm, is employed to train the DRL agents; its multithreaded architecture enhances the agents' exploration ability in complex environments. Experimental results illustrate that the proposed DPPO-based IES dispatch method can mitigate carbon emissions and reduce the total economic cost. The RPLT-CTM-based reward function outperforms the CTM-based methods, providing a 4.42% decrease in operating cost and a 6.41% decrease in carbon emissions. Furthermore, the superiority and computational efficiency of DPPO over other DRL-based methods are demonstrated by decreases of more than 1.53% and 3.23% in the operating cost and carbon emissions of the IES, respectively.
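The ladder-type mechanism is easy to sketch: emissions above the free quota are priced in tiers whose unit price escalates per tier, while emissions below the quota earn a reward. All prices and tier widths below are assumptions, not the paper's parameters:

```python
def ladder_carbon_cost(emissions: float, quota: float, base_price: float = 0.25,
                       step: float = 100.0, escalation: float = 0.25,
                       reward_rate: float = 0.15) -> float:
    """Reward-penalty ladder-type carbon trading cost (RPLT-CTM sketch):
    excess emissions are priced in progressively more expensive tiers of
    width `step`; a shortfall below quota earns a reward (negative cost)."""
    excess = emissions - quota
    if excess <= 0:
        return -reward_rate * (-excess)  # negative cost = reward
    cost, tier = 0.0, 0
    while excess > 0:
        chunk = min(excess, step)
        cost += chunk * base_price * (1 + escalation * tier)
        excess -= chunk
        tier += 1
    return cost

print(ladder_carbon_cost(emissions=1250.0, quota=1000.0))  # 75.0
```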
With the rising extension of renewable energies, intraday electricity markets have recorded growing popularity amongst traders as well as electric utilities to cope with the induced volatility of the energy supply. Through their short trading horizon and continuous nature, the intraday markets offer the ability to adjust trading decisions from the day-ahead market or reduce trading risk on short notice. Producers of renewable energies utilize the intraday market to lower their forecast risk by modifying their provided capacities based on current forecasts. However, the market dynamics are complex because the power grids have to remain stable and electricity is only partly storable. Consequently, robust and intelligent trading strategies are required that are capable of operating in the intraday market. In this work, we propose a novel autonomous trading approach based on Deep Reinforcement Learning (DRL) algorithms as a possible solution. For this purpose, we model intraday trading as a Markov Decision Process (MDP) and employ the Proximal Policy Optimization (PPO) algorithm as our DRL approach. A simulation framework is introduced that enables trading of the continuous intraday price at a resolution of one-minute steps. We test our framework in a case study from the perspective of a wind park operator; in addition to general trade information, we include both price and wind forecasts. On a test scenario of German intraday trading results from 2018, we outperform multiple baselines by at least 45.24%, showing the advantage of the DRL algorithm. We also discuss limitations and enhancements of the DRL agent to increase its performance in future work.
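A toy version of the one-minute trading MDP can look as follows; the prices and settlement rule are synthetic stand-ins, not German market data or the paper's simulation framework:

```python
import numpy as np

class IntradayEnv:
    """Toy one-minute intraday trading MDP: the state holds price and
    wind-forecast-error features plus the open position; the action is
    the traded volume in MWh."""

    def __init__(self, prices, forecast_error):
        self.prices, self.err = prices, forecast_error
        self.t, self.position = 0, 0.0

    def step(self, volume_mwh: float):
        price = self.prices[self.t]
        self.position += volume_mwh
        # Negative cash flow when buying, positive when selling.
        reward = -volume_mwh * price
        self.t += 1
        done = self.t >= len(self.prices)
        if done:  # settle the imbalance left by the forecast error
            reward -= abs(self.position - self.err) * price * 1.5
        state = np.array([price, self.err, self.position])
        return state, reward, done

env = IntradayEnv(prices=[42.0, 45.0, 39.0], forecast_error=1.0)
print(env.step(0.5))
```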
This paper develops a real-time control method based on deep reinforcement learning aimed at determining the optimal control actions to maintain a sufficient secure operating limit. The secure operating limit refers to the most stressed pre-contingency operating point of an electric power system that can withstand a set of credible contingencies without violating stability criteria. The developed deep reinforcement learning method uses a hybrid control scheme that is capable of simultaneously adjusting both discrete and continuous action variables. The performance is evaluated on a modified version of the Nordic32 test system. The results show that the developed method quickly learns an effective control policy that ensures a sufficient secure operating limit for a range of different system scenarios. The performance is also compared to a control based on a rule-based look-up table and to a deep reinforcement learning control adapted for discrete action spaces. The hybrid deep reinforcement learning control achieved significantly better results on all of the defined test sets, indicating that the ability to adjust both discrete and continuous action variables results in a more flexible and efficient control policy.
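The hybrid action scheme can be sketched as a policy with two heads on a shared encoding: a categorical head for discrete actions (e.g. switching) and a Gaussian head for continuous ones (e.g. set-points). Sizes and the mapping to power-system controls are illustrative assumptions:

```python
import torch
import torch.nn as nn

class HybridPolicy(nn.Module):
    """Sketch of a hybrid action head: a categorical distribution for
    discrete actions and a Gaussian for continuous ones, sampled jointly
    from a shared state encoding."""

    def __init__(self, obs_dim: int = 32, n_discrete: int = 4,
                 n_continuous: int = 3):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.disc_head = nn.Linear(64, n_discrete)
        self.mu_head = nn.Linear(64, n_continuous)
        self.log_std = nn.Parameter(torch.zeros(n_continuous))

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        disc = torch.distributions.Categorical(logits=self.disc_head(h))
        cont = torch.distributions.Normal(self.mu_head(h), self.log_std.exp())
        return disc.sample(), cont.sample()

d, c = HybridPolicy()(torch.randn(1, 32))
print(d, c)  # one discrete choice plus a vector of continuous set-points
```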