Avatars, as promising digital representations and service assistants of users in Metaverses, can enable drivers and passengers to immerse themselves in 3D virtual services and spaces of UAV-assisted vehicular Metavers...Avatars, as promising digital representations and service assistants of users in Metaverses, can enable drivers and passengers to immerse themselves in 3D virtual services and spaces of UAV-assisted vehicular Metaverses. However, avatar tasks include a multitude of human-to-avatar and avatar-to-avatar interactive applications, e.g., augmented reality navigation,which consumes intensive computing resources. It is inefficient and impractical for vehicles to process avatar tasks locally. Fortunately, migrating avatar tasks to the nearest roadside units(RSU)or unmanned aerial vehicles(UAV) for execution is a promising solution to decrease computation overhead and reduce task processing latency, while the high mobility of vehicles brings challenges for vehicles to independently perform avatar migration decisions depending on current and future vehicle status. To address these challenges, in this paper, we propose a novel avatar task migration system based on multi-agent deep reinforcement learning(MADRL) to execute immersive vehicular avatar tasks dynamically. Specifically, we first formulate the problem of avatar task migration from vehicles to RSUs/UAVs as a partially observable Markov decision process that can be solved by MADRL algorithms. We then design the multi-agent proximal policy optimization(MAPPO) approach as the MADRL algorithm for the avatar task migration problem. To overcome slow convergence resulting from the curse of dimensionality and non-stationary issues caused by shared parameters in MAPPO, we further propose a transformer-based MAPPO approach via sequential decision-making models for the efficient representation of relationships among agents. Finally, to motivate terrestrial or non-terrestrial edge servers(e.g., RSUs or UAVs) to share computation resources and ensure traceability of the sharing records, we apply smart contracts and blockchain technologies to achieve secure sharing management. Numerical results demonstrate that the proposed approach outperforms the MAPPO approach by around 2% and effectively reduces approximately 20% of the latency of avatar task execution in UAV-assisted vehicular Metaverses.展开更多
The gasoline inline blending process has widely used real-time optimization techniques to achieve optimization objectives,such as minimizing the cost of production.However,the effectiveness of real-time optimization i...The gasoline inline blending process has widely used real-time optimization techniques to achieve optimization objectives,such as minimizing the cost of production.However,the effectiveness of real-time optimization in gasoline blending relies on accurate blending models and is challenged by stochastic disturbances.Thus,we propose a real-time optimization algorithm based on the soft actor-critic(SAC)deep reinforcement learning strategy to optimize gasoline blending without relying on a single blending model and to be robust against disturbances.Our approach constructs the environment using nonlinear blending models and feedstocks with disturbances.The algorithm incorporates the Lagrange multiplier and path constraints in reward design to manage sparse product constraints.Carefully abstracted states facilitate algorithm convergence,and the normalized action vector in each optimization period allows the agent to generalize to some extent across different target production scenarios.Through these well-designed components,the algorithm based on the SAC outperforms real-time optimization methods based on either nonlinear or linear programming.It even demonstrates comparable performance with the time-horizon based real-time optimization method,which requires knowledge of uncertainty models,confirming its capability to handle uncertainty without accurate models.Our simulation illustrates a promising approach to free real-time optimization of the gasoline blending process from uncertainty models that are difficult to acquire in practice.展开更多
In the traditional well log depth matching tasks,manual adjustments are required,which means significantly labor-intensive for multiple wells,leading to low work efficiency.This paper introduces a multi-agent deep rei...In the traditional well log depth matching tasks,manual adjustments are required,which means significantly labor-intensive for multiple wells,leading to low work efficiency.This paper introduces a multi-agent deep reinforcement learning(MARL)method to automate the depth matching of multi-well logs.This method defines multiple top-down dual sliding windows based on the convolutional neural network(CNN)to extract and capture similar feature sequences on well logs,and it establishes an interaction mechanism between agents and the environment to control the depth matching process.Specifically,the agent selects an action to translate or scale the feature sequence based on the double deep Q-network(DDQN).Through the feedback of the reward signal,it evaluates the effectiveness of each action,aiming to obtain the optimal strategy and improve the accuracy of the matching task.Our experiments show that MARL can automatically perform depth matches for well-logs in multiple wells,and reduce manual intervention.In the application to the oil field,a comparative analysis of dynamic time warping(DTW),deep Q-learning network(DQN),and DDQN methods revealed that the DDQN algorithm,with its dual-network evaluation mechanism,significantly improves performance by identifying and aligning more details in the well log feature sequences,thus achieving higher depth matching accuracy.展开更多
Solving constrained multi-objective optimization problems with evolutionary algorithms has attracted considerable attention.Various constrained multi-objective optimization evolutionary algorithms(CMOEAs)have been dev...Solving constrained multi-objective optimization problems with evolutionary algorithms has attracted considerable attention.Various constrained multi-objective optimization evolutionary algorithms(CMOEAs)have been developed with the use of different algorithmic strategies,evolutionary operators,and constraint-handling techniques.The performance of CMOEAs may be heavily dependent on the operators used,however,it is usually difficult to select suitable operators for the problem at hand.Hence,improving operator selection is promising and necessary for CMOEAs.This work proposes an online operator selection framework assisted by Deep Reinforcement Learning.The dynamics of the population,including convergence,diversity,and feasibility,are regarded as the state;the candidate operators are considered as actions;and the improvement of the population state is treated as the reward.By using a Q-network to learn a policy to estimate the Q-values of all actions,the proposed approach can adaptively select an operator that maximizes the improvement of the population according to the current state and thereby improve the algorithmic performance.The framework is embedded into four popular CMOEAs and assessed on 42 benchmark problems.The experimental results reveal that the proposed Deep Reinforcement Learning-assisted operator selection significantly improves the performance of these CMOEAs and the resulting algorithm obtains better versatility compared to nine state-of-the-art CMOEAs.展开更多
In this paper,we propose the Two-way Deep Reinforcement Learning(DRL)-Based resource allocation algorithm,which solves the problem of resource allocation in the cognitive downlink network based on the underlay mode.Se...In this paper,we propose the Two-way Deep Reinforcement Learning(DRL)-Based resource allocation algorithm,which solves the problem of resource allocation in the cognitive downlink network based on the underlay mode.Secondary users(SUs)in the cognitive network are multiplexed by a new Power Domain Sparse Code Multiple Access(PD-SCMA)scheme,and the physical resources of the cognitive base station are virtualized into two types of slices:enhanced mobile broadband(eMBB)slice and ultrareliable low latency communication(URLLC)slice.We design the Double Deep Q Network(DDQN)network output the optimal codebook assignment scheme and simultaneously use the Deep Deterministic Policy Gradient(DDPG)network output the optimal power allocation scheme.The objective is to jointly optimize the spectral efficiency of the system and the Quality of Service(QoS)of SUs.Simulation results show that the proposed algorithm outperforms the CNDDQN algorithm and modified JEERA algorithm in terms of spectral efficiency and QoS satisfaction.Additionally,compared with the Power Domain Non-orthogonal Multiple Access(PD-NOMA)slices and the Sparse Code Multiple Access(SCMA)slices,the PD-SCMA slices can dramatically enhance spectral efficiency and increase the number of accessible users.展开更多
Autonomous umanned aerial vehicle(UAV) manipulation is necessary for the defense department to execute tactical missions given by commanders in the future unmanned battlefield. A large amount of research has been devo...Autonomous umanned aerial vehicle(UAV) manipulation is necessary for the defense department to execute tactical missions given by commanders in the future unmanned battlefield. A large amount of research has been devoted to improving the autonomous decision-making ability of UAV in an interactive environment, where finding the optimal maneuvering decisionmaking policy became one of the key issues for enabling the intelligence of UAV. In this paper, we propose a maneuvering decision-making algorithm for autonomous air-delivery based on deep reinforcement learning under the guidance of expert experience. Specifically, we refine the guidance towards area and guidance towards specific point tasks for the air-delivery process based on the traditional air-to-surface fire control methods.Moreover, we construct the UAV maneuvering decision-making model based on Markov decision processes(MDPs). Specifically, we present a reward shaping method for the guidance towards area and guidance towards specific point tasks using potential-based function and expert-guided advice. The proposed algorithm could accelerate the convergence of the maneuvering decision-making policy and increase the stability of the policy in terms of the output during the later stage of training process. The effectiveness of the proposed maneuvering decision-making policy is illustrated by the curves of training parameters and extensive experimental results for testing the trained policy.展开更多
As the demands of massive connections and vast coverage rapidly grow in the next wireless communication networks, rate splitting multiple access(RSMA) is considered to be the new promising access scheme since it can p...As the demands of massive connections and vast coverage rapidly grow in the next wireless communication networks, rate splitting multiple access(RSMA) is considered to be the new promising access scheme since it can provide higher efficiency with limited spectrum resources. In this paper, combining spectrum splitting with rate splitting, we propose to allocate resources with traffic offloading in hybrid satellite terrestrial networks. A novel deep reinforcement learning method is adopted to solve this challenging non-convex problem. However, the neverending learning process could prohibit its practical implementation. Therefore, we introduce the switch mechanism to avoid unnecessary learning. Additionally, the QoS constraint in the scheme can rule out unsuccessful transmission. The simulation results validates the energy efficiency performance and the convergence speed of the proposed algorithm.展开更多
The Multi-access Edge Cloud(MEC) networks extend cloud computing services and capabilities to the edge of the networks. By bringing computation and storage capabilities closer to end-users and connected devices, MEC n...The Multi-access Edge Cloud(MEC) networks extend cloud computing services and capabilities to the edge of the networks. By bringing computation and storage capabilities closer to end-users and connected devices, MEC networks can support a wide range of applications. MEC networks can also leverage various types of resources, including computation resources, network resources, radio resources,and location-based resources, to provide multidimensional resources for intelligent applications in 5/6G.However, tasks generated by users often consist of multiple subtasks that require different types of resources. It is a challenging problem to offload multiresource task requests to the edge cloud aiming at maximizing benefits due to the heterogeneity of resources provided by devices. To address this issue,we mathematically model the task requests with multiple subtasks. Then, the problem of task offloading of multi-resource task requests is proved to be NP-hard. Furthermore, we propose a novel Dual-Agent Deep Reinforcement Learning algorithm with Node First and Link features(NF_L_DA_DRL) based on the policy network, to optimize the benefits generated by offloading multi-resource task requests in MEC networks. Finally, simulation results show that the proposed algorithm can effectively improve the benefit of task offloading with higher resource utilization compared with baseline algorithms.展开更多
Emerging mobile edge computing(MEC)is considered a feasible solution for offloading the computation-intensive request tasks generated from mobile wireless equipment(MWE)with limited computational resources and energy....Emerging mobile edge computing(MEC)is considered a feasible solution for offloading the computation-intensive request tasks generated from mobile wireless equipment(MWE)with limited computational resources and energy.Due to the homogeneity of request tasks from one MWE during a longterm time period,it is vital to predeploy the particular service cachings required by the request tasks at the MEC server.In this paper,we model a service caching-assisted MEC framework that takes into account the constraint on the number of service cachings hosted by each edge server and the migration of request tasks from the current edge server to another edge server with service caching required by tasks.Furthermore,we propose a multiagent deep reinforcement learning-based computation offloading and task migrating decision-making scheme(MBOMS)to minimize the long-term average weighted cost.The proposed MBOMS can learn the near-optimal offloading and migrating decision-making policy by centralized training and decentralized execution.Systematic and comprehensive simulation results reveal that our proposed MBOMS can converge well after training and outperforms the other five baseline algorithms.展开更多
Existing researches on cyber attackdefense analysis have typically adopted stochastic game theory to model the problem for solutions,but the assumption of complete rationality is used in modeling,ignoring the informat...Existing researches on cyber attackdefense analysis have typically adopted stochastic game theory to model the problem for solutions,but the assumption of complete rationality is used in modeling,ignoring the information opacity in practical attack and defense scenarios,and the model and method lack accuracy.To such problem,we investigate network defense policy methods under finite rationality constraints and propose network defense policy selection algorithm based on deep reinforcement learning.Based on graph theoretical methods,we transform the decision-making problem into a path optimization problem,and use a compression method based on service node to map the network state.On this basis,we improve the A3C algorithm and design the DefenseA3C defense policy selection algorithm with online learning capability.The experimental results show that the model and method proposed in this paper can stably converge to a better network state after training,which is faster and more stable than the original A3C algorithm.Compared with the existing typical approaches,Defense-A3C is verified its advancement.展开更多
To enhance the efficiency and expediency of issuing e-licenses within the power sector, we must confront thechallenge of managing the surging demand for data traffic. Within this realm, the network imposes stringentQu...To enhance the efficiency and expediency of issuing e-licenses within the power sector, we must confront thechallenge of managing the surging demand for data traffic. Within this realm, the network imposes stringentQuality of Service (QoS) requirements, revealing the inadequacies of traditional routing allocation mechanismsin accommodating such extensive data flows. In response to the imperative of handling a substantial influx of datarequests promptly and alleviating the constraints of existing technologies and network congestion, we present anarchitecture forQoS routing optimizationwith in SoftwareDefinedNetwork (SDN), leveraging deep reinforcementlearning. This innovative approach entails the separation of SDN control and transmission functionalities, centralizingcontrol over data forwardingwhile integrating deep reinforcement learning for informed routing decisions. Byfactoring in considerations such as delay, bandwidth, jitter rate, and packet loss rate, we design a reward function toguide theDeepDeterministic PolicyGradient (DDPG) algorithmin learning the optimal routing strategy to furnishsuperior QoS provision. In our empirical investigations, we juxtapose the performance of Deep ReinforcementLearning (DRL) against that of Shortest Path (SP) algorithms in terms of data packet transmission delay. Theexperimental simulation results show that our proposed algorithm has significant efficacy in reducing networkdelay and improving the overall transmission efficiency, which is superior to the traditional methods.展开更多
Policy evaluation(PE)is a critical sub-problem in reinforcement learning,which estimates the value function for a given policy and can be used for policy improvement.However,there still exist some limitations in curre...Policy evaluation(PE)is a critical sub-problem in reinforcement learning,which estimates the value function for a given policy and can be used for policy improvement.However,there still exist some limitations in current PE methods,such as low sample efficiency and local convergence,especially on complex tasks.In this study,a novel PE algorithm called Least-Squares Truncated Temporal-Difference learning(LST2D)is proposed.In LST2D,an adaptive truncation mechanism is designed,which effectively takes advantage of the fast convergence property of Least-Squares Temporal Difference learning and the asymptotic convergence property of Temporal Difference learning(TD).Then,two feature pre-training methods are utilised to improve the approximation ability of LST2D.Furthermore,an Actor-Critic algorithm based on LST2D and pre-trained feature representations(ACLPF)is proposed,where LST2D is integrated into the critic network to improve learning-prediction efficiency.Comprehensive simulation studies were conducted on four robotic tasks,and the corresponding results illustrate the effectiveness of LST2D.The proposed ACLPF algorithm outperformed DQN,ACER and PPO in terms of sample efficiency and stability,which demonstrated that LST2D can be applied to online learning control problems by incorporating it into the actor-critic architecture.展开更多
Image description task is the intersection of computer vision and natural language processing,and it has important prospects,including helping computers understand images and obtaining information for the visually imp...Image description task is the intersection of computer vision and natural language processing,and it has important prospects,including helping computers understand images and obtaining information for the visually impaired.This study presents an innovative approach employing deep reinforcement learning to enhance the accuracy of natural language descriptions of images.Our method focuses on refining the reward function in deep reinforcement learning,facilitating the generation of precise descriptions by aligning visual and textual features more closely.Our approach comprises three key architectures.Firstly,it utilizes Residual Network 101(ResNet-101)and Faster Region-based Convolutional Neural Network(Faster R-CNN)to extract average and local image features,respectively,followed by the implementation of a dual attention mechanism for intricate feature fusion.Secondly,the Transformer model is engaged to derive contextual semantic features from textual data.Finally,the generation of descriptive text is executed through a two-layer long short-term memory network(LSTM),directed by the value and reward functions.Compared with the image description method that relies on deep learning,the score of Bilingual Evaluation Understudy(BLEU-1)is 0.762,which is 1.6%higher,and the score of BLEU-4 is 0.299.Consensus-based Image Description Evaluation(CIDEr)scored 0.998,Recall-Oriented Understudy for Gisting Evaluation(ROUGE)scored 0.552,the latter improved by 0.36%.These results not only attest to the viability of our approach but also highlight its superiority in the realm of image description.Future research can explore the integration of our method with other artificial intelligence(AI)domains,such as emotional AI,to create more nuanced and context-aware systems.展开更多
Traditional optimal scheduling methods are limited to accurate physical models and parameter settings, which aredifficult to adapt to the uncertainty of source and load, and there are problems such as the inability to...Traditional optimal scheduling methods are limited to accurate physical models and parameter settings, which aredifficult to adapt to the uncertainty of source and load, and there are problems such as the inability to make dynamicdecisions continuously. This paper proposed a dynamic economic scheduling method for distribution networksbased on deep reinforcement learning. Firstly, the economic scheduling model of the new energy distributionnetwork is established considering the action characteristics of micro-gas turbines, and the dynamic schedulingmodel based on deep reinforcement learning is constructed for the new energy distribution network system with ahigh proportion of new energy, and the Markov decision process of the model is defined. Secondly, Second, for thechanging characteristics of source-load uncertainty, agents are trained interactively with the distributed networkin a data-driven manner. Then, through the proximal policy optimization algorithm, agents adaptively learn thescheduling strategy and realize the dynamic scheduling decision of the new energy distribution network system.Finally, the feasibility and superiority of the proposed method are verified by an improved IEEE 33-node simulationsystem.展开更多
Formany years,researchers have explored power allocation(PA)algorithms driven bymodels in wireless networks where multiple-user communications with interference are present.Nowadays,data-driven machine learning method...Formany years,researchers have explored power allocation(PA)algorithms driven bymodels in wireless networks where multiple-user communications with interference are present.Nowadays,data-driven machine learning methods have become quite popular in analyzing wireless communication systems,which among them deep reinforcement learning(DRL)has a significant role in solving optimization issues under certain constraints.To this purpose,in this paper,we investigate the PA problem in a k-user multiple access channels(MAC),where k transmitters(e.g.,mobile users)aim to send an independent message to a common receiver(e.g.,base station)through wireless channels.To this end,we first train the deep Q network(DQN)with a deep Q learning(DQL)algorithm over the simulation environment,utilizing offline learning.Then,the DQN will be used with the real data in the online training method for the PA issue by maximizing the sumrate subjected to the source power.Finally,the simulation results indicate that our proposedDQNmethod provides better performance in terms of the sumrate compared with the available DQL training approaches such as fractional programming(FP)and weighted minimum mean squared error(WMMSE).Additionally,by considering different user densities,we show that our proposed DQN outperforms benchmark algorithms,thereby,a good generalization ability is verified over wireless multi-user communication systems.展开更多
To solve the sparse reward problem of job-shop scheduling by deep reinforcement learning,a deep reinforcement learning framework considering sparse reward problem is proposed.The job shop scheduling problem is transfo...To solve the sparse reward problem of job-shop scheduling by deep reinforcement learning,a deep reinforcement learning framework considering sparse reward problem is proposed.The job shop scheduling problem is transformed into Markov decision process,and six state features are designed to improve the state feature representation by using two-way scheduling method,including four state features that distinguish the optimal action and two state features that are related to the learning goal.An extended variant of graph isomorphic network GIN++is used to encode disjunction graphs to improve the performance and generalization ability of the model.Through iterative greedy algorithm,random strategy is generated as the initial strategy,and the action with the maximum information gain is selected to expand it to optimize the exploration ability of Actor-Critic algorithm.Through validation of the trained policy model on multiple public test data sets and comparison with other advanced DRL methods and scheduling rules,the proposed method reduces the minimum average gap by 3.49%,5.31%and 4.16%,respectively,compared with the priority rule-based method,and 5.34%compared with the learning-based method.11.97%and 5.02%,effectively improving the accuracy of DRL to solve the approximate solution of JSSP minimum completion time.展开更多
Network-assisted full duplex(NAFD)cellfree(CF)massive MIMO has drawn increasing attention in 6G evolvement.In this paper,we build an NAFD CF system in which the users and access points(APs)can flexibly select their du...Network-assisted full duplex(NAFD)cellfree(CF)massive MIMO has drawn increasing attention in 6G evolvement.In this paper,we build an NAFD CF system in which the users and access points(APs)can flexibly select their duplex modes to increase the link spectral efficiency.Then we formulate a joint flexible duplexing and power allocation problem to balance the user fairness and system spectral efficiency.We further transform the problem into a probability optimization to accommodate the shortterm communications.In contrast with the instant performance optimization,the probability optimization belongs to a sequential decision making problem,and thus we reformulate it as a Markov Decision Process(MDP).We utilizes deep reinforcement learning(DRL)algorithm to search the solution from a large state-action space,and propose an asynchronous advantage actor-critic(A3C)-based scheme to reduce the chance of converging to the suboptimal policy.Simulation results demonstrate that the A3C-based scheme is superior to the baseline schemes in term of the complexity,accumulated log spectral efficiency,and stability.展开更多
The scale of ground-to-air confrontation task assignments is large and needs to deal with many concurrent task assignments and random events.Aiming at the problems where existing task assignment methods are applied to...The scale of ground-to-air confrontation task assignments is large and needs to deal with many concurrent task assignments and random events.Aiming at the problems where existing task assignment methods are applied to ground-to-air confrontation,there is low efficiency in dealing with complex tasks,and there are interactive conflicts in multiagent systems.This study proposes a multiagent architecture based on a one-general agent with multiple narrow agents(OGMN)to reduce task assignment conflicts.Considering the slow speed of traditional dynamic task assignment algorithms,this paper proposes the proximal policy optimization for task assignment of general and narrow agents(PPOTAGNA)algorithm.The algorithm based on the idea of the optimal assignment strategy algorithm and combined with the training framework of deep reinforcement learning(DRL)adds a multihead attention mechanism and a stage reward mechanism to the bilateral band clipping PPO algorithm to solve the problem of low training efficiency.Finally,simulation experiments are carried out in the digital battlefield.The multiagent architecture based on OGMN combined with the PPO-TAGNA algorithm can obtain higher rewards faster and has a higher win ratio.By analyzing agent behavior,the efficiency,superiority and rationality of resource utilization of this method are verified.展开更多
Limited by battery and computing re-sources,the computing-intensive tasks generated by Internet of Things(IoT)devices cannot be processed all by themselves.Mobile edge computing(MEC)is a suitable solution for this pro...Limited by battery and computing re-sources,the computing-intensive tasks generated by Internet of Things(IoT)devices cannot be processed all by themselves.Mobile edge computing(MEC)is a suitable solution for this problem,and the gener-ated tasks can be offloaded from IoT devices to MEC.In this paper,we study the problem of dynamic task offloading for digital twin-empowered MEC.Digital twin techniques are applied to provide information of environment and share the training data of agent de-ployed on IoT devices.We formulate the task offload-ing problem with the goal of maximizing the energy efficiency and the workload balance among the ESs.Then,we reformulate the problem as an MDP problem and design DRL-based energy efficient task offloading(DEETO)algorithm to solve it.Comparative experi-ments are carried out which show the superiority of our DEETO algorithm in improving energy efficiency and balancing the workload.展开更多
Flexible adaptation to differentiated quality of service(QoS)is quite important for future 6G network with a variety of services.Mobile ad hoc networks(MANETs)are able to provide flexible communication services to use...Flexible adaptation to differentiated quality of service(QoS)is quite important for future 6G network with a variety of services.Mobile ad hoc networks(MANETs)are able to provide flexible communication services to users through self-configuration and rapid deployment.However,the dynamic wireless environment,the limited resources,and complex QoS requirements have presented great challenges for network routing problems.Motivated by the development of artificial intelligence,a deep reinforcement learning-based collaborative routing(DRLCR)algorithm is proposed.Both routing policy and subchannel allocation are considered jointly,aiming at minimizing the end-to-end(E2E)delay and improving the network capacity.After sufficient training by the cluster head node,the Q-network can be synchronized to each member node to select the next hop based on local observation.Moreover,we improve the performance of training by considering historical observations,which can improve the adaptability of routing policies to dynamic environments.Simulation results show that the proposed DRLCR algorithm outperforms other algorithms in terms of resource utilization and E2E delay by optimizing network load to avoid congestion.In addition,the effectiveness of the routing policy in a dynamic environment is verified.展开更多
基金supported in part by NSFC (62102099, U22A2054, 62101594)in part by the Pearl River Talent Recruitment Program (2021QN02S643)+9 种基金Guangzhou Basic Research Program (2023A04J1699)in part by the National Research Foundation, SingaporeInfocomm Media Development Authority under its Future Communications Research Development ProgrammeDSO National Laboratories under the AI Singapore Programme under AISG Award No AISG2-RP-2020-019Energy Research Test-Bed and Industry Partnership Funding Initiative, Energy Grid (EG) 2.0 programmeDesCartes and the Campus for Research Excellence and Technological Enterprise (CREATE) programmeMOE Tier 1 under Grant RG87/22in part by the Singapore University of Technology and Design (SUTD) (SRG-ISTD-2021- 165)in part by the SUTD-ZJU IDEA Grant SUTD-ZJU (VP) 202102in part by the Ministry of Education, Singapore, through its SUTD Kickstarter Initiative (SKI 20210204)。
文摘Avatars, as promising digital representations and service assistants of users in Metaverses, can enable drivers and passengers to immerse themselves in 3D virtual services and spaces of UAV-assisted vehicular Metaverses. However, avatar tasks include a multitude of human-to-avatar and avatar-to-avatar interactive applications, e.g., augmented reality navigation,which consumes intensive computing resources. It is inefficient and impractical for vehicles to process avatar tasks locally. Fortunately, migrating avatar tasks to the nearest roadside units(RSU)or unmanned aerial vehicles(UAV) for execution is a promising solution to decrease computation overhead and reduce task processing latency, while the high mobility of vehicles brings challenges for vehicles to independently perform avatar migration decisions depending on current and future vehicle status. To address these challenges, in this paper, we propose a novel avatar task migration system based on multi-agent deep reinforcement learning(MADRL) to execute immersive vehicular avatar tasks dynamically. Specifically, we first formulate the problem of avatar task migration from vehicles to RSUs/UAVs as a partially observable Markov decision process that can be solved by MADRL algorithms. We then design the multi-agent proximal policy optimization(MAPPO) approach as the MADRL algorithm for the avatar task migration problem. To overcome slow convergence resulting from the curse of dimensionality and non-stationary issues caused by shared parameters in MAPPO, we further propose a transformer-based MAPPO approach via sequential decision-making models for the efficient representation of relationships among agents. Finally, to motivate terrestrial or non-terrestrial edge servers(e.g., RSUs or UAVs) to share computation resources and ensure traceability of the sharing records, we apply smart contracts and blockchain technologies to achieve secure sharing management. Numerical results demonstrate that the proposed approach outperforms the MAPPO approach by around 2% and effectively reduces approximately 20% of the latency of avatar task execution in UAV-assisted vehicular Metaverses.
基金supported by National Key Research & Development Program-Intergovernmental International Science and Technology Innovation Cooperation Project (2021YFE0112800)National Natural Science Foundation of China (Key Program: 62136003)+2 种基金National Natural Science Foundation of China (62073142)Fundamental Research Funds for the Central Universities (222202417006)Shanghai Al Lab
文摘The gasoline inline blending process has widely used real-time optimization techniques to achieve optimization objectives,such as minimizing the cost of production.However,the effectiveness of real-time optimization in gasoline blending relies on accurate blending models and is challenged by stochastic disturbances.Thus,we propose a real-time optimization algorithm based on the soft actor-critic(SAC)deep reinforcement learning strategy to optimize gasoline blending without relying on a single blending model and to be robust against disturbances.Our approach constructs the environment using nonlinear blending models and feedstocks with disturbances.The algorithm incorporates the Lagrange multiplier and path constraints in reward design to manage sparse product constraints.Carefully abstracted states facilitate algorithm convergence,and the normalized action vector in each optimization period allows the agent to generalize to some extent across different target production scenarios.Through these well-designed components,the algorithm based on the SAC outperforms real-time optimization methods based on either nonlinear or linear programming.It even demonstrates comparable performance with the time-horizon based real-time optimization method,which requires knowledge of uncertainty models,confirming its capability to handle uncertainty without accurate models.Our simulation illustrates a promising approach to free real-time optimization of the gasoline blending process from uncertainty models that are difficult to acquire in practice.
基金Supported by the China National Petroleum Corporation Limited-China University of Petroleum(Beijing)Strategic Cooperation Science and Technology Project(ZLZX2020-03).
文摘In the traditional well log depth matching tasks,manual adjustments are required,which means significantly labor-intensive for multiple wells,leading to low work efficiency.This paper introduces a multi-agent deep reinforcement learning(MARL)method to automate the depth matching of multi-well logs.This method defines multiple top-down dual sliding windows based on the convolutional neural network(CNN)to extract and capture similar feature sequences on well logs,and it establishes an interaction mechanism between agents and the environment to control the depth matching process.Specifically,the agent selects an action to translate or scale the feature sequence based on the double deep Q-network(DDQN).Through the feedback of the reward signal,it evaluates the effectiveness of each action,aiming to obtain the optimal strategy and improve the accuracy of the matching task.Our experiments show that MARL can automatically perform depth matches for well-logs in multiple wells,and reduce manual intervention.In the application to the oil field,a comparative analysis of dynamic time warping(DTW),deep Q-learning network(DQN),and DDQN methods revealed that the DDQN algorithm,with its dual-network evaluation mechanism,significantly improves performance by identifying and aligning more details in the well log feature sequences,thus achieving higher depth matching accuracy.
基金the National Natural Science Foundation of China(62076225,62073300)the Natural Science Foundation for Distinguished Young Scholars of Hubei(2019CFA081)。
文摘Solving constrained multi-objective optimization problems with evolutionary algorithms has attracted considerable attention.Various constrained multi-objective optimization evolutionary algorithms(CMOEAs)have been developed with the use of different algorithmic strategies,evolutionary operators,and constraint-handling techniques.The performance of CMOEAs may be heavily dependent on the operators used,however,it is usually difficult to select suitable operators for the problem at hand.Hence,improving operator selection is promising and necessary for CMOEAs.This work proposes an online operator selection framework assisted by Deep Reinforcement Learning.The dynamics of the population,including convergence,diversity,and feasibility,are regarded as the state;the candidate operators are considered as actions;and the improvement of the population state is treated as the reward.By using a Q-network to learn a policy to estimate the Q-values of all actions,the proposed approach can adaptively select an operator that maximizes the improvement of the population according to the current state and thereby improve the algorithmic performance.The framework is embedded into four popular CMOEAs and assessed on 42 benchmark problems.The experimental results reveal that the proposed Deep Reinforcement Learning-assisted operator selection significantly improves the performance of these CMOEAs and the resulting algorithm obtains better versatility compared to nine state-of-the-art CMOEAs.
基金supported by the National Natural Science Foundation of China(Grant No.61971057).
文摘In this paper,we propose the Two-way Deep Reinforcement Learning(DRL)-Based resource allocation algorithm,which solves the problem of resource allocation in the cognitive downlink network based on the underlay mode.Secondary users(SUs)in the cognitive network are multiplexed by a new Power Domain Sparse Code Multiple Access(PD-SCMA)scheme,and the physical resources of the cognitive base station are virtualized into two types of slices:enhanced mobile broadband(eMBB)slice and ultrareliable low latency communication(URLLC)slice.We design the Double Deep Q Network(DDQN)network output the optimal codebook assignment scheme and simultaneously use the Deep Deterministic Policy Gradient(DDPG)network output the optimal power allocation scheme.The objective is to jointly optimize the spectral efficiency of the system and the Quality of Service(QoS)of SUs.Simulation results show that the proposed algorithm outperforms the CNDDQN algorithm and modified JEERA algorithm in terms of spectral efficiency and QoS satisfaction.Additionally,compared with the Power Domain Non-orthogonal Multiple Access(PD-NOMA)slices and the Sparse Code Multiple Access(SCMA)slices,the PD-SCMA slices can dramatically enhance spectral efficiency and increase the number of accessible users.
基金supported by the Key Research and Development Program of Shaanxi (2022GXLH-02-09)the Aeronautical Science Foundation of China (20200051053001)the Natural Science Basic Research Program of Shaanxi (2020JM-147)。
文摘Autonomous umanned aerial vehicle(UAV) manipulation is necessary for the defense department to execute tactical missions given by commanders in the future unmanned battlefield. A large amount of research has been devoted to improving the autonomous decision-making ability of UAV in an interactive environment, where finding the optimal maneuvering decisionmaking policy became one of the key issues for enabling the intelligence of UAV. In this paper, we propose a maneuvering decision-making algorithm for autonomous air-delivery based on deep reinforcement learning under the guidance of expert experience. Specifically, we refine the guidance towards area and guidance towards specific point tasks for the air-delivery process based on the traditional air-to-surface fire control methods.Moreover, we construct the UAV maneuvering decision-making model based on Markov decision processes(MDPs). Specifically, we present a reward shaping method for the guidance towards area and guidance towards specific point tasks using potential-based function and expert-guided advice. The proposed algorithm could accelerate the convergence of the maneuvering decision-making policy and increase the stability of the policy in terms of the output during the later stage of training process. The effectiveness of the proposed maneuvering decision-making policy is illustrated by the curves of training parameters and extensive experimental results for testing the trained policy.
文摘As the demands of massive connections and vast coverage rapidly grow in the next wireless communication networks, rate splitting multiple access(RSMA) is considered to be the new promising access scheme since it can provide higher efficiency with limited spectrum resources. In this paper, combining spectrum splitting with rate splitting, we propose to allocate resources with traffic offloading in hybrid satellite terrestrial networks. A novel deep reinforcement learning method is adopted to solve this challenging non-convex problem. However, the neverending learning process could prohibit its practical implementation. Therefore, we introduce the switch mechanism to avoid unnecessary learning. Additionally, the QoS constraint in the scheme can rule out unsuccessful transmission. The simulation results validates the energy efficiency performance and the convergence speed of the proposed algorithm.
基金supported in part by the National Natural Science Foundation of China under Grants 62201105,62331017,and 62075024in part by the Natural Science Foundation of Chongqing under Grant cstc2021jcyj-msxmX0404+1 种基金in part by the Chongqing Municipal Education Commission under Grant KJQN202100643in part by Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515110056.
文摘The Multi-access Edge Cloud(MEC) networks extend cloud computing services and capabilities to the edge of the networks. By bringing computation and storage capabilities closer to end-users and connected devices, MEC networks can support a wide range of applications. MEC networks can also leverage various types of resources, including computation resources, network resources, radio resources,and location-based resources, to provide multidimensional resources for intelligent applications in 5/6G.However, tasks generated by users often consist of multiple subtasks that require different types of resources. It is a challenging problem to offload multiresource task requests to the edge cloud aiming at maximizing benefits due to the heterogeneity of resources provided by devices. To address this issue,we mathematically model the task requests with multiple subtasks. Then, the problem of task offloading of multi-resource task requests is proved to be NP-hard. Furthermore, we propose a novel Dual-Agent Deep Reinforcement Learning algorithm with Node First and Link features(NF_L_DA_DRL) based on the policy network, to optimize the benefits generated by offloading multi-resource task requests in MEC networks. Finally, simulation results show that the proposed algorithm can effectively improve the benefit of task offloading with higher resource utilization compared with baseline algorithms.
基金supported by Jilin Provincial Science and Technology Department Natural Science Foundation of China(20210101415JC)Jilin Provincial Science and Technology Department Free exploration research project of China(YDZJ202201ZYTS642).
文摘Emerging mobile edge computing(MEC)is considered a feasible solution for offloading the computation-intensive request tasks generated from mobile wireless equipment(MWE)with limited computational resources and energy.Due to the homogeneity of request tasks from one MWE during a longterm time period,it is vital to predeploy the particular service cachings required by the request tasks at the MEC server.In this paper,we model a service caching-assisted MEC framework that takes into account the constraint on the number of service cachings hosted by each edge server and the migration of request tasks from the current edge server to another edge server with service caching required by tasks.Furthermore,we propose a multiagent deep reinforcement learning-based computation offloading and task migrating decision-making scheme(MBOMS)to minimize the long-term average weighted cost.The proposed MBOMS can learn the near-optimal offloading and migrating decision-making policy by centralized training and decentralized execution.Systematic and comprehensive simulation results reveal that our proposed MBOMS can converge well after training and outperforms the other five baseline algorithms.
基金supported by the Major Science and Technology Programs in Henan Province(No.241100210100)The Project of Science and Technology in Henan Province(No.242102211068,No.232102210078)+2 种基金The Key Field Special Project of Guangdong Province(No.2021ZDZX1098)The China University Research Innovation Fund(No.2021FNB3001,No.2022IT020)Shenzhen Science and Technology Innovation Commission Stable Support Plan(No.20231128083944001)。
文摘Existing researches on cyber attackdefense analysis have typically adopted stochastic game theory to model the problem for solutions,but the assumption of complete rationality is used in modeling,ignoring the information opacity in practical attack and defense scenarios,and the model and method lack accuracy.To such problem,we investigate network defense policy methods under finite rationality constraints and propose network defense policy selection algorithm based on deep reinforcement learning.Based on graph theoretical methods,we transform the decision-making problem into a path optimization problem,and use a compression method based on service node to map the network state.On this basis,we improve the A3C algorithm and design the DefenseA3C defense policy selection algorithm with online learning capability.The experimental results show that the model and method proposed in this paper can stably converge to a better network state after training,which is faster and more stable than the original A3C algorithm.Compared with the existing typical approaches,Defense-A3C is verified its advancement.
基金State Grid Corporation of China Science and Technology Project“Research andApplication of Key Technologies for Trusted Issuance and Security Control of Electronic Licenses for Power Business”(5700-202353318A-1-1-ZN).
文摘To enhance the efficiency and expediency of issuing e-licenses within the power sector, we must confront thechallenge of managing the surging demand for data traffic. Within this realm, the network imposes stringentQuality of Service (QoS) requirements, revealing the inadequacies of traditional routing allocation mechanismsin accommodating such extensive data flows. In response to the imperative of handling a substantial influx of datarequests promptly and alleviating the constraints of existing technologies and network congestion, we present anarchitecture forQoS routing optimizationwith in SoftwareDefinedNetwork (SDN), leveraging deep reinforcementlearning. This innovative approach entails the separation of SDN control and transmission functionalities, centralizingcontrol over data forwardingwhile integrating deep reinforcement learning for informed routing decisions. Byfactoring in considerations such as delay, bandwidth, jitter rate, and packet loss rate, we design a reward function toguide theDeepDeterministic PolicyGradient (DDPG) algorithmin learning the optimal routing strategy to furnishsuperior QoS provision. In our empirical investigations, we juxtapose the performance of Deep ReinforcementLearning (DRL) against that of Shortest Path (SP) algorithms in terms of data packet transmission delay. Theexperimental simulation results show that our proposed algorithm has significant efficacy in reducing networkdelay and improving the overall transmission efficiency, which is superior to the traditional methods.
基金Joint Funds of the National Natural Science Foundation of China,Grant/Award Number:U21A20518National Natural Science Foundation of China,Grant/Award Numbers:62106279,61903372。
文摘Policy evaluation(PE)is a critical sub-problem in reinforcement learning,which estimates the value function for a given policy and can be used for policy improvement.However,there still exist some limitations in current PE methods,such as low sample efficiency and local convergence,especially on complex tasks.In this study,a novel PE algorithm called Least-Squares Truncated Temporal-Difference learning(LST2D)is proposed.In LST2D,an adaptive truncation mechanism is designed,which effectively takes advantage of the fast convergence property of Least-Squares Temporal Difference learning and the asymptotic convergence property of Temporal Difference learning(TD).Then,two feature pre-training methods are utilised to improve the approximation ability of LST2D.Furthermore,an Actor-Critic algorithm based on LST2D and pre-trained feature representations(ACLPF)is proposed,where LST2D is integrated into the critic network to improve learning-prediction efficiency.Comprehensive simulation studies were conducted on four robotic tasks,and the corresponding results illustrate the effectiveness of LST2D.The proposed ACLPF algorithm outperformed DQN,ACER and PPO in terms of sample efficiency and stability,which demonstrated that LST2D can be applied to online learning control problems by incorporating it into the actor-critic architecture.
基金This research was funded by the Natural Science Foundation of Gansu Province with Approval Numbers 20JR10RA334 and 21JR7RA570Funding is provided for the 2021 Longyuan Youth Innovation and Entrepreneurship Talent Project with Approval Number 2021LQGR20+1 种基金the University Level Innovation Project with Approval NumbersGZF2020XZD18jbzxyb2018-01 of Gansu University of Political Science and Law.
文摘Image description task is the intersection of computer vision and natural language processing,and it has important prospects,including helping computers understand images and obtaining information for the visually impaired.This study presents an innovative approach employing deep reinforcement learning to enhance the accuracy of natural language descriptions of images.Our method focuses on refining the reward function in deep reinforcement learning,facilitating the generation of precise descriptions by aligning visual and textual features more closely.Our approach comprises three key architectures.Firstly,it utilizes Residual Network 101(ResNet-101)and Faster Region-based Convolutional Neural Network(Faster R-CNN)to extract average and local image features,respectively,followed by the implementation of a dual attention mechanism for intricate feature fusion.Secondly,the Transformer model is engaged to derive contextual semantic features from textual data.Finally,the generation of descriptive text is executed through a two-layer long short-term memory network(LSTM),directed by the value and reward functions.Compared with the image description method that relies on deep learning,the score of Bilingual Evaluation Understudy(BLEU-1)is 0.762,which is 1.6%higher,and the score of BLEU-4 is 0.299.Consensus-based Image Description Evaluation(CIDEr)scored 0.998,Recall-Oriented Understudy for Gisting Evaluation(ROUGE)scored 0.552,the latter improved by 0.36%.These results not only attest to the viability of our approach but also highlight its superiority in the realm of image description.Future research can explore the integration of our method with other artificial intelligence(AI)domains,such as emotional AI,to create more nuanced and context-aware systems.
基金the State Grid Liaoning Electric Power Supply Co.,Ltd.(Research on Scheduling Decision Technology Based on Interactive Reinforcement Learning for Adapting High Proportion of New Energy,No.2023YF-49).
文摘Traditional optimal scheduling methods are limited to accurate physical models and parameter settings, which aredifficult to adapt to the uncertainty of source and load, and there are problems such as the inability to make dynamicdecisions continuously. This paper proposed a dynamic economic scheduling method for distribution networksbased on deep reinforcement learning. Firstly, the economic scheduling model of the new energy distributionnetwork is established considering the action characteristics of micro-gas turbines, and the dynamic schedulingmodel based on deep reinforcement learning is constructed for the new energy distribution network system with ahigh proportion of new energy, and the Markov decision process of the model is defined. Secondly, Second, for thechanging characteristics of source-load uncertainty, agents are trained interactively with the distributed networkin a data-driven manner. Then, through the proximal policy optimization algorithm, agents adaptively learn thescheduling strategy and realize the dynamic scheduling decision of the new energy distribution network system.Finally, the feasibility and superiority of the proposed method are verified by an improved IEEE 33-node simulationsystem.
文摘Formany years,researchers have explored power allocation(PA)algorithms driven bymodels in wireless networks where multiple-user communications with interference are present.Nowadays,data-driven machine learning methods have become quite popular in analyzing wireless communication systems,which among them deep reinforcement learning(DRL)has a significant role in solving optimization issues under certain constraints.To this purpose,in this paper,we investigate the PA problem in a k-user multiple access channels(MAC),where k transmitters(e.g.,mobile users)aim to send an independent message to a common receiver(e.g.,base station)through wireless channels.To this end,we first train the deep Q network(DQN)with a deep Q learning(DQL)algorithm over the simulation environment,utilizing offline learning.Then,the DQN will be used with the real data in the online training method for the PA issue by maximizing the sumrate subjected to the source power.Finally,the simulation results indicate that our proposedDQNmethod provides better performance in terms of the sumrate compared with the available DQL training approaches such as fractional programming(FP)and weighted minimum mean squared error(WMMSE).Additionally,by considering different user densities,we show that our proposed DQN outperforms benchmark algorithms,thereby,a good generalization ability is verified over wireless multi-user communication systems.
基金Shaanxi Provincial Key Research and Development Project(2023YBGY095)and Shaanxi Provincial Qin Chuangyuan"Scientist+Engineer"project(2023KXJ247)Fund support.
文摘To solve the sparse reward problem of job-shop scheduling by deep reinforcement learning,a deep reinforcement learning framework considering sparse reward problem is proposed.The job shop scheduling problem is transformed into Markov decision process,and six state features are designed to improve the state feature representation by using two-way scheduling method,including four state features that distinguish the optimal action and two state features that are related to the learning goal.An extended variant of graph isomorphic network GIN++is used to encode disjunction graphs to improve the performance and generalization ability of the model.Through iterative greedy algorithm,random strategy is generated as the initial strategy,and the action with the maximum information gain is selected to expand it to optimize the exploration ability of Actor-Critic algorithm.Through validation of the trained policy model on multiple public test data sets and comparison with other advanced DRL methods and scheduling rules,the proposed method reduces the minimum average gap by 3.49%,5.31%and 4.16%,respectively,compared with the priority rule-based method,and 5.34%compared with the learning-based method.11.97%and 5.02%,effectively improving the accuracy of DRL to solve the approximate solution of JSSP minimum completion time.
基金supported by the National Key R&D Program of China under Grant 2020YFB1807204the BUPT Excellent Ph.D.Students Foundation under Grant CX2022306。
文摘Network-assisted full duplex(NAFD)cellfree(CF)massive MIMO has drawn increasing attention in 6G evolvement.In this paper,we build an NAFD CF system in which the users and access points(APs)can flexibly select their duplex modes to increase the link spectral efficiency.Then we formulate a joint flexible duplexing and power allocation problem to balance the user fairness and system spectral efficiency.We further transform the problem into a probability optimization to accommodate the shortterm communications.In contrast with the instant performance optimization,the probability optimization belongs to a sequential decision making problem,and thus we reformulate it as a Markov Decision Process(MDP).We utilizes deep reinforcement learning(DRL)algorithm to search the solution from a large state-action space,and propose an asynchronous advantage actor-critic(A3C)-based scheme to reduce the chance of converging to the suboptimal policy.Simulation results demonstrate that the A3C-based scheme is superior to the baseline schemes in term of the complexity,accumulated log spectral efficiency,and stability.
基金the Project of National Natural Science Foundation of China(Grant No.62106283)the Project of National Natural Science Foundation of China(Grant No.72001214)to provide fund for conducting experimentsthe Project of Natural Science Foundation of Shaanxi Province(Grant No.2020JQ-484)。
文摘The scale of ground-to-air confrontation task assignments is large and needs to deal with many concurrent task assignments and random events.Aiming at the problems where existing task assignment methods are applied to ground-to-air confrontation,there is low efficiency in dealing with complex tasks,and there are interactive conflicts in multiagent systems.This study proposes a multiagent architecture based on a one-general agent with multiple narrow agents(OGMN)to reduce task assignment conflicts.Considering the slow speed of traditional dynamic task assignment algorithms,this paper proposes the proximal policy optimization for task assignment of general and narrow agents(PPOTAGNA)algorithm.The algorithm based on the idea of the optimal assignment strategy algorithm and combined with the training framework of deep reinforcement learning(DRL)adds a multihead attention mechanism and a stage reward mechanism to the bilateral band clipping PPO algorithm to solve the problem of low training efficiency.Finally,simulation experiments are carried out in the digital battlefield.The multiagent architecture based on OGMN combined with the PPO-TAGNA algorithm can obtain higher rewards faster and has a higher win ratio.By analyzing agent behavior,the efficiency,superiority and rationality of resource utilization of this method are verified.
基金This work was partly supported by the Project of Cultivation for young top-motch Talents of Beijing Municipal Institutions(No BPHR202203225)the Young Elite Scientists Sponsorship Program by BAST(BYESS2023031)the National key research and development program(No 2022YFF0604502).
文摘Limited by battery and computing re-sources,the computing-intensive tasks generated by Internet of Things(IoT)devices cannot be processed all by themselves.Mobile edge computing(MEC)is a suitable solution for this problem,and the gener-ated tasks can be offloaded from IoT devices to MEC.In this paper,we study the problem of dynamic task offloading for digital twin-empowered MEC.Digital twin techniques are applied to provide information of environment and share the training data of agent de-ployed on IoT devices.We formulate the task offload-ing problem with the goal of maximizing the energy efficiency and the workload balance among the ESs.Then,we reformulate the problem as an MDP problem and design DRL-based energy efficient task offloading(DEETO)algorithm to solve it.Comparative experi-ments are carried out which show the superiority of our DEETO algorithm in improving energy efficiency and balancing the workload.
基金supported by the 2020 National Key R&D Program"Broadband Communication and New Network"special"6G Network Architecture and Key Technologies"(2020YFB1806700)。
文摘Flexible adaptation to differentiated quality of service(QoS)is quite important for future 6G network with a variety of services.Mobile ad hoc networks(MANETs)are able to provide flexible communication services to users through self-configuration and rapid deployment.However,the dynamic wireless environment,the limited resources,and complex QoS requirements have presented great challenges for network routing problems.Motivated by the development of artificial intelligence,a deep reinforcement learning-based collaborative routing(DRLCR)algorithm is proposed.Both routing policy and subchannel allocation are considered jointly,aiming at minimizing the end-to-end(E2E)delay and improving the network capacity.After sufficient training by the cluster head node,the Q-network can be synchronized to each member node to select the next hop based on local observation.Moreover,we improve the performance of training by considering historical observations,which can improve the adaptability of routing policies to dynamic environments.Simulation results show that the proposed DRLCR algorithm outperforms other algorithms in terms of resource utilization and E2E delay by optimizing network load to avoid congestion.In addition,the effectiveness of the routing policy in a dynamic environment is verified.