Model-based reinforcement learning is a promising direction to improve the sample efficiency of reinforcement learning with learning a model of the environment.Previous model learning methods aim at fitting the transi...Model-based reinforcement learning is a promising direction to improve the sample efficiency of reinforcement learning with learning a model of the environment.Previous model learning methods aim at fitting the transition data,and commonly employ a supervised learning approach to minimize the distance between the predicted state and the real state.The supervised model learning methods,however,diverge from the ultimate goal of model learning,i.e.,optimizing the learned-in-the-model policy.In this work,we investigate how model learning and policy learning can share the same objective of maximizing the expected return in the real environment.We find model learning towards this objective can result in a target of enhancing the similarity between the gradient on generated data and the gradient on the real data.We thus derive the gradient of the model from this target and propose the Model Gradient algorithm(MG)to integrate this novel model learning approach with policy-gradient-based policy optimization.We conduct experiments on multiple locomotion control tasks and find that MG can not only achieve high sample efficiency but also lead to better convergence performance compared to traditional model-based reinforcement learning approaches.展开更多
Fifth-generation(5G)systems have brought about new challenges toward ensuring Quality of Service(QoS)in differentiated services.This includes low latency applications,scalable machine-to-machine communication,and enha...Fifth-generation(5G)systems have brought about new challenges toward ensuring Quality of Service(QoS)in differentiated services.This includes low latency applications,scalable machine-to-machine communication,and enhanced mobile broadband connectivity.In order to satisfy these requirements,the concept of network slicing has been introduced to generate slices of the network with specific characteristics.In order to meet the requirements of network slices,routers and switches must be effectively configured to provide priority queue provisioning,resource contention management and adaptation.Configuring routers from vendors,such as Ericsson,Cisco,and Juniper,have traditionally been an expert-driven process with static rules for individual flows,which are prone to sub optimal configurations with varying traffic conditions.In this paper,we model the internal ingress and egress queues within routers via a queuing model.The effects of changing queue configuration with respect to priority,weights,flow limits,and packet drops are studied in detail.This is used to train a model-based Reinforcement Learning(RL)algorithm to generate optimal policies for flow prioritization,fairness,and congestion control.The efficacy of the RL policy output is demonstrated over scenarios involving ingress queue traffic policing,egress queue traffic shaping,and one-hop router coordinated traffic conditioning.This is evaluated over a real application use case,wherein a statically configured router proved sub optimal toward desired QoS requirements.Such automated configuration of routers and switches will be critical for multiple 5G deployments with varying flow requirements and traffic patterns.展开更多
Reinforcement learning(RL) has roots in dynamic programming and it is called adaptive/approximate dynamic programming(ADP) within the control community. This paper reviews recent developments in ADP along with RL and ...Reinforcement learning(RL) has roots in dynamic programming and it is called adaptive/approximate dynamic programming(ADP) within the control community. This paper reviews recent developments in ADP along with RL and its applications to various advanced control fields. First, the background of the development of ADP is described, emphasizing the significance of regulation and tracking control problems. Some effective offline and online algorithms for ADP/adaptive critic control are displayed, where the main results towards discrete-time systems and continuous-time systems are surveyed, respectively.Then, the research progress on adaptive critic control based on the event-triggered framework and under uncertain environment is discussed, respectively, where event-based design, robust stabilization, and game design are reviewed. Moreover, the extensions of ADP for addressing control problems under complex environment attract enormous attention. The ADP architecture is revisited under the perspective of data-driven and RL frameworks,showing how they promote ADP formulation significantly.Finally, several typical control applications with respect to RL and ADP are summarized, particularly in the fields of wastewater treatment processes and power systems, followed by some general prospects for future research. Overall, the comprehensive survey on ADP and RL for advanced control applications has d emonstrated its remarkable potential within the artificial intelligence era. In addition, it also plays a vital role in promoting environmental protection and industrial intelligence.展开更多
Avatars, as promising digital representations and service assistants of users in Metaverses, can enable drivers and passengers to immerse themselves in 3D virtual services and spaces of UAV-assisted vehicular Metavers...Avatars, as promising digital representations and service assistants of users in Metaverses, can enable drivers and passengers to immerse themselves in 3D virtual services and spaces of UAV-assisted vehicular Metaverses. However, avatar tasks include a multitude of human-to-avatar and avatar-to-avatar interactive applications, e.g., augmented reality navigation,which consumes intensive computing resources. It is inefficient and impractical for vehicles to process avatar tasks locally. Fortunately, migrating avatar tasks to the nearest roadside units(RSU)or unmanned aerial vehicles(UAV) for execution is a promising solution to decrease computation overhead and reduce task processing latency, while the high mobility of vehicles brings challenges for vehicles to independently perform avatar migration decisions depending on current and future vehicle status. To address these challenges, in this paper, we propose a novel avatar task migration system based on multi-agent deep reinforcement learning(MADRL) to execute immersive vehicular avatar tasks dynamically. Specifically, we first formulate the problem of avatar task migration from vehicles to RSUs/UAVs as a partially observable Markov decision process that can be solved by MADRL algorithms. We then design the multi-agent proximal policy optimization(MAPPO) approach as the MADRL algorithm for the avatar task migration problem. To overcome slow convergence resulting from the curse of dimensionality and non-stationary issues caused by shared parameters in MAPPO, we further propose a transformer-based MAPPO approach via sequential decision-making models for the efficient representation of relationships among agents. Finally, to motivate terrestrial or non-terrestrial edge servers(e.g., RSUs or UAVs) to share computation resources and ensure traceability of the sharing records, we apply smart contracts and blockchain technologies to achieve secure sharing management. Numerical results demonstrate that the proposed approach outperforms the MAPPO approach by around 2% and effectively reduces approximately 20% of the latency of avatar task execution in UAV-assisted vehicular Metaverses.展开更多
To solve the problem of the low interference success rate of air defense missile radio fuzes due to the unified interference form of the traditional fuze interference system,an interference decision method based Q-lea...To solve the problem of the low interference success rate of air defense missile radio fuzes due to the unified interference form of the traditional fuze interference system,an interference decision method based Q-learning algorithm is proposed.First,dividing the distance between the missile and the target into multiple states to increase the quantity of state spaces.Second,a multidimensional motion space is utilized,and the search range of which changes with the distance of the projectile,to select parameters and minimize the amount of ineffective interference parameters.The interference effect is determined by detecting whether the fuze signal disappears.Finally,a weighted reward function is used to determine the reward value based on the range state,output power,and parameter quantity information of the interference form.The effectiveness of the proposed method in selecting the range of motion space parameters and designing the discrimination degree of the reward function has been verified through offline experiments involving full-range missile rendezvous.The optimal interference form for each distance state has been obtained.Compared with the single-interference decision method,the proposed decision method can effectively improve the success rate of interference.展开更多
With the rapid advancement of quantum computing,hybrid quantum–classical machine learning has shown numerous potential applications at the current stage,with expectations of being achievable in the noisy intermediate...With the rapid advancement of quantum computing,hybrid quantum–classical machine learning has shown numerous potential applications at the current stage,with expectations of being achievable in the noisy intermediate-scale quantum(NISQ)era.Quantum reinforcement learning,as an indispensable study,has recently demonstrated its ability to solve standard benchmark environments with formally provable theoretical advantages over classical counterparts.However,despite the progress of quantum processors and the emergence of quantum computing clouds,implementing quantum reinforcement learning algorithms utilizing parameterized quantum circuits(PQCs)on NISQ devices remains infrequent.In this work,we take the first step towards executing benchmark quantum reinforcement problems on real devices equipped with at most 136 qubits on the BAQIS Quafu quantum computing cloud.The experimental results demonstrate that the policy agents can successfully accomplish objectives under modified conditions in both the training and inference phases.Moreover,we design hardware-efficient PQC architectures in the quantum model using a multi-objective evolutionary algorithm and develop a learning algorithm that is adaptable to quantum devices.We hope that the Quafu-RL can be a guiding example to show how to realize machine learning tasks by taking advantage of quantum computers on the quantum cloud platform.展开更多
While autonomous vehicles are vital components of intelligent transportation systems,ensuring the trustworthiness of decision-making remains a substantial challenge in realizing autonomous driving.Therefore,we present...While autonomous vehicles are vital components of intelligent transportation systems,ensuring the trustworthiness of decision-making remains a substantial challenge in realizing autonomous driving.Therefore,we present a novel robust reinforcement learning approach with safety guarantees to attain trustworthy decision-making for autonomous vehicles.The proposed technique ensures decision trustworthiness in terms of policy robustness and collision safety.Specifically,an adversary model is learned online to simulate the worst-case uncertainty by approximating the optimal adversarial perturbations on the observed states and environmental dynamics.In addition,an adversarial robust actor-critic algorithm is developed to enable the agent to learn robust policies against perturbations in observations and dynamics.Moreover,we devise a safety mask to guarantee the collision safety of the autonomous driving agent during both the training and testing processes using an interpretable knowledge model known as the Responsibility-Sensitive Safety Model.Finally,the proposed approach is evaluated through both simulations and experiments.These results indicate that the autonomous driving agent can make trustworthy decisions and drastically reduce the number of collisions through robust safety policies.展开更多
Contract Bridge,a four-player imperfect information game,comprises two phases:bidding and playing.While computer programs excel at playing,bidding presents a challenging aspect due to the need for information exchange...Contract Bridge,a four-player imperfect information game,comprises two phases:bidding and playing.While computer programs excel at playing,bidding presents a challenging aspect due to the need for information exchange with partners and interference with communication of opponents.In this work,we introduce a Bridge bidding agent that combines supervised learning,deep reinforcement learning via self-play,and a test-time search approach.Our experiments demonstrate that our agent outperforms WBridge5,a highly regarded computer Bridge software that has won multiple world championships,by a performance of 0.98 IMPs(international match points)per deal over 10000 deals,with a much cost-effective approach.The performance significantly surpasses previous state-of-the-art(0.85 IMPs per deal).Note 0.1 IMPs per deal is a significant improvement in Bridge bidding.展开更多
This paper focuses on the scheduling problem of workflow tasks that exhibit interdependencies.Unlike indepen-dent batch tasks,workflows typically consist of multiple subtasks with intrinsic correlations and dependenci...This paper focuses on the scheduling problem of workflow tasks that exhibit interdependencies.Unlike indepen-dent batch tasks,workflows typically consist of multiple subtasks with intrinsic correlations and dependencies.It necessitates the distribution of various computational tasks to appropriate computing node resources in accor-dance with task dependencies to ensure the smooth completion of the entire workflow.Workflow scheduling must consider an array of factors,including task dependencies,availability of computational resources,and the schedulability of tasks.Therefore,this paper delves into the distributed graph database workflow task scheduling problem and proposes a workflow scheduling methodology based on deep reinforcement learning(DRL).The method optimizes the maximum completion time(makespan)and response time of workflow tasks,aiming to enhance the responsiveness of workflow tasks while ensuring the minimization of the makespan.The experimental results indicate that the Q-learning Deep Reinforcement Learning(Q-DRL)algorithm markedly diminishes the makespan and refines the average response time within distributed graph database environments.In quantifying makespan,Q-DRL achieves mean reductions of 12.4%and 11.9%over established First-fit and Random scheduling strategies,respectively.Additionally,Q-DRL surpasses the performance of both DRL-Cloud and Improved Deep Q-learning Network(IDQN)algorithms,with improvements standing at 4.4%and 2.6%,respectively.With reference to average response time,the Q-DRL approach exhibits a significantly enhanced performance in the scheduling of workflow tasks,decreasing the average by 2.27%and 4.71%when compared to IDQN and DRL-Cloud,respectively.The Q-DRL algorithm also demonstrates a notable increase in the efficiency of system resource utilization,reducing the average idle rate by 5.02%and 9.30%in comparison to IDQN and DRL-Cloud,respectively.These findings support the assertion that Q-DRL not only upholds a lower average idle rate but also effectively curtails the average response time,thereby substantially improving processing efficiency and optimizing resource utilization within distributed graph database systems.展开更多
To enhance the efficiency and expediency of issuing e-licenses within the power sector, we must confront thechallenge of managing the surging demand for data traffic. Within this realm, the network imposes stringentQu...To enhance the efficiency and expediency of issuing e-licenses within the power sector, we must confront thechallenge of managing the surging demand for data traffic. Within this realm, the network imposes stringentQuality of Service (QoS) requirements, revealing the inadequacies of traditional routing allocation mechanismsin accommodating such extensive data flows. In response to the imperative of handling a substantial influx of datarequests promptly and alleviating the constraints of existing technologies and network congestion, we present anarchitecture forQoS routing optimizationwith in SoftwareDefinedNetwork (SDN), leveraging deep reinforcementlearning. This innovative approach entails the separation of SDN control and transmission functionalities, centralizingcontrol over data forwardingwhile integrating deep reinforcement learning for informed routing decisions. Byfactoring in considerations such as delay, bandwidth, jitter rate, and packet loss rate, we design a reward function toguide theDeepDeterministic PolicyGradient (DDPG) algorithmin learning the optimal routing strategy to furnishsuperior QoS provision. In our empirical investigations, we juxtapose the performance of Deep ReinforcementLearning (DRL) against that of Shortest Path (SP) algorithms in terms of data packet transmission delay. Theexperimental simulation results show that our proposed algorithm has significant efficacy in reducing networkdelay and improving the overall transmission efficiency, which is superior to the traditional methods.展开更多
The development of communication technology will promote the application of Internet of Things,and Beyond 5G will become a new technology promoter.At the same time,Beyond 5G will become one of the important supports f...The development of communication technology will promote the application of Internet of Things,and Beyond 5G will become a new technology promoter.At the same time,Beyond 5G will become one of the important supports for the development of edge computing technology.This paper proposes a communication task allocation algorithm based on deep reinforcement learning for vehicle-to-pedestrian communication scenarios in edge computing.Through trial and error learning of agent,the optimal spectrum and power can be determined for transmission without global information,so as to balance the communication between vehicle-to-pedestrian and vehicle-to-infrastructure.The results show that the agent can effectively improve vehicle-to-infrastructure communication rate as well as meeting the delay constraints on the vehicle-to-pedestrian link.展开更多
Heat integration is important for energy-saving in the process industry.It is linked to the persistently challenging task of optimal design of heat exchanger networks(HEN).Due to the inherent highly nonconvex nonlinea...Heat integration is important for energy-saving in the process industry.It is linked to the persistently challenging task of optimal design of heat exchanger networks(HEN).Due to the inherent highly nonconvex nonlinear and combinatorial nature of the HEN problem,it is not easy to find solutions of high quality for large-scale problems.The reinforcement learning(RL)method,which learns strategies through ongoing exploration and exploitation,reveals advantages in such area.However,due to the complexity of the HEN design problem,the RL method for HEN should be dedicated and designed.A hybrid strategy combining RL with mathematical programming is proposed to take better advantage of both methods.An insightful state representation of the HEN structure as well as a customized reward function is introduced.A Q-learning algorithm is applied to update the HEN structure using theε-greedy strategy.Better results are obtained from three literature cases of different scales.展开更多
The gasoline inline blending process has widely used real-time optimization techniques to achieve optimization objectives,such as minimizing the cost of production.However,the effectiveness of real-time optimization i...The gasoline inline blending process has widely used real-time optimization techniques to achieve optimization objectives,such as minimizing the cost of production.However,the effectiveness of real-time optimization in gasoline blending relies on accurate blending models and is challenged by stochastic disturbances.Thus,we propose a real-time optimization algorithm based on the soft actor-critic(SAC)deep reinforcement learning strategy to optimize gasoline blending without relying on a single blending model and to be robust against disturbances.Our approach constructs the environment using nonlinear blending models and feedstocks with disturbances.The algorithm incorporates the Lagrange multiplier and path constraints in reward design to manage sparse product constraints.Carefully abstracted states facilitate algorithm convergence,and the normalized action vector in each optimization period allows the agent to generalize to some extent across different target production scenarios.Through these well-designed components,the algorithm based on the SAC outperforms real-time optimization methods based on either nonlinear or linear programming.It even demonstrates comparable performance with the time-horizon based real-time optimization method,which requires knowledge of uncertainty models,confirming its capability to handle uncertainty without accurate models.Our simulation illustrates a promising approach to free real-time optimization of the gasoline blending process from uncertainty models that are difficult to acquire in practice.展开更多
The new energy vehicle plays a crucial role in green transportation,and the energy management strategy of hybrid power systems is essential for ensuring energy-efficient driving.This paper presents a state-of-the-art ...The new energy vehicle plays a crucial role in green transportation,and the energy management strategy of hybrid power systems is essential for ensuring energy-efficient driving.This paper presents a state-of-the-art survey and review of reinforcement learning-based energy management strategies for hybrid power systems.Additionally,it envisions the outlook for autonomous intelligent hybrid electric vehicles,with reinforcement learning as the foundational technology.First of all,to provide a macro view of historical development,the brief history of deep learning,reinforcement learning,and deep reinforcement learning is presented in the form of a timeline.Then,the comprehensive survey and review are conducted by collecting papers from mainstream academic databases.Enumerating most of the contributions based on three main directions—algorithm innovation,powertrain innovation,and environment innovation—provides an objective review of the research status.Finally,to advance the application of reinforcement learning in autonomous intelligent hybrid electric vehicles,future research plans positioned as“Alpha HEV”are envisioned,integrating Autopilot and energy-saving control.展开更多
In recent times,various power control and clustering approaches have been proposed to enhance overall performance for cell-free massive multipleinput multiple-output(CF-mMIMO)networks.With the emergence of deep reinfo...In recent times,various power control and clustering approaches have been proposed to enhance overall performance for cell-free massive multipleinput multiple-output(CF-mMIMO)networks.With the emergence of deep reinforcement learning(DRL),significant progress has been made in the field of network optimization as DRL holds great promise for improving network performance and efficiency.In this work,our focus delves into the intricate challenge of joint cooperation clustering and downlink power control within CF-mMIMO networks.Leveraging the potent deep deterministic policy gradient(DDPG)algorithm,our objective is to maximize the proportional fairness(PF)for user rates,thereby aiming to achieve optimal network performance and resource utilization.Moreover,we harness the concept of“divide and conquer”strategy,introducing two innovative methods termed alternating DDPG(A-DDPG)and hierarchical DDPG(H-DDPG).These approaches aim to decompose the intricate joint optimization problem into more manageable sub-problems,thereby facilitating a more efficient resolution process.Our findings unequivo-cally showcase the superior efficacy of our proposed DDPG approach over the baseline schemes in both clustering and downlink power control.Furthermore,the A-DDPG and H-DDPG obtain higher performance gain than DDPG with lower computational complexity.展开更多
Vehicular edge computing(VEC)is emerging as a promising solution paradigm to meet the requirements of compute-intensive applications in internet of vehicle(IoV).Non-orthogonal multiple access(NOMA)has advantages in im...Vehicular edge computing(VEC)is emerging as a promising solution paradigm to meet the requirements of compute-intensive applications in internet of vehicle(IoV).Non-orthogonal multiple access(NOMA)has advantages in improving spectrum efficiency and dealing with bandwidth scarcity and cost.It is an encouraging progress combining VEC and NOMA.In this paper,we jointly optimize task offloading decision and resource allocation to maximize the service utility of the NOMA-VEC system.To solve the optimization problem,we propose a multiagent deep graph reinforcement learning algorithm.The algorithm extracts the topological features and relationship information between agents from the system state as observations,outputs task offloading decision and resource allocation simultaneously with local policy network,which is updated by a local learner.Simulation results demonstrate that the proposed method achieves a 1.52%∼5.80%improvement compared with the benchmark algorithms in system service utility.展开更多
This survey paper provides a review and perspective on intermediate and advanced reinforcement learning(RL)techniques in process industries. It offers a holistic approach by covering all levels of the process control ...This survey paper provides a review and perspective on intermediate and advanced reinforcement learning(RL)techniques in process industries. It offers a holistic approach by covering all levels of the process control hierarchy. The survey paper presents a comprehensive overview of RL algorithms,including fundamental concepts like Markov decision processes and different approaches to RL, such as value-based, policy-based, and actor-critic methods, while also discussing the relationship between classical control and RL. It further reviews the wide-ranging applications of RL in process industries, such as soft sensors, low-level control, high-level control, distributed process control, fault detection and fault tolerant control, optimization,planning, scheduling, and supply chain. The survey paper discusses the limitations and advantages, trends and new applications, and opportunities and future prospects for RL in process industries. Moreover, it highlights the need for a holistic approach in complex systems due to the growing importance of digitalization in the process industries.展开更多
A real-time adaptive roles allocation method based on reinforcement learning is proposed to improve humanrobot cooperation performance for a curtain wall installation task.This method breaks the traditional idea that ...A real-time adaptive roles allocation method based on reinforcement learning is proposed to improve humanrobot cooperation performance for a curtain wall installation task.This method breaks the traditional idea that the robot is regarded as the follower or only adjusts the leader and the follower in cooperation.In this paper,a self-learning method is proposed which can dynamically adapt and continuously adjust the initiative weight of the robot according to the change of the task.Firstly,the physical human-robot cooperation model,including the role factor is built.Then,a reinforcement learningmodel that can adjust the role factor in real time is established,and a reward and actionmodel is designed.The role factor can be adjusted continuously according to the comprehensive performance of the human-robot interaction force and the robot’s Jerk during the repeated installation.Finally,the roles adjustment rule established above continuously improves the comprehensive performance.Experiments of the dynamic roles allocation and the effect of the performance weighting coefficient on the result have been verified.The results show that the proposed method can realize the role adaptation and achieve the dual optimization goal of reducing the sum of the cooperator force and the robot’s Jerk.展开更多
In the rapidly evolving landscape of today’s digital economy,Financial Technology(Fintech)emerges as a trans-formative force,propelled by the dynamic synergy between Artificial Intelligence(AI)and Algorithmic Trading...In the rapidly evolving landscape of today’s digital economy,Financial Technology(Fintech)emerges as a trans-formative force,propelled by the dynamic synergy between Artificial Intelligence(AI)and Algorithmic Trading.Our in-depth investigation delves into the intricacies of merging Multi-Agent Reinforcement Learning(MARL)and Explainable AI(XAI)within Fintech,aiming to refine Algorithmic Trading strategies.Through meticulous examination,we uncover the nuanced interactions of AI-driven agents as they collaborate and compete within the financial realm,employing sophisticated deep learning techniques to enhance the clarity and adaptability of trading decisions.These AI-infused Fintech platforms harness collective intelligence to unearth trends,mitigate risks,and provide tailored financial guidance,fostering benefits for individuals and enterprises navigating the digital landscape.Our research holds the potential to revolutionize finance,opening doors to fresh avenues for investment and asset management in the digital age.Additionally,our statistical evaluation yields encouraging results,with metrics such as Accuracy=0.85,Precision=0.88,and F1 Score=0.86,reaffirming the efficacy of our approach within Fintech and emphasizing its reliability and innovative prowess.展开更多
基金National Key R&D Program of China(2020AAA0107200)National Natural Science Foundation of China(Grant No.61921006)。
文摘Model-based reinforcement learning is a promising direction to improve the sample efficiency of reinforcement learning with learning a model of the environment.Previous model learning methods aim at fitting the transition data,and commonly employ a supervised learning approach to minimize the distance between the predicted state and the real state.The supervised model learning methods,however,diverge from the ultimate goal of model learning,i.e.,optimizing the learned-in-the-model policy.In this work,we investigate how model learning and policy learning can share the same objective of maximizing the expected return in the real environment.We find model learning towards this objective can result in a target of enhancing the similarity between the gradient on generated data and the gradient on the real data.We thus derive the gradient of the model from this target and propose the Model Gradient algorithm(MG)to integrate this novel model learning approach with policy-gradient-based policy optimization.We conduct experiments on multiple locomotion control tasks and find that MG can not only achieve high sample efficiency but also lead to better convergence performance compared to traditional model-based reinforcement learning approaches.
文摘Fifth-generation(5G)systems have brought about new challenges toward ensuring Quality of Service(QoS)in differentiated services.This includes low latency applications,scalable machine-to-machine communication,and enhanced mobile broadband connectivity.In order to satisfy these requirements,the concept of network slicing has been introduced to generate slices of the network with specific characteristics.In order to meet the requirements of network slices,routers and switches must be effectively configured to provide priority queue provisioning,resource contention management and adaptation.Configuring routers from vendors,such as Ericsson,Cisco,and Juniper,have traditionally been an expert-driven process with static rules for individual flows,which are prone to sub optimal configurations with varying traffic conditions.In this paper,we model the internal ingress and egress queues within routers via a queuing model.The effects of changing queue configuration with respect to priority,weights,flow limits,and packet drops are studied in detail.This is used to train a model-based Reinforcement Learning(RL)algorithm to generate optimal policies for flow prioritization,fairness,and congestion control.The efficacy of the RL policy output is demonstrated over scenarios involving ingress queue traffic policing,egress queue traffic shaping,and one-hop router coordinated traffic conditioning.This is evaluated over a real application use case,wherein a statically configured router proved sub optimal toward desired QoS requirements.Such automated configuration of routers and switches will be critical for multiple 5G deployments with varying flow requirements and traffic patterns.
基金supported in part by the National Natural Science Foundation of China(62222301, 62073085, 62073158, 61890930-5, 62021003)the National Key Research and Development Program of China (2021ZD0112302, 2021ZD0112301, 2018YFC1900800-5)Beijing Natural Science Foundation (JQ19013)。
文摘Reinforcement learning(RL) has roots in dynamic programming and it is called adaptive/approximate dynamic programming(ADP) within the control community. This paper reviews recent developments in ADP along with RL and its applications to various advanced control fields. First, the background of the development of ADP is described, emphasizing the significance of regulation and tracking control problems. Some effective offline and online algorithms for ADP/adaptive critic control are displayed, where the main results towards discrete-time systems and continuous-time systems are surveyed, respectively.Then, the research progress on adaptive critic control based on the event-triggered framework and under uncertain environment is discussed, respectively, where event-based design, robust stabilization, and game design are reviewed. Moreover, the extensions of ADP for addressing control problems under complex environment attract enormous attention. The ADP architecture is revisited under the perspective of data-driven and RL frameworks,showing how they promote ADP formulation significantly.Finally, several typical control applications with respect to RL and ADP are summarized, particularly in the fields of wastewater treatment processes and power systems, followed by some general prospects for future research. Overall, the comprehensive survey on ADP and RL for advanced control applications has d emonstrated its remarkable potential within the artificial intelligence era. In addition, it also plays a vital role in promoting environmental protection and industrial intelligence.
基金supported in part by NSFC (62102099, U22A2054, 62101594)in part by the Pearl River Talent Recruitment Program (2021QN02S643)+9 种基金Guangzhou Basic Research Program (2023A04J1699)in part by the National Research Foundation, SingaporeInfocomm Media Development Authority under its Future Communications Research Development ProgrammeDSO National Laboratories under the AI Singapore Programme under AISG Award No AISG2-RP-2020-019Energy Research Test-Bed and Industry Partnership Funding Initiative, Energy Grid (EG) 2.0 programmeDesCartes and the Campus for Research Excellence and Technological Enterprise (CREATE) programmeMOE Tier 1 under Grant RG87/22in part by the Singapore University of Technology and Design (SUTD) (SRG-ISTD-2021- 165)in part by the SUTD-ZJU IDEA Grant SUTD-ZJU (VP) 202102in part by the Ministry of Education, Singapore, through its SUTD Kickstarter Initiative (SKI 20210204)。
文摘Avatars, as promising digital representations and service assistants of users in Metaverses, can enable drivers and passengers to immerse themselves in 3D virtual services and spaces of UAV-assisted vehicular Metaverses. However, avatar tasks include a multitude of human-to-avatar and avatar-to-avatar interactive applications, e.g., augmented reality navigation,which consumes intensive computing resources. It is inefficient and impractical for vehicles to process avatar tasks locally. Fortunately, migrating avatar tasks to the nearest roadside units(RSU)or unmanned aerial vehicles(UAV) for execution is a promising solution to decrease computation overhead and reduce task processing latency, while the high mobility of vehicles brings challenges for vehicles to independently perform avatar migration decisions depending on current and future vehicle status. To address these challenges, in this paper, we propose a novel avatar task migration system based on multi-agent deep reinforcement learning(MADRL) to execute immersive vehicular avatar tasks dynamically. Specifically, we first formulate the problem of avatar task migration from vehicles to RSUs/UAVs as a partially observable Markov decision process that can be solved by MADRL algorithms. We then design the multi-agent proximal policy optimization(MAPPO) approach as the MADRL algorithm for the avatar task migration problem. To overcome slow convergence resulting from the curse of dimensionality and non-stationary issues caused by shared parameters in MAPPO, we further propose a transformer-based MAPPO approach via sequential decision-making models for the efficient representation of relationships among agents. Finally, to motivate terrestrial or non-terrestrial edge servers(e.g., RSUs or UAVs) to share computation resources and ensure traceability of the sharing records, we apply smart contracts and blockchain technologies to achieve secure sharing management. Numerical results demonstrate that the proposed approach outperforms the MAPPO approach by around 2% and effectively reduces approximately 20% of the latency of avatar task execution in UAV-assisted vehicular Metaverses.
基金National Natural Science Foundation of China(61973037)National 173 Program Project(2019-JCJQ-ZD-324).
文摘To solve the problem of the low interference success rate of air defense missile radio fuzes due to the unified interference form of the traditional fuze interference system,an interference decision method based Q-learning algorithm is proposed.First,dividing the distance between the missile and the target into multiple states to increase the quantity of state spaces.Second,a multidimensional motion space is utilized,and the search range of which changes with the distance of the projectile,to select parameters and minimize the amount of ineffective interference parameters.The interference effect is determined by detecting whether the fuze signal disappears.Finally,a weighted reward function is used to determine the reward value based on the range state,output power,and parameter quantity information of the interference form.The effectiveness of the proposed method in selecting the range of motion space parameters and designing the discrimination degree of the reward function has been verified through offline experiments involving full-range missile rendezvous.The optimal interference form for each distance state has been obtained.Compared with the single-interference decision method,the proposed decision method can effectively improve the success rate of interference.
基金supported by the Beijing Academy of Quantum Information Sciencessupported by the National Natural Science Foundation of China(Grant No.92365206)+2 种基金the support of the China Postdoctoral Science Foundation(Certificate Number:2023M740272)supported by the National Natural Science Foundation of China(Grant No.12247168)China Postdoctoral Science Foundation(Certificate Number:2022TQ0036)。
文摘With the rapid advancement of quantum computing,hybrid quantum–classical machine learning has shown numerous potential applications at the current stage,with expectations of being achievable in the noisy intermediate-scale quantum(NISQ)era.Quantum reinforcement learning,as an indispensable study,has recently demonstrated its ability to solve standard benchmark environments with formally provable theoretical advantages over classical counterparts.However,despite the progress of quantum processors and the emergence of quantum computing clouds,implementing quantum reinforcement learning algorithms utilizing parameterized quantum circuits(PQCs)on NISQ devices remains infrequent.In this work,we take the first step towards executing benchmark quantum reinforcement problems on real devices equipped with at most 136 qubits on the BAQIS Quafu quantum computing cloud.The experimental results demonstrate that the policy agents can successfully accomplish objectives under modified conditions in both the training and inference phases.Moreover,we design hardware-efficient PQC architectures in the quantum model using a multi-objective evolutionary algorithm and develop a learning algorithm that is adaptable to quantum devices.We hope that the Quafu-RL can be a guiding example to show how to realize machine learning tasks by taking advantage of quantum computers on the quantum cloud platform.
基金supported in part by the Start-Up Grant-Nanyang Assistant Professorship Grant of Nanyang Technological Universitythe Agency for Science,Technology and Research(A*STAR)under Advanced Manufacturing and Engineering(AME)Young Individual Research under Grant(A2084c0156)+2 种基金the MTC Individual Research Grant(M22K2c0079)the ANR-NRF Joint Grant(NRF2021-NRF-ANR003 HM Science)the Ministry of Education(MOE)under the Tier 2 Grant(MOE-T2EP50222-0002)。
文摘While autonomous vehicles are vital components of intelligent transportation systems,ensuring the trustworthiness of decision-making remains a substantial challenge in realizing autonomous driving.Therefore,we present a novel robust reinforcement learning approach with safety guarantees to attain trustworthy decision-making for autonomous vehicles.The proposed technique ensures decision trustworthiness in terms of policy robustness and collision safety.Specifically,an adversary model is learned online to simulate the worst-case uncertainty by approximating the optimal adversarial perturbations on the observed states and environmental dynamics.In addition,an adversarial robust actor-critic algorithm is developed to enable the agent to learn robust policies against perturbations in observations and dynamics.Moreover,we devise a safety mask to guarantee the collision safety of the autonomous driving agent during both the training and testing processes using an interpretable knowledge model known as the Responsibility-Sensitive Safety Model.Finally,the proposed approach is evaluated through both simulations and experiments.These results indicate that the autonomous driving agent can make trustworthy decisions and drastically reduce the number of collisions through robust safety policies.
文摘Contract Bridge,a four-player imperfect information game,comprises two phases:bidding and playing.While computer programs excel at playing,bidding presents a challenging aspect due to the need for information exchange with partners and interference with communication of opponents.In this work,we introduce a Bridge bidding agent that combines supervised learning,deep reinforcement learning via self-play,and a test-time search approach.Our experiments demonstrate that our agent outperforms WBridge5,a highly regarded computer Bridge software that has won multiple world championships,by a performance of 0.98 IMPs(international match points)per deal over 10000 deals,with a much cost-effective approach.The performance significantly surpasses previous state-of-the-art(0.85 IMPs per deal).Note 0.1 IMPs per deal is a significant improvement in Bridge bidding.
基金funded by the Science and Technology Foundation of State Grid Corporation of China(Grant No.5108-202218280A-2-397-XG).
文摘This paper focuses on the scheduling problem of workflow tasks that exhibit interdependencies.Unlike indepen-dent batch tasks,workflows typically consist of multiple subtasks with intrinsic correlations and dependencies.It necessitates the distribution of various computational tasks to appropriate computing node resources in accor-dance with task dependencies to ensure the smooth completion of the entire workflow.Workflow scheduling must consider an array of factors,including task dependencies,availability of computational resources,and the schedulability of tasks.Therefore,this paper delves into the distributed graph database workflow task scheduling problem and proposes a workflow scheduling methodology based on deep reinforcement learning(DRL).The method optimizes the maximum completion time(makespan)and response time of workflow tasks,aiming to enhance the responsiveness of workflow tasks while ensuring the minimization of the makespan.The experimental results indicate that the Q-learning Deep Reinforcement Learning(Q-DRL)algorithm markedly diminishes the makespan and refines the average response time within distributed graph database environments.In quantifying makespan,Q-DRL achieves mean reductions of 12.4%and 11.9%over established First-fit and Random scheduling strategies,respectively.Additionally,Q-DRL surpasses the performance of both DRL-Cloud and Improved Deep Q-learning Network(IDQN)algorithms,with improvements standing at 4.4%and 2.6%,respectively.With reference to average response time,the Q-DRL approach exhibits a significantly enhanced performance in the scheduling of workflow tasks,decreasing the average by 2.27%and 4.71%when compared to IDQN and DRL-Cloud,respectively.The Q-DRL algorithm also demonstrates a notable increase in the efficiency of system resource utilization,reducing the average idle rate by 5.02%and 9.30%in comparison to IDQN and DRL-Cloud,respectively.These findings support the assertion that Q-DRL not only upholds a lower average idle rate but also effectively curtails the average response time,thereby substantially improving processing efficiency and optimizing resource utilization within distributed graph database systems.
基金State Grid Corporation of China Science and Technology Project“Research andApplication of Key Technologies for Trusted Issuance and Security Control of Electronic Licenses for Power Business”(5700-202353318A-1-1-ZN).
文摘To enhance the efficiency and expediency of issuing e-licenses within the power sector, we must confront thechallenge of managing the surging demand for data traffic. Within this realm, the network imposes stringentQuality of Service (QoS) requirements, revealing the inadequacies of traditional routing allocation mechanismsin accommodating such extensive data flows. In response to the imperative of handling a substantial influx of datarequests promptly and alleviating the constraints of existing technologies and network congestion, we present anarchitecture forQoS routing optimizationwith in SoftwareDefinedNetwork (SDN), leveraging deep reinforcementlearning. This innovative approach entails the separation of SDN control and transmission functionalities, centralizingcontrol over data forwardingwhile integrating deep reinforcement learning for informed routing decisions. Byfactoring in considerations such as delay, bandwidth, jitter rate, and packet loss rate, we design a reward function toguide theDeepDeterministic PolicyGradient (DDPG) algorithmin learning the optimal routing strategy to furnishsuperior QoS provision. In our empirical investigations, we juxtapose the performance of Deep ReinforcementLearning (DRL) against that of Shortest Path (SP) algorithms in terms of data packet transmission delay. Theexperimental simulation results show that our proposed algorithm has significant efficacy in reducing networkdelay and improving the overall transmission efficiency, which is superior to the traditional methods.
基金supported by National Natural Science Foundation of China(No.61871283)the Foundation of Pre-Research on Equipment of China(No.61400010304)Major Civil-Military Integration Project in Tianjin,China(No.18ZXJMTG00170).
文摘The development of communication technology will promote the application of Internet of Things,and Beyond 5G will become a new technology promoter.At the same time,Beyond 5G will become one of the important supports for the development of edge computing technology.This paper proposes a communication task allocation algorithm based on deep reinforcement learning for vehicle-to-pedestrian communication scenarios in edge computing.Through trial and error learning of agent,the optimal spectrum and power can be determined for transmission without global information,so as to balance the communication between vehicle-to-pedestrian and vehicle-to-infrastructure.The results show that the agent can effectively improve vehicle-to-infrastructure communication rate as well as meeting the delay constraints on the vehicle-to-pedestrian link.
基金The financial support provided by the Project of National Natural Science Foundation of China(U22A20415,21978256,22308314)“Pioneer”and“Leading Goose”Research&Development Program of Zhejiang(2022C01SA442617)。
文摘Heat integration is important for energy-saving in the process industry.It is linked to the persistently challenging task of optimal design of heat exchanger networks(HEN).Due to the inherent highly nonconvex nonlinear and combinatorial nature of the HEN problem,it is not easy to find solutions of high quality for large-scale problems.The reinforcement learning(RL)method,which learns strategies through ongoing exploration and exploitation,reveals advantages in such area.However,due to the complexity of the HEN design problem,the RL method for HEN should be dedicated and designed.A hybrid strategy combining RL with mathematical programming is proposed to take better advantage of both methods.An insightful state representation of the HEN structure as well as a customized reward function is introduced.A Q-learning algorithm is applied to update the HEN structure using theε-greedy strategy.Better results are obtained from three literature cases of different scales.
基金supported by National Key Research & Development Program-Intergovernmental International Science and Technology Innovation Cooperation Project (2021YFE0112800)National Natural Science Foundation of China (Key Program: 62136003)+2 种基金National Natural Science Foundation of China (62073142)Fundamental Research Funds for the Central Universities (222202417006)Shanghai Al Lab
文摘The gasoline inline blending process has widely used real-time optimization techniques to achieve optimization objectives,such as minimizing the cost of production.However,the effectiveness of real-time optimization in gasoline blending relies on accurate blending models and is challenged by stochastic disturbances.Thus,we propose a real-time optimization algorithm based on the soft actor-critic(SAC)deep reinforcement learning strategy to optimize gasoline blending without relying on a single blending model and to be robust against disturbances.Our approach constructs the environment using nonlinear blending models and feedstocks with disturbances.The algorithm incorporates the Lagrange multiplier and path constraints in reward design to manage sparse product constraints.Carefully abstracted states facilitate algorithm convergence,and the normalized action vector in each optimization period allows the agent to generalize to some extent across different target production scenarios.Through these well-designed components,the algorithm based on the SAC outperforms real-time optimization methods based on either nonlinear or linear programming.It even demonstrates comparable performance with the time-horizon based real-time optimization method,which requires knowledge of uncertainty models,confirming its capability to handle uncertainty without accurate models.Our simulation illustrates a promising approach to free real-time optimization of the gasoline blending process from uncertainty models that are difficult to acquire in practice.
基金Supported by National Natural Science Foundation of China (Grant Nos.52222215,52072051)Fundamental Research Funds for the Central Universities in China (Grant No.2023CDJXY-025)Chongqing Municipal Natural Science Foundation of China (Grant No.CSTB2023NSCQ-JQX0003)。
文摘The new energy vehicle plays a crucial role in green transportation,and the energy management strategy of hybrid power systems is essential for ensuring energy-efficient driving.This paper presents a state-of-the-art survey and review of reinforcement learning-based energy management strategies for hybrid power systems.Additionally,it envisions the outlook for autonomous intelligent hybrid electric vehicles,with reinforcement learning as the foundational technology.First of all,to provide a macro view of historical development,the brief history of deep learning,reinforcement learning,and deep reinforcement learning is presented in the form of a timeline.Then,the comprehensive survey and review are conducted by collecting papers from mainstream academic databases.Enumerating most of the contributions based on three main directions—algorithm innovation,powertrain innovation,and environment innovation—provides an objective review of the research status.Finally,to advance the application of reinforcement learning in autonomous intelligent hybrid electric vehicles,future research plans positioned as“Alpha HEV”are envisioned,integrating Autopilot and energy-saving control.
基金supported by Guangdong Basic and Applied Basic Research Foundation under Grant 2024A1515012015supported in part by the National Natural Science Foundation of China under Grant 62201336+4 种基金in part by Guangdong Basic and Applied Basic Research Foundation under Grant 2024A1515011541supported in part by the National Natural Science Foundation of China under Grant 62371344in part by the Fundamental Research Funds for the Central Universitiessupported in part by Knowledge Innovation Program of Wuhan-Shuguang Project under Grant 2023010201020316in part by Guangdong Basic and Applied Basic Research Foundation under Grant 2024A1515010247。
文摘In recent times,various power control and clustering approaches have been proposed to enhance overall performance for cell-free massive multipleinput multiple-output(CF-mMIMO)networks.With the emergence of deep reinforcement learning(DRL),significant progress has been made in the field of network optimization as DRL holds great promise for improving network performance and efficiency.In this work,our focus delves into the intricate challenge of joint cooperation clustering and downlink power control within CF-mMIMO networks.Leveraging the potent deep deterministic policy gradient(DDPG)algorithm,our objective is to maximize the proportional fairness(PF)for user rates,thereby aiming to achieve optimal network performance and resource utilization.Moreover,we harness the concept of“divide and conquer”strategy,introducing two innovative methods termed alternating DDPG(A-DDPG)and hierarchical DDPG(H-DDPG).These approaches aim to decompose the intricate joint optimization problem into more manageable sub-problems,thereby facilitating a more efficient resolution process.Our findings unequivo-cally showcase the superior efficacy of our proposed DDPG approach over the baseline schemes in both clustering and downlink power control.Furthermore,the A-DDPG and H-DDPG obtain higher performance gain than DDPG with lower computational complexity.
基金supported by the Talent Fund of Beijing Jiaotong University(No.2023XKRC028)CCFLenovo Blue Ocean Research Fund and Beijing Natural Science Foundation under Grant(No.L221003).
文摘Vehicular edge computing(VEC)is emerging as a promising solution paradigm to meet the requirements of compute-intensive applications in internet of vehicle(IoV).Non-orthogonal multiple access(NOMA)has advantages in improving spectrum efficiency and dealing with bandwidth scarcity and cost.It is an encouraging progress combining VEC and NOMA.In this paper,we jointly optimize task offloading decision and resource allocation to maximize the service utility of the NOMA-VEC system.To solve the optimization problem,we propose a multiagent deep graph reinforcement learning algorithm.The algorithm extracts the topological features and relationship information between agents from the system state as observations,outputs task offloading decision and resource allocation simultaneously with local policy network,which is updated by a local learner.Simulation results demonstrate that the proposed method achieves a 1.52%∼5.80%improvement compared with the benchmark algorithms in system service utility.
基金supported in part by the Natural Sciences Engineering Research Council of Canada (NSERC)。
文摘This survey paper provides a review and perspective on intermediate and advanced reinforcement learning(RL)techniques in process industries. It offers a holistic approach by covering all levels of the process control hierarchy. The survey paper presents a comprehensive overview of RL algorithms,including fundamental concepts like Markov decision processes and different approaches to RL, such as value-based, policy-based, and actor-critic methods, while also discussing the relationship between classical control and RL. It further reviews the wide-ranging applications of RL in process industries, such as soft sensors, low-level control, high-level control, distributed process control, fault detection and fault tolerant control, optimization,planning, scheduling, and supply chain. The survey paper discusses the limitations and advantages, trends and new applications, and opportunities and future prospects for RL in process industries. Moreover, it highlights the need for a holistic approach in complex systems due to the growing importance of digitalization in the process industries.
基金The research has been generously supported by Tianjin Education Commission Scientific Research Program(2020KJ056),ChinaTianjin Science and Technology Planning Project(22YDTPJC00970),China.The authors would like to express their sincere appreciation for all support provided.
文摘A real-time adaptive roles allocation method based on reinforcement learning is proposed to improve humanrobot cooperation performance for a curtain wall installation task.This method breaks the traditional idea that the robot is regarded as the follower or only adjusts the leader and the follower in cooperation.In this paper,a self-learning method is proposed which can dynamically adapt and continuously adjust the initiative weight of the robot according to the change of the task.Firstly,the physical human-robot cooperation model,including the role factor is built.Then,a reinforcement learningmodel that can adjust the role factor in real time is established,and a reward and actionmodel is designed.The role factor can be adjusted continuously according to the comprehensive performance of the human-robot interaction force and the robot’s Jerk during the repeated installation.Finally,the roles adjustment rule established above continuously improves the comprehensive performance.Experiments of the dynamic roles allocation and the effect of the performance weighting coefficient on the result have been verified.The results show that the proposed method can realize the role adaptation and achieve the dual optimization goal of reducing the sum of the cooperator force and the robot’s Jerk.
基金This project was funded by Deanship of Scientific Research(DSR)at King Abdulaziz University,Jeddah underGrant No.(IFPIP-1127-611-1443)the authors,therefore,acknowledge with thanks DSR technical and financial support.
文摘In the rapidly evolving landscape of today’s digital economy,Financial Technology(Fintech)emerges as a trans-formative force,propelled by the dynamic synergy between Artificial Intelligence(AI)and Algorithmic Trading.Our in-depth investigation delves into the intricacies of merging Multi-Agent Reinforcement Learning(MARL)and Explainable AI(XAI)within Fintech,aiming to refine Algorithmic Trading strategies.Through meticulous examination,we uncover the nuanced interactions of AI-driven agents as they collaborate and compete within the financial realm,employing sophisticated deep learning techniques to enhance the clarity and adaptability of trading decisions.These AI-infused Fintech platforms harness collective intelligence to unearth trends,mitigate risks,and provide tailored financial guidance,fostering benefits for individuals and enterprises navigating the digital landscape.Our research holds the potential to revolutionize finance,opening doors to fresh avenues for investment and asset management in the digital age.Additionally,our statistical evaluation yields encouraging results,with metrics such as Accuracy=0.85,Precision=0.88,and F1 Score=0.86,reaffirming the efficacy of our approach within Fintech and emphasizing its reliability and innovative prowess.