Abstract: The coronavirus has affected many areas of life, especially the field of education. With the beginning of the pandemic, the transition to online learning began, which shaped how students and teachers use innovative technologies and programs such as Zoom, Webex, Discord, Google Meet, Moodle, EDX, Coursera, www.examus.network, etc. Many teachers therefore wonder whether the online method of teaching is as effective as the offline method. In this article, we focused on finding out whether there is a significant difference in student performance between online and offline modes of learning in the study of mathematics. The study involved first-year college students of Jambyl Innovative Higher College (JICH) in Taraz, Kazakhstan: 58 students studied online and 58 students studied offline. A final assessment, covering all the topics studied, was administered to both groups at the end of week 18. The average scores of students studying offline were compared with those of students studying online, and the researchers also conducted and analyzed an independent t-test. The results showed a significant difference in the academic performance of students who study online and offline: the offline teaching method proved more effective for improving students' understanding and comprehension of mathematics topics.
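The comparison above rests on an independent two-sample t-test. A minimal sketch of that statistic in its pooled-variance form follows; the scores are made up for illustration and are not the study's data:

```python
from math import sqrt

def independent_t_test(a, b):
    """Two-sample t-test with pooled variance (equal-variance form).

    Returns the t statistic and the degrees of freedom.
    """
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    # Sample variances (ddof = 1).
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    # Pooled variance across both groups.
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    t = (mean_a - mean_b) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2

offline = [80, 85, 90, 95]  # hypothetical offline-group scores
online = [70, 75, 80, 85]   # hypothetical online-group scores
t, df = independent_t_test(offline, online)
```

The resulting t statistic is compared against the t distribution with `df` degrees of freedom to decide significance; in practice a library routine such as SciPy's `ttest_ind` would also return the p-value directly.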
Funding: Supported by the National Key R&D Program of China under Grant No. 2021ZD0113203 and the National Science Foundation of China under Grant No. 61976115.
Abstract: Offline reinforcement learning (ORL) aims to learn a rational agent purely from behavior data, without any online interaction. One of the major challenges in ORL is distribution shift, i.e., the mismatch between the knowledge of the learned policy and the reality of the underlying environment. Recent works usually handle this in an overly pessimistic manner, avoiding out-of-distribution (OOD) queries as much as possible, but this can hurt the robustness of the agent at unseen states. In this paper, we propose a simple but effective method to address this issue. The key idea is to enhance the robustness of the new policy learned offline by weakening its confidence in highly uncertain regions. We propose to find those regions by simulating them with a modified Generative Adversarial Net (GAN), such that the generated data not only follow the same distribution as the old experience but are also difficult for the behavior policy, or some other reference policy, to handle. We then use this information to regularize the ORL algorithm, penalizing overconfident behavior in these regions. Extensive experiments on several publicly available offline RL benchmarks demonstrate the feasibility and effectiveness of the proposed method.
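The idea of weakening confidence in hard but in-distribution regions can be shown schematically. The sketch below assumes hypothetical realism scores (e.g., from a GAN discriminator) and value estimates for generated states; the paper's modified GAN and regularizer are not reproduced here:

```python
def flag_uncertain(samples, realism_min=0.5, value_max=0.0):
    """Keep generated states that look in-distribution (high realism)
    yet are hard for the behavior/reference policy (low value)."""
    return {sid for sid, realism, value in samples
            if realism >= realism_min and value <= value_max}

def penalized_q(q, state_id, uncertain_ids, penalty=1.0):
    """Weaken confidence: lower the value estimate at flagged states."""
    return q[state_id] - (penalty if state_id in uncertain_ids else 0.0)

# (state_id, realism, value) triples -- illustrative numbers only.
samples = [("s0", 0.9, -1.0), ("s1", 0.2, -5.0), ("s2", 0.8, 2.0)]
uncertain = flag_uncertain(samples)
```

Here "s1" is rejected as unrealistic and "s2" as already easy, so only the realistic-but-difficult "s0" gets its value penalized during learning.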
Funding: Supported by the National Natural Science Foundation of China under Grant 52077146.
Abstract: With the construction of the power Internet of Things (IoT), communication between smart devices in urban distribution networks has been gradually moving towards high speed, high compatibility, and low latency, which provides reliable support for reconfiguration optimization in urban distribution networks. This study therefore proposed a deep reinforcement learning based multi-level dynamic reconfiguration method for urban distribution networks in a cloud-edge collaboration architecture, to obtain a real-time optimal multi-level dynamic reconfiguration solution. First, the multi-level dynamic reconfiguration method was discussed, covering the feeder, transformer, and substation levels. Subsequently, the multi-agent system was combined with the cloud-edge collaboration architecture to build a deep reinforcement learning model for multi-level dynamic reconfiguration in an urban distribution network. The cloud-edge collaboration architecture effectively supports the multi-agent system's "centralized training and decentralized execution" operation mode and improves the learning efficiency of the model. Thereafter, the study adopted a combination of offline and online learning to give the model the ability to automatically optimize and update its strategy. In the offline learning phase, a multi-agent conservative Q-learning (MACQL) algorithm was proposed to stabilize the learning results and reduce the risk of the subsequent online learning phase. In the online learning phase, a multi-agent deep deterministic policy gradient (MADDPG) algorithm based on policy gradients was proposed to explore the action space and update the experience pool. Finally, the effectiveness of the proposed method was verified through a simulation analysis of a real-world 445-node system.
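Conservative Q-learning, the core of the offline phase, can be sketched in single-agent tabular form. The paper's MACQL is multi-agent and uses function approximation; the toy update below only illustrates how out-of-dataset actions become pessimistic:

```python
def conservative_sweep(q, actions, dataset, alpha=0.5, gamma=0.9, lr=0.1):
    """One tabular sweep: a TD update on dataset transitions plus a
    CQL-style regularizer that pushes down every action's value and
    restores the dataset action, leaving unseen actions pessimistic."""
    for s, a, r, s_next in dataset:
        target = r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += lr * (target - q[(s, a)])
        for b in actions:           # penalize all actions...
            q[(s, b)] -= lr * alpha
        q[(s, a)] += lr * alpha     # ...but not the one seen in the data
    return q

q = {(s, a): 0.0 for s in ("s0", "s1") for a in (0, 1)}
q = conservative_sweep(q, (0, 1), [("s0", 0, 1.0, "s1")])
```

After one sweep, the action present in the dataset keeps its TD-updated value while the unseen action at the same state is pushed below zero, which is the property that reduces risk when online learning resumes.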
Funding: Supported by the Science and Technology Innovation 2030 New Generation Artificial Intelligence Major Project under Grant No. 2021ZD0113303, the National Natural Science Foundation of China under Grant Nos. 62192783 and 62276128, and in part by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Abstract: At present, the parameters of radar detection rely heavily on manual adjustment and empirical knowledge, resulting in low automation. Traditional manual adjustment methods cannot meet the requirements of modern radars for high efficiency, high precision, and high automation, so it is necessary to explore a new intelligent radar control learning framework and technology to improve the capability and automation of radar detection. Reinforcement learning is popular for decision task learning, but the shortage of samples in radar control tasks makes it difficult to meet its data requirements. To address these issues, we propose a practical radar operation reinforcement learning framework that integrates offline reinforcement learning and meta-reinforcement learning to alleviate the sample requirements. Experimental results show that our method can automatically perform radar detection as human operators do in real-world settings, thereby promoting the practical application of reinforcement learning in radar operation.
Abstract: Reinforcement learning (RL) has emerged as a promising data-driven solution for wargaming decision-making. However, two domain challenges still exist: (1) dealing with discrete-continuous hybrid wargaming control and (2) accelerating RL deployment with rich offline data. Existing RL methods fail to handle these two issues simultaneously, so we propose a novel offline RL method targeting hybrid action spaces. A new constrained action representation technique is developed to build a bidirectional mapping between the original hybrid action space and a latent space in a semantically consistent way. This allows learning a continuous latent policy with offline RL, with better exploration feasibility and scalability, and then reconstructing it into the required hybrid policy. Critically, a novel offline RL optimization objective with adaptively adjusted constraints is designed to balance the alleviation and generalization of out-of-distribution actions. Our method demonstrates superior performance and generality across different tasks, particularly in typical realistic wargaming scenarios.
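A bidirectional mapping for hybrid actions can be illustrated with a fixed (not learned) layout: one-hot encode the discrete part and concatenate the continuous parameters. This is only a hypothetical stand-in for the paper's learned constrained representation:

```python
def encode_hybrid(action, n_discrete):
    """(discrete id, continuous params) -> flat vector."""
    d, params = action
    one_hot = [1.0 if i == d else 0.0 for i in range(n_discrete)]
    return one_hot + list(params)

def decode_hybrid(vec, n_discrete):
    """Flat vector -> (discrete id, continuous params)."""
    one_hot, params = vec[:n_discrete], vec[n_discrete:]
    d = max(range(n_discrete), key=lambda i: one_hot[i])
    return d, params

vec = encode_hybrid((2, [0.5, -0.3]), n_discrete=4)
```

Because the mapping round-trips exactly, a policy can act in the flat continuous space while the environment still receives a well-formed hybrid action; the paper's contribution is learning such a mapping so that nearby latent points decode to semantically similar hybrid actions.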
Funding: Supported by the National Key R&D Program of China (No. 2022ZD0116402) and the National Natural Science Foundation of China (No. 62106172).
Abstract: Offline reinforcement learning (RL) is a data-driven learning paradigm for sequential decision making. At its core lies the problem of mitigating the overestimation of values at out-of-distribution (OOD) states, which is induced by the distribution shift between the learning policy and the previously collected offline dataset. To tackle this problem, some methods underestimate the values of states given by learned dynamics models, or of state-action pairs with actions sampled from policies other than the behavior policy. However, since these generated states or state-action pairs are not guaranteed to be OOD, staying conservative on them may adversely affect the in-distribution ones. In this paper, we propose an OOD state-conservative offline RL method (OSCAR), which addresses this limitation by explicitly generating reliable OOD states located near the manifold of the offline dataset, and then designs a conservative policy evaluation approach that combines the vanilla Bellman error with a regularization term that underestimates the values of only these generated OOD states. In this way, we prevent the value errors of OOD states from propagating to in-distribution states through value bootstrapping and policy improvement. We also theoretically prove that the proposed conservative policy evaluation approach is guaranteed to underestimate the values of OOD states. OSCAR, along with several strong baselines, is evaluated on the offline decision-making benchmark D4RL and the autonomous driving benchmark SMARTS. Experimental results show that OSCAR outperforms the baselines on a large portion of the benchmarks and attains the highest average return, substantially outperforming existing offline RL methods.
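The notion of "reliable OOD states near the manifold of the dataset" can be made concrete with a simple distance test: a state qualifies if it lies outside the data but not far from it. OSCAR generates such states rather than filtering for them, and the thresholds below are purely illustrative:

```python
from math import sqrt

def min_distance(state, dataset_states):
    """Euclidean distance from a state to its nearest dataset state."""
    return min(sqrt(sum((a - b) ** 2 for a, b in zip(state, s)))
               for s in dataset_states)

def near_manifold_ood(state, dataset_states, lo=0.1, hi=1.0):
    """OOD (farther than lo) yet near the data manifold (within hi)."""
    d = min_distance(state, dataset_states)
    return lo < d <= hi

data = [(0.0, 0.0), (1.0, 0.0)]
```

States inside the data get no extra pessimism, distant states are considered unreliable to reason about, and the narrow band in between is exactly where the regularization term applies its underestimation.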
Funding: Linghui Meng was supported in part by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA27030300); Haifeng Zhang was supported in part by the National Natural Science Foundation of China (No. 62206289).
Abstract: Offline reinforcement learning leverages previously collected datasets to learn optimal policies without having to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the combinatorially increased interactions among agents and with the environment. However, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate this research by providing large-scale datasets and using them to examine the decision transformer in the context of MARL. We investigate the generalization of MARL offline pre-training in three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset, with diverse quality levels, based on the StarCraft II environment, and then propose the novel multi-agent decision transformer (MADT) architecture for effective offline learning. MADT leverages the transformer's ability for sequence modelling and integrates it seamlessly with both offline and online MARL tasks. A significant benefit of MADT is that it learns generalizable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms state-of-the-art offline reinforcement learning (RL) baselines, including BCQ and CQL. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalizability enhancements for MARL.
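Decision-transformer-style models treat RL as sequence modelling over (return-to-go, state, action) tokens. A minimal sketch of how those tokens are built from a trajectory follows; the layout is the standard decision-transformer convention, not MADT's exact implementation:

```python
def returns_to_go(rewards):
    """Suffix sums of rewards: the return-to-go value that conditions a
    decision transformer at each timestep."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return list(reversed(rtg))

def build_tokens(rewards, states, actions):
    """Interleave (return-to-go, state, action) triples, the sequence a
    decision-transformer-style model consumes autoregressively."""
    return list(zip(returns_to_go(rewards), states, actions))

tokens = build_tokens([1.0, 0.0, 2.0], ["s0", "s1", "s2"], ["a0", "a1", "a2"])
```

At inference time the first return-to-go token is set to a desired target return, and the model predicts actions conditioned on achieving it.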
Funding: This work has been supported by KAKENHI Grant Number 20288837.
Abstract: Service composition is an important and effective technique that enables atomic services to be combined into a more powerful service, i.e., a composite service. With the pervasiveness of the Internet and the proliferation of interconnected computing devices, it is essential that service composition embraces an adaptive service provisioning perspective. Reinforcement learning has emerged as a powerful tool to compose and adapt Web services in open and dynamic environments. However, the most common reinforcement learning algorithms are relatively inefficient in their use of interaction experience data, which may affect the stability of the learning process when deployed to cloud environments; in particular, they make just one learning update for each interaction experience. This paper introduces a novel approach that achieves greater data efficiency by saving the experience data and using it in aggregate to update the learned policy. The proposed approach devises an offline learning scheme for cloud service composition in which the online learning task is transformed into a series of supervised learning tasks. A set of algorithms is proposed under this scheme to facilitate efficient service composition in the cloud under various policies and different scenarios. Our experiments show the effectiveness of the proposed approach for composing and adapting cloud services, especially under dynamic environment settings, compared to its online learning counterparts.
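The contrast between one update per experience and aggregate reuse of saved experience can be shown with a toy value estimator; this is only an analogy for the data-efficiency argument, not the paper's composition algorithm:

```python
def online_single_pass(experiences, lr=0.5):
    """One update per (service, reward) experience, then it is discarded."""
    estimate = 0.0
    for _, reward in experiences:
        estimate += lr * (reward - estimate)
    return estimate

def offline_aggregate(experiences, passes=10, lr=0.5):
    """Save all experience and fit it repeatedly, as a supervised task."""
    estimate = 0.0
    for _ in range(passes):
        for _, reward in experiences:
            estimate += lr * (reward - estimate)
    return estimate

one_shot = online_single_pass([("svc", 1.0)] * 3)
reused = offline_aggregate([("svc", 1.0)] * 3)
```

With the same three samples, repeated supervised passes over the stored data converge much closer to the true value (1.0) than the single online pass, which is the data-efficiency gain the offline scheme targets.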
Abstract: Traditional multi-agent deep reinforcement learning suffers from difficulty obtaining rewards, slow convergence, and poor cooperation among agents in the pre-training period, due to the large joint state space and sparse action rewards. This paper therefore discusses the role of demonstration data in multi-agent systems and proposes a multi-agent deep reinforcement learning algorithm with adaptive weighted fusion of demonstration data. The algorithm sets the fusion weights according to performance and uses importance sampling to correct the bias in the mixed sampled data, combining expert data obtained in the simulation environment with a distributed multi-agent reinforcement learning algorithm to address the difficult problem of global exploration and improve the convergence speed. Results in the RoboCup2D soccer simulation environment show that the algorithm improves the agents' ability to hold and shoot the ball, achieving a higher goal-scoring rate and faster convergence relative to demonstration policies and mainstream multi-agent reinforcement learning algorithms.
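The two ingredients named above, performance-based weights for demonstrations and an importance-sampling correction for the mixed buffer, can be sketched as follows. The weighting and correction formulas here are generic, illustrative choices, not the paper's exact ones:

```python
def performance_weights(returns):
    """Weight each demonstration trajectory by its relative return."""
    total = sum(returns)
    return [r / total for r in returns]

def importance_weighted_mean(values, sample_probs, target_probs):
    """Correct for sampling from the mixed demonstration/agent buffer:
    reweight each sample by target_prob / sample_prob, then
    self-normalize (a standard importance sampling estimator)."""
    ratios = [t / s for t, s in zip(target_probs, sample_probs)]
    weighted = sum(v * w for v, w in zip(values, ratios))
    return weighted / sum(ratios)

mean_est = importance_weighted_mean([1.0, 3.0], [0.5, 0.5], [0.25, 0.75])
```

In the example, samples were drawn uniformly (0.5/0.5) but the target distribution favors the second sample (0.25/0.75); the corrected mean, 2.5, matches the expectation under the target distribution, which is the bias-bridging role importance sampling plays in the mixed data.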