In this paper,a new optimal adaptive backstepping control approach for nonlinear systems under deception attacks via reinforcement learning is presented in this paper.The existence of nonlinear terms in the studied sy...In this paper,a new optimal adaptive backstepping control approach for nonlinear systems under deception attacks via reinforcement learning is presented in this paper.The existence of nonlinear terms in the studied system makes it very difficult to design the optimal controller using traditional methods.To achieve optimal control,RL algorithm based on critic–actor architecture is considered for the nonlinear system.Due to the significant security risks of network transmission,the system is vulnerable to deception attacks,which can make all the system state unavailable.By using the attacked states to design coordinate transformation,the harm brought by unknown deception attacks has been overcome.The presented control strategy can ensure that all signals in the closed-loop system are semi-globally ultimately bounded.Finally,the simulation experiment is shown to prove the effectiveness of the strategy.展开更多
In this paper,we present a novel data-driven design method for the human-robot interaction(HRI)system,where a given task is achieved by cooperation between the human and the robot.The presented HRI controller design i...In this paper,we present a novel data-driven design method for the human-robot interaction(HRI)system,where a given task is achieved by cooperation between the human and the robot.The presented HRI controller design is a two-level control design approach consisting of a task-oriented performance optimization design and a plant-oriented impedance controller design.The task-oriented design minimizes the human effort and guarantees the perfect task tracking in the outer-loop,while the plant-oriented achieves the desired impedance from the human to the robot manipulator end-effector in the inner-loop.Data-driven reinforcement learning techniques are used for performance optimization in the outer-loop to assign the optimal impedance parameters.In the inner-loop,a velocity-free filter is designed to avoid the requirement of end-effector velocity measurement.On this basis,an adaptive controller is designed to achieve the desired impedance of the robot manipulator in the task space.The simulation and experiment of a robot manipulator are conducted to verify the efficacy of the presented HRI design framework.展开更多
In this paper,a new algorithm combining the features of bi-direction evolutionary structural optimization(BESO)and reinforcement learning(RL)is proposed for continuum structural topology optimization(STO).In contrast ...In this paper,a new algorithm combining the features of bi-direction evolutionary structural optimization(BESO)and reinforcement learning(RL)is proposed for continuum structural topology optimization(STO).In contrast to conventional approaches which only generate a certain quasi-optimal solution,the goal of the combined method is to provide more quasi-optimal solutions for designers such as the idea of generative design.Two key components were adopted.First,besides sensitivity,value function updated by Monte-Carlo reinforcement learning was utilized to measure the importance of each element,which made the solving process convergent and closer to the optimum.Second,ε-greedy policy added a random perturbation to the main search direction so as to extend the search ability.Finally,the quality and diversity of solutions could be guaranteed by controlling the value of compliance as well as Intersection-over-Union(IoU).Results of several 2D and 3D compliance minimization problems,including a geometrically nonlinear case,show that the combined method is capable of generating a group of good and different solutions that satisfy various possible requirements in engineering design within acceptable computation cost.展开更多
Due to the fading characteristics of wireless channels and the burstiness of data traffic,how to deal with congestion in Ad-hoc networks with effective algorithms is still open and challenging.In this paper,we focus o...Due to the fading characteristics of wireless channels and the burstiness of data traffic,how to deal with congestion in Ad-hoc networks with effective algorithms is still open and challenging.In this paper,we focus on enabling congestion control to minimize network transmission delays through flexible power control.To effectively solve the congestion problem,we propose a distributed cross-layer scheduling algorithm,which is empowered by graph-based multi-agent deep reinforcement learning.The transmit power is adaptively adjusted in real-time by our algorithm based only on local information(i.e.,channel state information and queue length)and local communication(i.e.,information exchanged with neighbors).Moreover,the training complexity of the algorithm is low due to the regional cooperation based on the graph attention network.In the evaluation,we show that our algorithm can reduce the transmission delay of data flow under severe signal interference and drastically changing channel states,and demonstrate the adaptability and stability in different topologies.The method is general and can be extended to various types of topologies.展开更多
There are many proposed policy-improving systems of Reinforcement Learning (RL) agents which are effective in quickly adapting to environmental change by using many statistical methods, such as mixture model of Bayesi...There are many proposed policy-improving systems of Reinforcement Learning (RL) agents which are effective in quickly adapting to environmental change by using many statistical methods, such as mixture model of Bayesian Networks, Mixture Probability and Clustering Distribution, etc. However such methods give rise to the increase of the computational complexity. For another method, the adaptation performance to more complex environments such as multi-layer environments is required. In this study, we used profit-sharing method for the agent to learn its policy, and added a mixture probability into the RL system to recognize changes in the environment and appropriately improve the agent’s policy to adjust to the changing environment. We also introduced a clustering that enables a smaller, suitable selection in order to reduce the computational complexity and simultaneously maintain the system’s performance. The results of experiments presented that the agent successfully learned the policy and efficiently adjusted to the changing in multi-layer environment. Finally, the computational complexity and the decline in effectiveness of the policy improvement were controlled by using our proposed system.展开更多
Based on the existing pivot rules,the simplex method for linear programming is not polynomial in the worst case.Therefore,the optimal pivot of the simplex method is crucial.In this paper,we propose the optimal rule to...Based on the existing pivot rules,the simplex method for linear programming is not polynomial in the worst case.Therefore,the optimal pivot of the simplex method is crucial.In this paper,we propose the optimal rule to find all the shortest pivot paths of the simplex method for linear programming problems based on Monte Carlo tree search.Specifically,we first propose the SimplexPseudoTree to transfer the simplex method into tree search mode while avoiding repeated basis variables.Secondly,we propose four reinforcement learning models with two actions and two rewards to make the Monte Carlo tree search suitable for the simplex method.Thirdly,we set a new action selection criterion to ameliorate the inaccurate evaluation in the initial exploration.It is proved that when the number of vertices in the feasible region is C_(n)^(m),our method can generate all the shortest pivot paths,which is the polynomial of the number of variables.In addition,we experimentally validate that the proposed schedule can avoid unnecessary search and provide the optimal pivot path.Furthermore,this method can provide the best pivot labels for all kinds of supervised learning methods to solve linear programming problems.展开更多
With the rapid development of artificial intelligence in recent years,applying various learning techniques to solve mixed-integer linear programming(MILP)problems has emerged as a burgeoning research domain.Apart from...With the rapid development of artificial intelligence in recent years,applying various learning techniques to solve mixed-integer linear programming(MILP)problems has emerged as a burgeoning research domain.Apart from constructing end-to-end models directly,integrating learning approaches with some modules in the traditional methods for solving MILPs is also a promising direction.The cutting plane method is one of the fundamental algorithms used in modern MILP solvers,and the selection of appropriate cuts from the candidate cuts subset is crucial for enhancing efficiency.Due to the reliance on expert knowledge and problem-specific heuristics,classical cut selection methods are not always transferable and often limit the scalability and generalizability of the cutting plane method.To provide a more efficient and generalizable strategy,we propose a reinforcement learning(RL)framework to enhance cut selection in the solving process of MILPs.Firstly,we design feature vectors to incorporate the inherent properties of MILP and computational information from the solver and represent MILP instances as bipartite graphs.Secondly,we choose the weighted metrics to approximate the proximity of feasible solutions to the convex hull and utilize the learning method to determine the weights assigned to each metric.Thirdly,a graph convolutional neural network is adopted with a self-attention mechanism to predict the value of weighting factors.Finally,we transform the cut selection process into a Markov decision process and utilize RL method to train the model.Extensive experiments are conducted based on a leading open-source MILP solver SCIP.Results on both general and specific datasets validate the effectiveness and efficiency of our proposed approach.展开更多
Unmanned Aerial Vehicles(UAVs)play a vital role in military warfare.In a variety of battlefield mission scenarios,UAVs are required to safely fly to designated locations without human intervention.Therefore,finding a ...Unmanned Aerial Vehicles(UAVs)play a vital role in military warfare.In a variety of battlefield mission scenarios,UAVs are required to safely fly to designated locations without human intervention.Therefore,finding a suitable method to solve the UAV Autonomous Motion Planning(AMP)problem can improve the success rate of UAV missions to a certain extent.In recent years,many studies have used Deep Reinforcement Learning(DRL)methods to address the AMP problem and have achieved good results.From the perspective of sampling,this paper designs a sampling method with double-screening,combines it with the Deep Deterministic Policy Gradient(DDPG)algorithm,and proposes the Relevant Experience Learning-DDPG(REL-DDPG)algorithm.The REL-DDPG algorithm uses a Prioritized Experience Replay(PER)mechanism to break the correlation of continuous experiences in the experience pool,finds the experiences most similar to the current state to learn according to the theory in human education,and expands the influence of the learning process on action selection at the current state.All experiments are applied in a complex unknown simulation environment constructed based on the parameters of a real UAV.The training experiments show that REL-DDPG improves the convergence speed and the convergence result compared to the state-of-the-art DDPG algorithm,while the testing experiments show the applicability of the algorithm and investigate the performance under different parameter conditions.展开更多
In this study,a real-time optimal control approach is proposed using an interactive deep reinforcement learning algorithm for the Moon fuel-optimal landing problem.Considering the remote communication restrictions and...In this study,a real-time optimal control approach is proposed using an interactive deep reinforcement learning algorithm for the Moon fuel-optimal landing problem.Considering the remote communication restrictions and environmental uncertainties,advanced landing control techniques are demanded to meet the high requirements of real-time performance and autonomy in the Moon landing missions.Deep reinforcement learning(DRL)algorithms have been recently developed for real-time optimal control but suffer the obstacles of slow convergence and difficult reward function design.To address these problems,a DRL algorithm is developed using an actor-indirect method architecture to achieve the optimal control of the Moon landing mission.In this DRL algorithm,an indirect method is employed to generate the optimal control actions for the deep neural network(DNN)learning,while the trained DNNs provide good initial guesses for the indirect method to promote the efficiency of training data generation.Through sufficient learning of the state-action relationship,the trained DNNs can approximate the optimal actions and steer the spacecraft to the target in real time.Additionally,a nonlinear feedback controller is developed to improve the terminal landing accuracy.Numerical simulations are given to verify the effectiveness of the proposed DRL algorithm and demonstrate the performance of the developed optimal landing controller.展开更多
Richly formatted documents,such as financial disclosures,scientific articles,government regulations,widely exist on Web.However,since most of these documents are only for public reading,the styling information inside ...Richly formatted documents,such as financial disclosures,scientific articles,government regulations,widely exist on Web.However,since most of these documents are only for public reading,the styling information inside them is usually missing,making them improper or even burdensome to be displayed and edited in different formats and platforms.In this study we formulate the task of document styling restoration as an optimization problem,which aims to identify the styling settings on the document elements,e.g.,lines,table cells,text,so that rendering with the output styling settings results in a document,where each element inside it holds the(closely)exact position with the one in the original document.Considering that each styling setting is a decision,this problem can be transformed as a multi-step decision-making task over all the document elements,and then be solved by reinforcement learning.Specifically,Monte-Carlo Tree Search(MCTS)is leveraged to explore the different styling settings,and the policy function is learnt under the supervision of the delayed rewards.As a case study,we restore the styling information inside tables,where structural and functional data in the documents are usually presented.Experiment shows that,our best reinforcement method successfully restores the stylings in 87.65%of the tables,with 25.75%absolute improvement over the greedymethod.We also discuss the tradeoff between the inference time and restoration success rate,and argue that although the reinforcement methods cannot be used in real-time scenarios,it is suitable for the offline tasks with high-quality requirement.Finally,this model has been applied in a PDF parser to support cross-format display.展开更多
Purpose-This purpose of this paper is to provide an overview of the theoretical background and applications of inverse reinforcement learning(IRL).Design/methodology/approach-Reinforcement learning(RL)techniques provi...Purpose-This purpose of this paper is to provide an overview of the theoretical background and applications of inverse reinforcement learning(IRL).Design/methodology/approach-Reinforcement learning(RL)techniques provide a powerful solution for sequential decision making problems under uncertainty.RL uses an agent equipped with a reward function to find a policy through interactions with a dynamic environment.However,one major assumption of existing RL algorithms is that reward function,the most succinct representation of the designer’s intention,needs to be provided beforehand.In practice,the reward function can be very hard to specify and exhaustive to tune for large and complex problems,and this inspires the development of IRL,an extension of RL,which directly tackles this problem by learning the reward function through expert demonstrations.In this paper,the original IRL algorithms and its close variants,as well as their recent advances are reviewed and compared.Findings-This paper can serve as an introduction guide of fundamental theory and developments,as well as the applications of IRL.Originality/value-This paper surveys the theories and applications of IRL,which is the latest development of RL and has not been done so far.展开更多
As one of the most fundamental topics in reinforcement learning(RL),sample efficiency is essential to the deployment of deep RL algorithms.Unlike most existing exploration methods that sample an action from different ...As one of the most fundamental topics in reinforcement learning(RL),sample efficiency is essential to the deployment of deep RL algorithms.Unlike most existing exploration methods that sample an action from different types of posterior distributions,we focus on the policy sampling process and propose an efficient selective sampling approach to improve sample efficiency by modeling the internal hierarchy of the environment.Specifically,we first employ clustering methods in the policy sampling process to generate an action candidate set.Then we introduce a clustering buffer for modeling the internal hierarchy,which consists of on-policy data,off-policy data,and expert data to evaluate actions from the clusters in the action candidate set in the exploration stage.In this way,our approach is able to take advantage of the supervision information in the expert demonstration data.Experiments on six different continuous locomotion environments demonstrate superior reinforcement learning performance and faster convergence of selective sampling.In particular,on the LGSVL task,our method can reduce the number of convergence steps by 46.7%and the convergence time by 28.5%.Furthermore,our code is open-source for reproducibility.The code is available at https://github.com/Shihwin/SelectiveSampling.展开更多
在基于SDN架构的混合卫星网络上讨论了服务功能链(Service Function Chain,SFC)的可靠性部署问题,首先对SFC可靠性保护的问题进行描述,建立了底层网络与SFC请求模型,然后建立了网络服务功能的可靠性需求模型与低轨卫星链路的可靠性需求...在基于SDN架构的混合卫星网络上讨论了服务功能链(Service Function Chain,SFC)的可靠性部署问题,首先对SFC可靠性保护的问题进行描述,建立了底层网络与SFC请求模型,然后建立了网络服务功能的可靠性需求模型与低轨卫星链路的可靠性需求模型,明确了优化目标与约束条件。接着提出基于可靠性的卫星服务功能链保护方法,包括基于深度强化学习的可靠性保护算法和基于低轨卫星节点与链路可靠性备份算法。实验表明,提出的基于可靠性的卫星服务功能链保护方法能在SDN架构的混合卫星网络上提高SFC请求接受率,减少平均时延,在不同的SFC可靠性需求的条件下也保持较高的请求接受率。展开更多
基金supported in part by the National Key R&D Program of China under Grants 2021YFE0206100in part by the National Natural Science Foundation of China under Grant 62073321+2 种基金in part by National Defense Basic Scientific Research Program JCKY2019203C029in part by the Science and Technology Development Fund,Macao SAR under Grants FDCT-22-009-MISE,0060/2021/A2 and 0015/2020/AMJin part by the financial support from the National Defense Basic Scientific Research Project(JCKY2020130C025).
文摘In this paper,a new optimal adaptive backstepping control approach for nonlinear systems under deception attacks via reinforcement learning is presented in this paper.The existence of nonlinear terms in the studied system makes it very difficult to design the optimal controller using traditional methods.To achieve optimal control,RL algorithm based on critic–actor architecture is considered for the nonlinear system.Due to the significant security risks of network transmission,the system is vulnerable to deception attacks,which can make all the system state unavailable.By using the attacked states to design coordinate transformation,the harm brought by unknown deception attacks has been overcome.The presented control strategy can ensure that all signals in the closed-loop system are semi-globally ultimately bounded.Finally,the simulation experiment is shown to prove the effectiveness of the strategy.
基金This work was supported in part by the National Natural Science Foundation of China(61903028)the Youth Innovation Promotion Association,Chinese Academy of Sciences(2020137)+1 种基金the Lifelong Learning Machines Program from DARPA/Microsystems Technology Officethe Army Research Laboratory(W911NF-18-2-0260).
文摘In this paper,we present a novel data-driven design method for the human-robot interaction(HRI)system,where a given task is achieved by cooperation between the human and the robot.The presented HRI controller design is a two-level control design approach consisting of a task-oriented performance optimization design and a plant-oriented impedance controller design.The task-oriented design minimizes the human effort and guarantees the perfect task tracking in the outer-loop,while the plant-oriented achieves the desired impedance from the human to the robot manipulator end-effector in the inner-loop.Data-driven reinforcement learning techniques are used for performance optimization in the outer-loop to assign the optimal impedance parameters.In the inner-loop,a velocity-free filter is designed to avoid the requirement of end-effector velocity measurement.On this basis,an adaptive controller is designed to achieve the desired impedance of the robot manipulator in the task space.The simulation and experiment of a robot manipulator are conducted to verify the efficacy of the presented HRI design framework.
文摘In this paper,a new algorithm combining the features of bi-direction evolutionary structural optimization(BESO)and reinforcement learning(RL)is proposed for continuum structural topology optimization(STO).In contrast to conventional approaches which only generate a certain quasi-optimal solution,the goal of the combined method is to provide more quasi-optimal solutions for designers such as the idea of generative design.Two key components were adopted.First,besides sensitivity,value function updated by Monte-Carlo reinforcement learning was utilized to measure the importance of each element,which made the solving process convergent and closer to the optimum.Second,ε-greedy policy added a random perturbation to the main search direction so as to extend the search ability.Finally,the quality and diversity of solutions could be guaranteed by controlling the value of compliance as well as Intersection-over-Union(IoU).Results of several 2D and 3D compliance minimization problems,including a geometrically nonlinear case,show that the combined method is capable of generating a group of good and different solutions that satisfy various possible requirements in engineering design within acceptable computation cost.
基金supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) (No.RS-2022-00155885, Artificial Intelligence Convergence Innovation Human Resources Development (Hanyang University ERICA))supported by the National Natural Science Foundation of China under Grant No. 61971264the National Natural Science Foundation of China/Research Grants Council Collaborative Research Scheme under Grant No. 62261160390
文摘Due to the fading characteristics of wireless channels and the burstiness of data traffic,how to deal with congestion in Ad-hoc networks with effective algorithms is still open and challenging.In this paper,we focus on enabling congestion control to minimize network transmission delays through flexible power control.To effectively solve the congestion problem,we propose a distributed cross-layer scheduling algorithm,which is empowered by graph-based multi-agent deep reinforcement learning.The transmit power is adaptively adjusted in real-time by our algorithm based only on local information(i.e.,channel state information and queue length)and local communication(i.e.,information exchanged with neighbors).Moreover,the training complexity of the algorithm is low due to the regional cooperation based on the graph attention network.In the evaluation,we show that our algorithm can reduce the transmission delay of data flow under severe signal interference and drastically changing channel states,and demonstrate the adaptability and stability in different topologies.The method is general and can be extended to various types of topologies.
文摘There are many proposed policy-improving systems of Reinforcement Learning (RL) agents which are effective in quickly adapting to environmental change by using many statistical methods, such as mixture model of Bayesian Networks, Mixture Probability and Clustering Distribution, etc. However such methods give rise to the increase of the computational complexity. For another method, the adaptation performance to more complex environments such as multi-layer environments is required. In this study, we used profit-sharing method for the agent to learn its policy, and added a mixture probability into the RL system to recognize changes in the environment and appropriately improve the agent’s policy to adjust to the changing environment. We also introduced a clustering that enables a smaller, suitable selection in order to reduce the computational complexity and simultaneously maintain the system’s performance. The results of experiments presented that the agent successfully learned the policy and efficiently adjusted to the changing in multi-layer environment. Finally, the computational complexity and the decline in effectiveness of the policy improvement were controlled by using our proposed system.
基金supported by National Key R&D Program of China(Grant No.2021YFA1000403)National Natural Science Foundation of China(Grant No.11991022)+1 种基金the Strategic Priority Research Program of Chinese Academy of Sciences(Grant No.XDA27000000)the Fundamental Research Funds for the Central Universities。
文摘Based on the existing pivot rules,the simplex method for linear programming is not polynomial in the worst case.Therefore,the optimal pivot of the simplex method is crucial.In this paper,we propose the optimal rule to find all the shortest pivot paths of the simplex method for linear programming problems based on Monte Carlo tree search.Specifically,we first propose the SimplexPseudoTree to transfer the simplex method into tree search mode while avoiding repeated basis variables.Secondly,we propose four reinforcement learning models with two actions and two rewards to make the Monte Carlo tree search suitable for the simplex method.Thirdly,we set a new action selection criterion to ameliorate the inaccurate evaluation in the initial exploration.It is proved that when the number of vertices in the feasible region is C_(n)^(m),our method can generate all the shortest pivot paths,which is the polynomial of the number of variables.In addition,we experimentally validate that the proposed schedule can avoid unnecessary search and provide the optimal pivot path.Furthermore,this method can provide the best pivot labels for all kinds of supervised learning methods to solve linear programming problems.
基金supported by the National Key R&D Program of China(Grant No.2022YFB2403400)National Natural Science Foundation of China(Grant Nos.11991021 and 12021001)。
文摘With the rapid development of artificial intelligence in recent years,applying various learning techniques to solve mixed-integer linear programming(MILP)problems has emerged as a burgeoning research domain.Apart from constructing end-to-end models directly,integrating learning approaches with some modules in the traditional methods for solving MILPs is also a promising direction.The cutting plane method is one of the fundamental algorithms used in modern MILP solvers,and the selection of appropriate cuts from the candidate cuts subset is crucial for enhancing efficiency.Due to the reliance on expert knowledge and problem-specific heuristics,classical cut selection methods are not always transferable and often limit the scalability and generalizability of the cutting plane method.To provide a more efficient and generalizable strategy,we propose a reinforcement learning(RL)framework to enhance cut selection in the solving process of MILPs.Firstly,we design feature vectors to incorporate the inherent properties of MILP and computational information from the solver and represent MILP instances as bipartite graphs.Secondly,we choose the weighted metrics to approximate the proximity of feasible solutions to the convex hull and utilize the learning method to determine the weights assigned to each metric.Thirdly,a graph convolutional neural network is adopted with a self-attention mechanism to predict the value of weighting factors.Finally,we transform the cut selection process into a Markov decision process and utilize RL method to train the model.Extensive experiments are conducted based on a leading open-source MILP solver SCIP.Results on both general and specific datasets validate the effectiveness and efficiency of our proposed approach.
基金co-supported by the National Natural Science Foundation of China(Nos.62003267,61573285)the Aeronautical Science Foundation of China(ASFC)(No.20175553027)Natural Science Basic Research Plan in Shaanxi Province of China(No.2020JQ-220)。
文摘Unmanned Aerial Vehicles(UAVs)play a vital role in military warfare.In a variety of battlefield mission scenarios,UAVs are required to safely fly to designated locations without human intervention.Therefore,finding a suitable method to solve the UAV Autonomous Motion Planning(AMP)problem can improve the success rate of UAV missions to a certain extent.In recent years,many studies have used Deep Reinforcement Learning(DRL)methods to address the AMP problem and have achieved good results.From the perspective of sampling,this paper designs a sampling method with double-screening,combines it with the Deep Deterministic Policy Gradient(DDPG)algorithm,and proposes the Relevant Experience Learning-DDPG(REL-DDPG)algorithm.The REL-DDPG algorithm uses a Prioritized Experience Replay(PER)mechanism to break the correlation of continuous experiences in the experience pool,finds the experiences most similar to the current state to learn according to the theory in human education,and expands the influence of the learning process on action selection at the current state.All experiments are applied in a complex unknown simulation environment constructed based on the parameters of a real UAV.The training experiments show that REL-DDPG improves the convergence speed and the convergence result compared to the state-of-the-art DDPG algorithm,while the testing experiments show the applicability of the algorithm and investigate the performance under different parameter conditions.
基金This work is supported by the National Natural Science Foundation of China(Grants Nos.11672146 and 11432001).
文摘In this study,a real-time optimal control approach is proposed using an interactive deep reinforcement learning algorithm for the Moon fuel-optimal landing problem.Considering the remote communication restrictions and environmental uncertainties,advanced landing control techniques are demanded to meet the high requirements of real-time performance and autonomy in the Moon landing missions.Deep reinforcement learning(DRL)algorithms have been recently developed for real-time optimal control but suffer the obstacles of slow convergence and difficult reward function design.To address these problems,a DRL algorithm is developed using an actor-indirect method architecture to achieve the optimal control of the Moon landing mission.In this DRL algorithm,an indirect method is employed to generate the optimal control actions for the deep neural network(DNN)learning,while the trained DNNs provide good initial guesses for the indirect method to promote the efficiency of training data generation.Through sufficient learning of the state-action relationship,the trained DNNs can approximate the optimal actions and steer the spacecraft to the target in real time.Additionally,a nonlinear feedback controller is developed to improve the terminal landing accuracy.Numerical simulations are given to verify the effectiveness of the proposed DRL algorithm and demonstrate the performance of the developed optimal landing controller.
基金This work was supported by the National Key Research and Development Program of China(2017YFB1002104)the National Natural Science Foundation of China(Grant No.U1811461)the Innovation Program of Institute of Computing Technology,CAS.
文摘Richly formatted documents,such as financial disclosures,scientific articles,government regulations,widely exist on Web.However,since most of these documents are only for public reading,the styling information inside them is usually missing,making them improper or even burdensome to be displayed and edited in different formats and platforms.In this study we formulate the task of document styling restoration as an optimization problem,which aims to identify the styling settings on the document elements,e.g.,lines,table cells,text,so that rendering with the output styling settings results in a document,where each element inside it holds the(closely)exact position with the one in the original document.Considering that each styling setting is a decision,this problem can be transformed as a multi-step decision-making task over all the document elements,and then be solved by reinforcement learning.Specifically,Monte-Carlo Tree Search(MCTS)is leveraged to explore the different styling settings,and the policy function is learnt under the supervision of the delayed rewards.As a case study,we restore the styling information inside tables,where structural and functional data in the documents are usually presented.Experiment shows that,our best reinforcement method successfully restores the stylings in 87.65%of the tables,with 25.75%absolute improvement over the greedymethod.We also discuss the tradeoff between the inference time and restoration success rate,and argue that although the reinforcement methods cannot be used in real-time scenarios,it is suitable for the offline tasks with high-quality requirement.Finally,this model has been applied in a PDF parser to support cross-format display.
文摘Purpose-This purpose of this paper is to provide an overview of the theoretical background and applications of inverse reinforcement learning(IRL).Design/methodology/approach-Reinforcement learning(RL)techniques provide a powerful solution for sequential decision making problems under uncertainty.RL uses an agent equipped with a reward function to find a policy through interactions with a dynamic environment.However,one major assumption of existing RL algorithms is that reward function,the most succinct representation of the designer’s intention,needs to be provided beforehand.In practice,the reward function can be very hard to specify and exhaustive to tune for large and complex problems,and this inspires the development of IRL,an extension of RL,which directly tackles this problem by learning the reward function through expert demonstrations.In this paper,the original IRL algorithms and its close variants,as well as their recent advances are reviewed and compared.Findings-This paper can serve as an introduction guide of fundamental theory and developments,as well as the applications of IRL.Originality/value-This paper surveys the theories and applications of IRL,which is the latest development of RL and has not been done so far.
基金supported by the National Natural Science Foundation of China (No.62176059)the Shanghai Municipal Science and Technology Major Project (No.2018SHZDZX01)Zhangjiang Lab,and the Shanghai Center for Brain Science and Brain-inspired Technology。
文摘As one of the most fundamental topics in reinforcement learning(RL),sample efficiency is essential to the deployment of deep RL algorithms.Unlike most existing exploration methods that sample an action from different types of posterior distributions,we focus on the policy sampling process and propose an efficient selective sampling approach to improve sample efficiency by modeling the internal hierarchy of the environment.Specifically,we first employ clustering methods in the policy sampling process to generate an action candidate set.Then we introduce a clustering buffer for modeling the internal hierarchy,which consists of on-policy data,off-policy data,and expert data to evaluate actions from the clusters in the action candidate set in the exploration stage.In this way,our approach is able to take advantage of the supervision information in the expert demonstration data.Experiments on six different continuous locomotion environments demonstrate superior reinforcement learning performance and faster convergence of selective sampling.In particular,on the LGSVL task,our method can reduce the number of convergence steps by 46.7%and the convergence time by 28.5%.Furthermore,our code is open-source for reproducibility.The code is available at https://github.com/Shihwin/SelectiveSampling.
文摘在基于SDN架构的混合卫星网络上讨论了服务功能链(Service Function Chain,SFC)的可靠性部署问题,首先对SFC可靠性保护的问题进行描述,建立了底层网络与SFC请求模型,然后建立了网络服务功能的可靠性需求模型与低轨卫星链路的可靠性需求模型,明确了优化目标与约束条件。接着提出基于可靠性的卫星服务功能链保护方法,包括基于深度强化学习的可靠性保护算法和基于低轨卫星节点与链路可靠性备份算法。实验表明,提出的基于可靠性的卫星服务功能链保护方法能在SDN架构的混合卫星网络上提高SFC请求接受率,减少平均时延,在不同的SFC可靠性需求的条件下也保持较高的请求接受率。