Traditional expert-designed branching rules in branch-and-bound (B&B) are static and often fail to adapt to diverse and evolving problem instances. Crafting these rules is labor-intensive and may not scale well with complex problems. Given the frequent need to solve varied combinatorial optimization problems, leveraging statistical learning to auto-tune B&B algorithms for specific problem classes becomes attractive. This paper proposes a graph pointer network model to learn branching rules. Graph features, global features, and historical features are designed to represent the solver state. A graph neural network processes the graph features, while the pointer mechanism incorporates the global and historical features to determine the variable on which to branch. The model is trained to imitate the expert strong branching rule with a tailored top-k Kullback-Leibler divergence loss function. Experiments on a series of benchmark problems demonstrate that the proposed approach significantly outperforms widely used expert-designed branching rules. It also outperforms state-of-the-art machine-learning-based branch-and-bound methods in solving speed and search-tree size on all test instances. In addition, the model generalizes to unseen instances and scales to larger instances.
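As a rough illustration of such an objective, the sketch below shows one way a top-k Kullback-Leibler imitation loss might be written in PyTorch: both distributions are renormalized over the expert's k highest-scoring candidate variables before the divergence is taken. The function name, the renormalization choice, and the value of k are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def topk_kl_loss(student_logits, expert_scores, k=10):
    """KL divergence restricted to the expert's k best branching candidates.

    student_logits: (batch, n_vars) raw scores from the policy network.
    expert_scores:  (batch, n_vars) strong-branching scores (the target).
    """
    # Indices of the k most promising variables according to the expert.
    topk_idx = expert_scores.topk(k, dim=-1).indices
    # Renormalize both distributions over those k candidates only.
    expert_p = F.softmax(expert_scores.gather(-1, topk_idx), dim=-1)
    student_logp = F.log_softmax(student_logits.gather(-1, topk_idx), dim=-1)
    # KL(expert || student), averaged over the batch.
    return F.kl_div(student_logp, expert_p, reduction="batchmean")
```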
Mobile Edge Computing (MEC) is promising to alleviate the computation and storage burdens of terminals in wireless networks. The huge energy consumption of MEC servers challenges the establishment of smart cities and limits their service time when powered by rechargeable batteries. In addition, the Orthogonal Multiple Access (OMA) technique cannot fully and efficiently utilize the limited spectrum resources. Therefore, Non-Orthogonal Multiple Access (NOMA)-based energy-efficient task scheduling among MEC servers for delay-constrained mobile applications is important, especially in highly dynamic vehicular edge computing networks. The various movement patterns of vehicles lead to unbalanced offloading requirements and different load pressures on MEC servers. Self-Imitation Learning (SIL)-based Deep Reinforcement Learning (DRL) has emerged as a promising machine learning technique for breaking through obstacles in various research fields, especially in time-varying networks. In this paper, we first introduce related MEC technologies in vehicular networks. We then propose a DRL-based energy-efficient approach for task scheduling in vehicular edge computing networks that both guarantees the task latency requirement for multiple users and minimizes the total energy consumption of MEC servers. Numerical results demonstrate that the proposed algorithm outperforms other methods.
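The abstract does not spell out the SIL update; for orientation, here is a minimal sketch of the standard self-imitation learning loss (Oh et al., 2018) that such a DRL scheduler could build on. All names and the toy batch are illustrative.

```python
import torch

def sil_losses(log_probs, values, returns):
    """Self-imitation learning objective (Oh et al., 2018): replayed
    transitions are imitated only when their observed return exceeded
    the current value estimate, i.e. the advantage (R - V)+ is positive."""
    advantage = (returns - values).clamp(min=0.0)           # (R - V)+
    policy_loss = -(log_probs * advantage.detach()).mean()  # imitate good past actions
    value_loss = 0.5 * advantage.pow(2).mean()              # pull V up toward R
    return policy_loss, value_loss

# Toy batch of replayed transitions.
lp = torch.log(torch.tensor([0.2, 0.6, 0.9]))
v = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
r = torch.tensor([1.5, 1.0, 4.0])
print(sil_losses(lp, v, r))
```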
Providing autonomous systems with an effective quantity and quality of information for a desired task is challenging. In particular, autonomous vehicles must have a reliable vision of their workspace to robustly accomplish driving functions. In machine vision, deep learning techniques, and specifically convolutional neural networks, have proven to be the state-of-the-art technology in the field. As these networks typically involve millions of parameters and elements, designing an optimal architecture for deep learning structures is a difficult task that is under investigation by researchers worldwide. This study experimentally evaluates the impact of three major architectural properties of convolutional networks, namely the number of layers, the number of filters, and the filter size, on their performance. Several models with different properties are developed, equally trained, and then applied to an autonomous car in a realistic simulation environment. A new ensemble approach is also proposed to calculate and update weights for the models based on their mean squared error values. Performance results are reported and compared across the design properties. Surprisingly, the number of filters by itself does not largely affect performance; rather, proper allocation of filters with different kernel sizes across the layers yields a considerable improvement. The findings of this study give researchers a clear direction in designing optimal network architectures for deep learning purposes.
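The abstract leaves the exact weighting rule unspecified; one plausible reading, sketched below in NumPy, is to weight each model inversely to its mean squared error and normalize. This is an assumption for illustration, not the paper's exact rule.

```python
import numpy as np

def ensemble_weights(mse_per_model):
    """Weight each model inversely to its validation MSE, then normalize."""
    inv = 1.0 / np.asarray(mse_per_model, dtype=float)
    return inv / inv.sum()

# Three steering models with validation MSEs; the worst gets the least say.
w = ensemble_weights([0.12, 0.08, 0.20])
steering = w @ np.array([0.31, 0.28, 0.35])  # weighted ensemble prediction
```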
Edge computation offloading allows mobile end devices to execute compute-intensive tasks on edge servers. End devices can decide online whether tasks are offloaded to edge servers, offloaded to cloud servers, or executed locally, according to the current network conditions and the devices' profiles. In this paper, we propose an edge computation offloading framework based on deep imitation learning (DIL) and knowledge distillation (KD) that assists end devices in quickly making fine-grained decisions to optimize the delay of computation tasks online. We formalize the computation offloading problem as a multi-label classification problem. Training samples for our DIL model are generated offline. After the model is trained, we leverage KD to obtain a lightweight DIL model, which further reduces the model's inference delay. Numerical experiments show that the offloading decisions made by our model not only outperform those of related policies in the latency metric but also have the shortest inference delay among all policies.
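For concreteness, here is a hedged PyTorch sketch of how the distillation step might look when adapted to the multi-label formulation above: the lightweight student is fit to both the hard labels and the large teacher's temperature-softened outputs. The mixing weight alpha and temperature T are illustrative, not the paper's values.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Distillation for a multi-label offloading decision: combine the
    loss on ground-truth labels with the loss on the teacher's softened
    probabilities (temperature-scaled logits)."""
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    soft = F.binary_cross_entropy_with_logits(
        student_logits / T, torch.sigmoid(teacher_logits / T))
    return alpha * hard + (1.0 - alpha) * (T * T) * soft
```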
The intermittency of renewable energy generation, the variability of load demand, and the stochasticity of market prices pose direct challenges to the optimal energy management of microgrids. To cope with these different forms of operational uncertainty, an imitation learning based real-time decision-making solution for microgrid economic dispatch is proposed. In this solution, the optimal dispatch trajectories obtained by solving the optimization problem on historical deterministic operation patterns serve as the expert samples for imitation learning. To improve the generalization performance of imitation learning and the expressive ability of uncertain variables, a hybrid model combining unsupervised and supervised learning is utilized. A denoising autoencoder based unsupervised learning model enhances the feature extraction of operation patterns. A long short-term memory network based supervised learning model then efficiently characterizes the mapping between the input space, composed of the extracted operation patterns and system state variables, and the output space, composed of the optimal dispatch trajectories. Numerical simulation results demonstrate that under various operational uncertainties, the operating cost achieved by the proposed solution is close to the theoretical minimum. Compared with the traditional model predictive control method and a basic clone imitation learning method, the operating cost of the proposed solution is reduced by 6.3% and 2.8%, respectively, over a test period of three months.
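A compact sketch of how such a hybrid could be wired in PyTorch follows; all dimensions and layer sizes are illustrative placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DispatchImitator(nn.Module):
    """Denoising autoencoder for pattern features plus an LSTM that maps
    (extracted pattern, system state) sequences to dispatch trajectories."""

    def __init__(self, pattern_dim=48, state_dim=8, latent_dim=16,
                 hidden_dim=64, n_units=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(pattern_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, pattern_dim)  # for DAE pretraining only
        self.lstm = nn.LSTM(latent_dim + state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_units)         # set-point per unit

    def forward(self, noisy_pattern, state):
        z = self.encoder(noisy_pattern)                    # (batch, T, latent)
        out, _ = self.lstm(torch.cat([z, state], dim=-1))  # (batch, T, hidden)
        return self.head(out)                              # dispatch trajectory
```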
Learning from demonstration (LfD) is an appealing method for helping robots learn new skills. Numerous papers have presented LfD methods with good performance in robotics. However, complicated robot tasks that require carefully regulated path planning strategies remain an open problem. Contact or non-contact constraints in specific robot tasks make the path planning problem more difficult, as the interaction between the robot and the environment is time-varying. In this paper, we focus on the path planning of complex robot tasks in the domain of LfD and give a novel perspective for classifying imitation learning and inverse reinforcement learning, based on constraints and obstacle avoidance. Finally, we summarize these methods and present promising directions for robot applications and LfD theory.
With the increasing penetration of renewable energy, power grid operators are observing fast and large fluctuations in power and voltage profiles on a daily basis. Fast and accurate control actions derived in real time are vital to ensure system security and economy. To this end, solving alternating current (AC) optimal power flow (OPF) with operational constraints remains an important yet challenging optimization problem for the secure and economic operation of the power grid. This paper adopts a novel method to derive fast OPF solutions using a state-of-the-art deep reinforcement learning (DRL) algorithm, which can greatly assist power grid operators in making rapid and effective decisions. The presented method adopts imitation learning to generate initial weights for the neural network (NN), and a proximal policy optimization algorithm to train and test stable and robust artificial intelligence (AI) agents. Training and testing procedures are conducted on the IEEE 14-bus and the Illinois 200-bus systems. The results show the effectiveness of the method, with significant potential for assisting power grid operators in real-time operations.
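The imitation learning warm start can be read as behavioral cloning on precomputed OPF solutions; a minimal PyTorch sketch under that assumption follows (the function name and hyperparameters are illustrative, not from the paper).

```python
import torch

def pretrain_policy(policy, expert_states, expert_actions, epochs=50, lr=1e-3):
    """Behavioral cloning warm start: fit the policy network to OPF
    solutions computed offline, so PPO begins from sensible weights
    instead of a random initialization."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(expert_states), expert_actions)
        loss.backward()
        opt.step()
    return policy
```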
We propose a new framework for entity and event extraction based on generative adversarial imitation learning, an inverse reinforcement learning method using a generative adversarial network (GAN). We assume that instances and labels vary in difficulty, and that the gains and penalties (rewards) should be correspondingly diverse. We utilize discriminators to estimate proper rewards according to the difference between the labels committed by the ground truth (expert) and by the extractor (agent). Our experiments demonstrate that the proposed framework outperforms state-of-the-art methods.
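One common way such a discriminator-derived reward is implemented in GAIL-style methods is sketched below; the exact reward shaping used in the paper may differ.

```python
import torch

def imitation_reward(discriminator, state, action, eps=1e-8):
    """Surrogate reward from the discriminator: the more the extractor's
    (state, action) pair resembles the expert's, the larger the reward."""
    d = torch.sigmoid(discriminator(state, action))  # P(expert | s, a)
    return -torch.log(1.0 - d + eps)                 # high when D is fooled
```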
This paper is concerned with mapless navigation for unmanned aerial vehicles in scenarios with limited sensor accuracy and computing capability. A novel learning-based algorithm called soft actor-critic from demonstrations (SACfD) is proposed, integrating reinforcement learning with imitation learning. Specifically, the maximum entropy reinforcement learning framework is introduced to enhance the exploration capability of the algorithm, upon which the paper explores a way to sufficiently leverage demonstration data to significantly accelerate convergence while reliably improving policy performance. The proposed algorithm enables mapless navigation for unmanned aerial vehicles, and experimental results show that it outperforms existing algorithms.
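One simple mechanism for leveraging demonstrations in an off-policy learner such as SAC is to mix a fixed fraction of expert transitions into every training batch; a sketch follows, with the ratio chosen purely for illustration (the paper's mechanism may differ).

```python
import random

def sample_batch(agent_buffer, demo_buffer, batch_size=256, demo_ratio=0.25):
    """Mix a fixed fraction of expert demonstration transitions into each
    off-policy training batch alongside the agent's own experience."""
    n_demo = int(batch_size * demo_ratio)
    batch = random.sample(demo_buffer, n_demo)
    batch += random.sample(agent_buffer, batch_size - n_demo)
    return batch
```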
Reinforcement learning (RL) has shown significant success in sequential decision making in fields such as autonomous vehicles, robotics, marketing, and gaming. This success has attracted attention to RL as a control approach for building energy systems, which are becoming more complicated due to the need to optimize for multiple, potentially conflicting goals such as occupant comfort, energy use, and grid interactivity. However, for real-world applications, RL has several drawbacks, such as requiring large amounts of training data and time, and unstable control behavior during the early exploration process, making it infeasible to apply directly to building control tasks. To address these issues, an imitation learning approach is utilized herein, where the RL agent starts with a policy transferred from accepted rule-based and heuristic policies. This approach succeeds in reducing the training time, preventing unstable early exploration behavior, and improving upon an accepted rule-based policy, all of which make RL a more practical control approach for real-world applications in the domain of building controls.
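As an illustration of transferring a rule-based policy, the sketch below rolls a toy heuristic over historical observations to build a cloning dataset that the RL policy can be fit to before any exploration. The heuristic and its thresholds are invented for illustration, not taken from the paper.

```python
import random

def rule_based_policy(obs):
    """Toy stand-in for an accepted heuristic: pre-cool when the zone is
    warm and electricity is cheap."""
    zone_temp, price = obs
    return 1.0 if zone_temp > 24.0 and price < 0.10 else 0.0

# Roll the heuristic over (simulated) historical observations to build a
# cloning dataset; the RL policy is fit to these pairs before exploring.
historical_obs = [(random.uniform(18, 30), random.uniform(0.05, 0.30))
                  for _ in range(1000)]
dataset = [(obs, rule_based_policy(obs)) for obs in historical_obs]
```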
Generative adversarial imitation learning (GAIL) directly imitates the behavior of experts from human demonstrations instead of designing explicit reward signals as in reinforcement learning. GAIL overcomes the defects of traditional imitation learning by using a generative adversarial network framework and shows excellent performance in many fields. However, GAIL acts directly on immediate rewards, a feature that is reflected in the value function only after a period of accumulation. Thus, when faced with complex practical problems, the learning efficiency of GAIL is often extremely low and the policy may be slow to learn. One way to solve this problem is to directly guide the action (policy) during the agent's learning process, as in the control sharing (CS) method. This paper combines reinforcement learning and imitation learning and proposes a novel GAIL framework called generative adversarial imitation learning based on control sharing policy (GACS). GACS learns model constraints from expert samples and uses adversarial networks to guide learning directly. The actions produced by the adversarial networks are used to optimize the policy and effectively improve learning efficiency. Experiments in an autonomous driving environment and the real-time strategy game Breakout show that GACS has better generalization capability, imitates expert behavior more efficiently, and learns better policies than other frameworks.
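The control-sharing idea can be pictured as a stochastic switch between the policy's own action and the guide's action; a toy sketch follows (the switching rule and any schedule for beta are assumptions, not the paper's algorithm).

```python
import random

def shared_action(policy_action, guide_action, beta):
    """With probability beta, execute the guide network's action instead of
    the policy's own; annealing beta toward zero returns full control to
    the learned policy as it improves."""
    return guide_action if random.random() < beta else policy_action
```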
This paper studies imitation learning in nonlinear multi-player game systems with heterogeneous control input dynamics. We propose a model-free, data-driven inverse reinforcement learning (RL) algorithm for a learner to find the cost functions of an N-player Nash expert system given the expert's states and control inputs. This allows us to address the imitation learning problem without prior knowledge of the expert's system dynamics. To achieve this, we first provide a basic model-based algorithm built upon RL and inverse optimal control. This serves as the foundation for our final model-free inverse RL algorithm, which is implemented via neural network-based value function approximators. Theoretical analysis and simulation examples verify the methods.
Imitation learning is a control design paradigm that seeks to learn a control policy reproducing demonstrations from expert agents. By using optimal behaviours in place of expert demonstrations, the same paradigm leads to the design of control policies closely approximating the optimal state-feedback. This approach requires training a machine learning algorithm (in our case, deep neural networks) directly on state-control pairs originating from optimal trajectories. We have shown in previous work that, when restricted to low-dimensional state and control spaces, this approach is very successful in several deterministic, non-linear, continuous-time problems. In this work, we refine our previous studies using as a test case a simple quadcopter model with quadratic and time-optimal objective functions. We describe in detail the best learning pipeline we have developed, which approximates the state-feedback map via deep neural networks to very high accuracy. We introduce the use of the softplus activation function in the hidden units of the neural networks, showing that it results in a smoother control profile whilst retaining the benefits of rectifiers. We show how to evaluate the optimality of the trained state-feedback and find that, already with two layers, the objective function reached and its optimal value differ by less than one percent. We also consider an additional metric linked to the system's asymptotic behaviour: the time taken to converge to the policy's fixed point. With respect to these metrics, we show that improvements in the mean absolute error do not necessarily correspond to better policies.
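The softplus function itself is standard: softplus(x) = log(1 + e^x), a smooth, everywhere-differentiable approximation of the ReLU rectifier, which is what yields the smoother control profiles noted above. A small, numerically stable implementation:

```python
import numpy as np

def softplus(x):
    """softplus(x) = log(1 + exp(x)), written in a numerically stable form."""
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

x = np.linspace(-5.0, 5.0, 11)
print(softplus(x) - np.maximum(x, 0.0))  # the gap to ReLU vanishes away from 0
```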