Aimed at infinite horizon optimal control problems of discrete time-varying nonlinear systems, in this paper a new iterative adaptive dynamic programming algorithm, the discrete-time time-varying (DTTV) policy iteration algorithm, is developed. The iterative control law is designed to update the iterative value function, which approximates the optimal performance index function. The admissibility of the iterative control law is analyzed. The results show that the iterative value function is non-increasingly convergent to the optimal solution of the Bellman equation. To implement the algorithm, neural networks are employed and a new implementation structure is established, which avoids solving the generalized Bellman equation in each iteration. Finally, the optimal control laws for torsional pendulum and inverted pendulum systems are obtained by using the DTTV policy iteration algorithm, where the mass and pendulum bar length are permitted to be time-varying parameters. The effectiveness of the developed method is illustrated by numerical results and comparisons.
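For reference, a generic time-invariant form of the two updates described above is sketched below in standard ADP notation; this is an illustrative simplification, since the algorithm in the paper carries an additional time index for the time-varying system, value function, and control law. For a system x_{k+1} = F(x_k, u_k) with utility U(x_k, u_k), iteration i of policy iteration performs

\begin{aligned}
\text{(policy evaluation)}\quad & V_i(x_k) = U\bigl(x_k, v_i(x_k)\bigr) + V_i\bigl(F(x_k, v_i(x_k))\bigr),\\
\text{(policy improvement)}\quad & v_{i+1}(x_k) = \arg\min_{u_k}\bigl\{ U(x_k, u_k) + V_i\bigl(F(x_k, u_k)\bigr) \bigr\},
\end{aligned}

so that, when the initial control law is admissible, the value-function sequence {V_i} is non-increasing and converges to the solution of the Bellman equation, as established in the paper.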
In order to address the output feedback issue for linear discrete-time systems, this work suggests a new adaptive dynamic programming (ADP) technique based on the internal model principle (IMP). The proposed method, termed IMP-ADP, does not require complete state feedback, merely the measurement of input and output data. More specifically, based on the IMP, the output control problem can first be converted into a stabilization problem. We then design an observer to reproduce the full state of the system by measuring the inputs and outputs. Moreover, this technique includes both a policy iteration algorithm and a value iteration algorithm to determine the optimal feedback gain without using a dynamic system model. Importantly, with this approach one does not need to solve the regulator equation. Finally, this control method was tested on an inverter system of grid-connected LCLs to demonstrate that the proposed method provides the desired performance in terms of both tracking and disturbance rejection.
The libration control problem of the space tether system (STS) for post-capture of a payload is studied. The process of payload capture will cause tether swing and deviation from the nominal position, resulting in the failure of the capture mission. Due to unknown inertial parameters after capturing the payload, an adaptive optimal control based on policy iteration is developed to stabilize the uncertain dynamic system in the post-capture phase. By introducing an integral reinforcement learning (IRL) scheme, the algebraic Riccati equation (ARE) can be solved online without known dynamics. To avoid the computational burden of the iteration equations, an online implementation of the policy iteration algorithm is provided by the least-squares solution method. Finally, the effectiveness of the algorithm is validated by numerical simulations.
We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful Alpha Zero chess program. In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components, each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function, and shared and perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach whereby, at every stage, the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents. For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and their approximate form, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
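As an illustration of the linear-versus-exponential computation comparison above, the following Python sketch contrasts standard (joint) rollout with one-agent-at-a-time rollout on a generic simulator interface; the function names (env_step, base_policy), the finite rollout horizon, and the truncated cost-to-go are assumptions made for this sketch, not the paper's implementation.

import itertools
from typing import Callable, List, Sequence

def rollout_cost(env_step: Callable, base_policy: Callable, state, joint_u, horizon: int) -> float:
    """Apply joint control joint_u, then simulate the base policy for `horizon` steps.
    env_step(state, joint_u) is assumed to return (next_state, stage_cost)."""
    s, total = env_step(state, joint_u)
    s, total = s, total
    for _ in range(horizon):
        s, c = env_step(s, base_policy(s))
        total += c
    return total

def multiagent_rollout(env_step, base_policy, state, action_sets: List[Sequence], horizon: int = 20):
    """One-agent-at-a-time rollout: agent i optimizes its own component while agents < i
    use their already-chosen components and agents > i use the base policy's components."""
    joint = list(base_policy(state))          # start from the base policy's joint control
    for i, U_i in enumerate(action_sets):
        best_u, best_cost = None, float("inf")
        for u in U_i:
            candidate = joint.copy()
            candidate[i] = u
            c = rollout_cost(env_step, base_policy, state, tuple(candidate), horizon)
            if c < best_cost:
                best_u, best_cost = u, c
        joint[i] = best_u
    return tuple(joint)

def standard_rollout(env_step, base_policy, state, action_sets, horizon: int = 20):
    """Standard rollout over the joint control space."""
    best, best_cost = None, float("inf")
    for joint_u in itertools.product(*action_sets):
        c = rollout_cost(env_step, base_policy, state, joint_u, horizon)
        if c < best_cost:
            best, best_cost = joint_u, c
    return best

With per-agent control sets of size q and n agents, the sequential scheme runs on the order of n*q rollout simulations per stage, versus q**n for the joint search, which is the comparison made in the abstract.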
Bolt assembly by robots is a vital and difficult task for replacing astronauts in extravehicular activities (EVA), but the trajectory efficiency still needs to be improved during insertion of the wrench into the hex hole of the bolt. In this paper, a policy iteration method based on reinforcement learning (RL) is proposed, by which the problem of trajectory efficiency improvement is formulated as an RL-based objective optimization problem. First, the projection relation between the raw data and the state-action space is established, and then a policy iteration initialization method is designed based on the projection to provide the initialization policy for iteration. Policy iteration based on the protective policy is applied to continuously evaluate and optimize the action-value function of all state-action pairs until convergence is obtained. To verify the feasibility and effectiveness of the proposed method, a noncontact demonstration experiment with human supervision is performed. Experimental results show that the initialization policy and the generated policy can be obtained by the policy iteration method in a limited number of demonstrations. A comparison between the experiments with two different assembly tolerances shows that the convergent generated policy possesses higher trajectory efficiency than the conservative one. In addition, this method can ensure safety during the training process and improve the utilization efficiency of demonstration data.
It is known that the performance potentials (or equivalently, perturbation realization factors) can be used as building blocks for performance sensitivities of Markov systems. In parameterized systems, the changes in parameters may only affect some states, and the explicit transition probability matrix may not be known. In this paper, we use an example to show that we can use potentials to construct performance sensitivities in a more flexible way; only the potentials at the affected states need to be estimated, and the transition probability matrix need not be known. Policy iteration algorithms, which are simpler than the standard one, can be established.
In this paper, we study the robustness property of policy optimization (particularly the Gauss-Newton gradient descent algorithm, which is equivalent to policy iteration in reinforcement learning) subject to noise at each iteration. By invoking the concept of input-to-state stability and utilizing Lyapunov's direct method, it is shown that, if the noise is sufficiently small, the policy iteration algorithm converges to a small neighborhood of the optimal solution even in the presence of noise at each iteration. Explicit expressions for the upper bound on the noise and the size of the neighborhood to which the policies ultimately converge are provided. Based on Willems' fundamental lemma, a learning-based policy iteration algorithm is proposed. The persistent excitation condition can be readily guaranteed by checking the rank of the Hankel matrix related to an exploration signal. The robustness of the learning-based policy iteration to measurement noise and unknown system disturbances is theoretically demonstrated by the input-to-state stability of the policy iteration. Several numerical simulations are conducted to demonstrate the efficacy of the proposed method.
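As a small illustration of the rank test mentioned above, the following numpy sketch builds a block Hankel matrix from an exploration signal and checks persistent excitation of a given order; the signal, the order L, and the helper names are illustrative assumptions, not the paper's code.

import numpy as np

def block_hankel(u: np.ndarray, L: int) -> np.ndarray:
    """Block Hankel matrix H_L(u) with L block rows, for an m-channel signal u of
    shape (T, m); each column stacks a length-L window of the signal."""
    T, m = u.shape
    cols = T - L + 1
    return np.vstack([u[i:i + cols].T for i in range(L)])      # shape (m*L, cols)

def is_persistently_exciting(u: np.ndarray, L: int, tol: float = 1e-9) -> bool:
    """u is persistently exciting of order L iff H_L(u) has full row rank m*L."""
    H = block_hankel(u, L)
    return np.linalg.matrix_rank(H, tol=tol) == H.shape[0]

# Example: a random scalar exploration input of length 200 is (generically)
# persistently exciting of order 10.
rng = np.random.default_rng(0)
u = rng.standard_normal((200, 1))
print(is_persistently_exciting(u, L=10))    # expected: True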
In this paper, a novel iterative Q-learning algorithm, called the "policy iteration based deterministic Q-learning algorithm", is developed to solve the optimal control problems for discrete-time deterministic nonlinear systems. The idea is to use an iterative adaptive dynamic programming (ADP) technique to construct the iterative control law which optimizes the iterative Q function. When the optimal Q function is obtained, the optimal control law can be achieved by directly minimizing the optimal Q function, where the mathematical model of the system is not necessary. The convergence property is analyzed to show that the iterative Q function is monotonically non-increasing and converges to the solution of the optimality equation. It is also proven that any of the iterative control laws is a stable control law. Neural networks are employed to implement the policy iteration based deterministic Q-learning algorithm, by approximating the iterative Q function and the iterative control law, respectively. Finally, two simulation examples are presented to illustrate the performance of the developed algorithm.
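The iteration structure described above can be illustrated on a tabular toy problem; the sketch below runs Q-function based policy iteration (evaluation of the current control law's Q function followed by greedy improvement) on an assumed finite, deterministic example, whereas the paper's algorithm uses neural-network approximation on continuous state and control spaces.

import numpy as np

# Toy deterministic problem: states 0..N-1 on a ring, actions move -1/0/+1, and the
# cost penalizes ring distance from state 0 plus control effort. (Illustrative only.)
N, actions, gamma = 11, [-1, 0, 1], 0.95
def step(s, a): return (s + a) % N
def cost(s, a): return min(s, N - s) ** 2 + 0.1 * a * a

policy = np.ones(N, dtype=int)                 # initial policy: action index 1, i.e., "stay"
for it in range(50):
    # Policy evaluation: solve Q(s,a) = cost(s,a) + gamma * Q(s', pi(s')) by fixed-point sweeps.
    Q = np.zeros((N, len(actions)))
    for _ in range(500):
        Q_new = np.array([[cost(s, a) + gamma * Q[step(s, a), policy[step(s, a)]]
                           for a in actions] for s in range(N)])
        if np.max(np.abs(Q_new - Q)) < 1e-10:
            Q = Q_new
            break
        Q = Q_new
    # Policy improvement: greedy (minimizing) with respect to the evaluated Q function.
    new_policy = Q.argmin(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("converged policy (action index per state):", policy)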
We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation, when done by the projected equation/TD approach, may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
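As a small illustration of the matrix-inversion type of policy evaluation surveyed above, the following sketch implements LSTD(0) with linear cost-function approximation and a small ridge term to guard against a nearly singular system matrix; the data format, feature map, and regularization are assumptions of this sketch rather than any specific method from the survey.

import numpy as np

def lstd0(transitions, phi, n_features: int, gamma: float = 0.95, reg: float = 1e-6):
    """LSTD(0): solve A w = b with A = sum phi(s) (phi(s) - gamma*phi(s'))^T and
    b = sum c * phi(s), where `transitions` is an iterable of (s, cost, s_next)
    generated under the policy being evaluated. The ridge term `reg` stabilizes
    the solve when A is nearly singular."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, c, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += c * f
    w = np.linalg.solve(A + reg * np.eye(n_features), b)
    return w        # approximate cost function: J(s) ≈ phi(s) @ w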
This paper studies the policy iteration algorithm (PIA) for zero-sum stochastic differential games with the basic long-run average criterion, as well as with its more selective version, the so-called bias criterion. The system is assumed to be a nondegenerate diffusion. We use Lyapunov-like stability conditions that ensure the existence and boundedness of the solution to a certain Poisson equation. We also ensure the convergence of a sequence of such solutions, of the corresponding sequence of policies, and, ultimately, of the PIA.
In this paper, the optimal consensus control problem is investigated for heterogeneous linear multi-agent systems (MASs) with a spanning tree condition, based on game theory and reinforcement learning. First, the graphical minimax game algebraic Riccati equation (ARE) is derived by converting the consensus problem into a zero-sum game problem between each agent and its neighbors. The asymptotic stability and minimax validation of the closed-loop systems are proved theoretically. Then, a data-driven off-policy reinforcement learning algorithm is proposed to learn the optimal control policy online without the information of the system dynamics. A certain rank condition is established to guarantee the convergence of the proposed algorithm to the unique solution of the ARE. Finally, the effectiveness of the proposed method is demonstrated through a numerical simulation.
This paper introduces a model-free reinforcement learning technique that is used to solve a class of dynamic games known as dynamic graphical games. The graphical game results from multi-agent dynamical systems, where pinning control is used to make all the agents synchronize to the state of a command generator or a leader agent. Novel coupled Bellman equations and Hamiltonian functions are developed for the dynamic graphical games. The Hamiltonian mechanics are used to derive the necessary conditions for optimality. The solution for the dynamic graphical game is given in terms of the solution to a set of coupled Hamilton-Jacobi-Bellman equations developed herein. The Nash equilibrium solution for the graphical game is given in terms of the solution to the underlying coupled Hamilton-Jacobi-Bellman equations. An online model-free policy iteration algorithm is developed to learn the Nash solution for the dynamic graphical game. This algorithm does not require any knowledge of the agents' dynamics. A proof of convergence for this multi-agent learning algorithm is given under a mild assumption about the interconnectivity properties of the graph. A gradient descent technique with critic network structures is used to implement the policy iteration algorithm to solve the graphical game online in real time.
The rapid progress of cloud technology has attracted a growing number of video providers to consider deploying their streaming services onto cloud platforms for more cost-effective, scalable and reliable performance. In this paper, we utilize a Markov decision process model to formulate the dynamic deployment of cloud-based video services over multiple geographically distributed datacenters. We focus on maximizing the average profit for the video service provider over the long run and introduce an average performance criterion which reflects the cost and user experience jointly. We develop an optimal algorithm based on sensitivity analysis and sample-based policy iteration to obtain the optimal video placement and request dispatching strategy. We demonstrate the optimality of our algorithm with a theoretical proof and show its practical feasibility. We conduct simulations to evaluate the performance of our algorithm, and the results show that our strategy can effectively cut down the total cost and guarantee users' quality of experience (QoE).
The goal of this paper is to design a model-free optimal controller for multirate systems based on reinforcement learning. Sampled-data control systems are widely used in industrial production processes, and multirate sampling has attracted much attention in the study of sampled-data control theory. In this paper, we assume the sampling periods for state variables are different from the periods for system inputs. Under this condition, we can obtain an equivalent discrete-time system using the lifting technique. Then, we provide an algorithm to solve the linear quadratic regulator (LQR) control problem of multirate systems with the utilization of matrix substitutions. Based on a reinforcement learning method, we use online policy iteration and off-policy algorithms to optimize the controller for multirate systems. By using the least squares method, we convert the off-policy algorithm into a model-free reinforcement learning algorithm, which only requires the input and output data of the system. Finally, we use an example to illustrate the applicability and efficiency of the above-mentioned model-free algorithm.
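For reference, the model-based policy-iteration (Hewer-type) recursion that underlies such LQR schemes is sketched below on an assumed single-rate example; the lifted multirate dynamics and the data-driven off-policy version developed in the paper are not reproduced here, and the matrices are illustrative.

import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

# Illustrative discrete-time system x_{k+1} = A x_k + B u_k with cost sum x'Qx + u'Ru.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

K = np.array([[1.0, 2.0]])           # initial stabilizing gain for u = -K x (assumed admissible)
for i in range(30):
    Acl = A - B @ K
    # Policy evaluation: solve the Stein/Lyapunov equation P = Acl' P Acl + Q + K' R K.
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # Policy improvement: K <- (R + B' P B)^{-1} B' P A.
    K_new = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    if np.linalg.norm(K_new - K) < 1e-10:
        K = K_new
        break
    K = K_new

P_star = solve_discrete_are(A, B, Q, R)
print("PI gain:     ", K)
print("Riccati gain:", np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A))

Given an initial stabilizing gain, the evaluation/improvement pair above converges to the Riccati gain; the off-policy, least-squares variant described in the abstract estimates the same quantities from measured input and output data instead of from the model.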
This paper presents a novel optimal synchronization control method for multi-agent systems with input saturation. The multi-agent game theory is introduced to transform the optimal synchronization control problem into a multi-agent nonzero-sum game. Then, the Nash equilibrium can be achieved by solving the coupled Hamilton–Jacobi–Bellman (HJB) equations with nonquadratic input energy terms. A novel off-policy reinforcement learning method is presented to obtain the Nash equilibrium solution without the system models, and the critic neural networks (NNs) and actor NNs are introduced to implement the presented method. Theoretical analysis is provided, which shows that the iterative control laws converge to the Nash equilibrium. Simulation results show the good performance of the presented method.
The H∞ control method is an effective approach for attenuating the effect of disturbances on practical systems, but it is difficult to obtain the H∞ controller due to the nonlinear Hamilton-Jacobi-Isaacs equation, even for linear systems. This study deals with the design of an H∞ controller for linear discrete-time systems. To solve the related game algebraic Riccati equation (GARE), a novel model-free minimax Q-learning method is developed, on the basis of an offline policy iteration algorithm, which is shown to be Newton's method for solving the GARE. The proposed minimax Q-learning method, which employs off-policy reinforcement learning, learns the optimal control policies for the controller and the disturbance online, using only the state samples generated by the implemented behavior policies. Different from existing Q-learning methods, a novel gradient-based policy improvement scheme is proposed. We prove that the minimax Q-learning method converges to the saddle solution under initially admissible control policies and an appropriate positive learning rate, provided that certain persistence of excitation (PE) conditions are satisfied. In addition, the PE conditions can be easily met by choosing appropriate behavior policies containing certain excitation noises, without causing any excitation noise bias. In the simulation study, we apply the proposed minimax Q-learning method to design an H∞ load-frequency controller for an electrical power system generator that suffers from load disturbance, and the simulation results indicate that the obtained H∞ load-frequency controller has good disturbance rejection performance.
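For context, one commonly used form of the GARE for the linear discrete-time zero-sum setting is recalled below; the weighting matrices and notation are assumptions of this sketch and may differ from those in the paper. For x_{k+1} = A x_k + B u_k + E w_k with stage cost x_k^{T} Q x_k + u_k^{T} R u_k - \gamma^{2} w_k^{T} w_k, the GARE can be written as

P = A^{T} P A + Q -
\begin{bmatrix} A^{T} P B & A^{T} P E \end{bmatrix}
\begin{bmatrix} R + B^{T} P B & B^{T} P E \\ E^{T} P B & E^{T} P E - \gamma^{2} I \end{bmatrix}^{-1}
\begin{bmatrix} B^{T} P A \\ E^{T} P A \end{bmatrix},

and a policy-iteration pass (evaluating the current controller and disturbance gains through a Lyapunov-type equation, then updating both gains from the resulting P) corresponds to the Newton step for this equation that the abstract refers to.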
The solution of minimum-time feedback optimal control problems is generally achieved using the dynamic programming approach, in which the value function must be computed on numerical grids with a very large number of points. Classical numerical strategies, such as value iteration (VI) or policy iteration (PI) methods, become very inefficient if the number of grid points is large. This is a strong limitation to their use in real-world applications. To address this problem, the authors present a novel multilevel framework, where classical VI and PI are embedded in a full-approximation storage (FAS) scheme. In fact, the authors show that VI and PI have excellent smoothing properties, a fact that makes them very suitable for use in multilevel frameworks. Moreover, a new smoother is developed by accelerating VI using Anderson's extrapolation technique. The effectiveness of the new scheme is demonstrated by several numerical experiments.
This paper deals with Markov decision processes with a target set for nonpositive rewards. Two types of threshold probability criteria are discussed. The first criterion is the probability that the total reward is not greater than a given initial threshold value, and the second is the probability that the total reward is less than it. Our first (resp. second) optimizing problem is to minimize the first (resp. second) threshold probability. These problems suggest that the threshold value is a permissible level of the total reward to reach a goal (the target set), that is, we would reach this set over the level, if possible. For both problems, we show that 1) the optimal threshold probability is a unique solution to an optimality equation, 2) there exists an optimal deterministic stationary policy, and 3) a value iteration and a policy space iteration are given. In addition, we prove that the first (resp. second) optimal threshold probability is a monotonically increasing and right (resp. left) continuous function of the initial threshold value, and we propose a method to obtain an optimal policy and the optimal threshold probability for the first problem by using those for the second problem.
The optimal control of a Markov jump linear quadratic model with controlled jump probabilities of modes is investigated. Two kinds of mode control policies, i.e., the open-loop control policy and the closed-loop control policy, are considered. Using the concepts of policy iteration and performance potential, a sufficient condition under which the optimal closed-loop control policy performs better than the optimal open-loop control policy is proposed. The condition is helpful for the design of an optimal controller. Furthermore, an efficient algorithm to construct a closed-loop control policy that is better than the optimal open-loop control policy is given with policy iteration.