Funding: supported by the National Natural Science Foundation of China (61303108), the Suzhou Key Industries Technological Innovation-Prospective Applied Research Project (SYG201804), a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), and the Fundamental Research Funds for the Central Universities, JLU (93K172020K25).
Abstract: In reinforcement learning, an agent may explore ineffectively in sparse-reward tasks, where reward points are difficult to find. To address this problem, we propose hierarchical deep reinforcement learning with automatic sub-goal identification via computer vision (HADS), which exploits hierarchical reinforcement learning to alleviate the sparse-reward problem and improves exploration efficiency through a sub-goal mechanism. HADS uses a computer vision method to identify sub-goals automatically for hierarchical deep reinforcement learning. Because not all sub-goal points are reachable, a mechanism is proposed to remove unreachable sub-goal points and further improve performance. HADS applies contour recognition to identify sub-goals from the state image: salient states in the state image may be recognized as sub-goals, and those that are not valid sub-goals are removed based on prior knowledge. Our experiments verify the effectiveness of the algorithm.
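A minimal sketch of the kind of contour-based sub-goal detection described above, using OpenCV; the binarization threshold, area filter, and reachability predicate are illustrative assumptions, not the authors' implementation:

```python
# Sketch: contour-based sub-goal candidate detection on a grayscale state image.
# Assumes OpenCV (cv2) and numpy; thresholds and the area filter are illustrative.
import cv2
import numpy as np

def find_subgoal_candidates(state_image: np.ndarray, min_area: float = 20.0):
    """Return centroids of salient contours as sub-goal candidates."""
    # Binarize the state image so salient objects stand out.
    _, binary = cv2.threshold(state_image, 127, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    candidates = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue  # ignore noise-sized blobs
        m = cv2.moments(contour)
        if m["m00"] > 0:
            candidates.append((int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])))
    return candidates

def filter_unreachable(candidates, is_reachable):
    """Drop candidates that a prior-knowledge predicate marks unreachable."""
    return [c for c in candidates if is_reachable(c)]

if __name__ == "__main__":
    img = np.zeros((84, 84), dtype=np.uint8)
    cv2.rectangle(img, (30, 30), (40, 40), 255, -1)  # one salient blob
    goals = find_subgoal_candidates(img)
    print(filter_unreachable(goals, lambda c: True))
```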
Funding: supported by the National Natural Science Foundation of China (62003021, 91212304).
Abstract: The guidance strategy is a critical factor in determining the striking effect of a missile operation. A novel guidance law is presented by exploiting deep reinforcement learning (DRL) with the hierarchical deep deterministic policy gradient (DDPG) algorithm. The reward functions are constructed to minimize the line-of-sight (LOS) angle rate and to avoid the threat posed by opposing obstacles. To attenuate chattering of the acceleration command, a hierarchical reinforcement learning structure and an improved reward function with an action penalty are put forward. The simulation results validate that a missile guided by the proposed method can hit the target successfully and keep away from the threatened areas effectively.
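As a rough illustration of the reward shaping described above, the sketch below combines a LOS-rate term, a threat-zone penalty, and an action penalty; the weights and the circular threat model are assumptions for illustration only:

```python
# Sketch: a shaped reward of the type the abstract describes — penalizing
# LOS angle rate, proximity to threat zones, and large acceleration commands.
# Coefficients and the threat model are illustrative assumptions.
import math

def guidance_reward(los_rate, missile_pos, threats, accel_cmd,
                    w_los=1.0, w_threat=5.0, w_action=0.1):
    """threats: list of (center_x, center_y, radius) no-fly circles."""
    r = -w_los * abs(los_rate)                       # drive the LOS rate toward zero
    for cx, cy, radius in threats:
        d = math.hypot(missile_pos[0] - cx, missile_pos[1] - cy)
        if d < radius:
            r -= w_threat * (radius - d) / radius    # inside a threat zone
    r -= w_action * accel_cmd ** 2                   # action penalty against chattering
    return r

print(guidance_reward(0.02, (100.0, 50.0), [(120.0, 60.0, 30.0)], 3.0))
```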
Abstract: This paper proposes a reinforcement learning scheme based on a special hierarchical fuzzy neural network (HFNN) for solving complicated learning tasks in a continuous multi-variable environment. The output of the previous layer in the HFNN is no longer used in the if-part of the next layer, but only in the then-part, so the scheme can cope with cases where the output of the previous layer is meaningless or its meaning is uncertain. The proposed HFNN has a minimal number of fuzzy rules, avoids the rule-combination explosion, and reduces computation and memory requirements. During learning, two HFNNs with the same structure perform fuzzy action composition and evaluation-function approximation simultaneously, with the neural network parameters tuned and updated online by a gradient descent algorithm. The reinforcement learning method is shown to be correct and feasible by simulation of a double inverted pendulum system.
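The layer-composition idea is the distinctive point here, so a small numerical sketch may help: the first layer's output y1 appears only in the second layer's rule consequents (then-part), never in its antecedents (if-part). The membership functions, rule bases, and Takagi-Sugeno-style readout are illustrative assumptions:

```python
# Sketch: the previous layer's output y1 enters the next layer only in the
# then-part, never in the if-part. Rule bases and consequents are illustrative.
import numpy as np

def gaussian_mf(x, center, sigma=0.5):
    return np.exp(-((x - center) ** 2) / (2 * sigma ** 2))

def fuzzy_layer(inputs, centers, consequents):
    """Takagi-Sugeno style layer: weighted average of rule consequents."""
    firing = np.array([np.prod([gaussian_mf(x, c) for x, c in zip(inputs, cs)])
                       for cs in centers])
    return float(firing @ consequents / (firing.sum() + 1e-9))

# Layer 1 fires on (x1, x2); layer 2 fires on x3 only — NOT on y1.
x1, x2, x3 = 0.2, -0.1, 0.4
y1 = fuzzy_layer([x1, x2], [(-1, -1), (0, 0), (1, 1)],
                 consequents=np.array([-1.0, 0.0, 1.0]))

# y1 appears only in the then-part, here as a linear term of each consequent.
consequents2 = np.array([-1.0 + 0.5 * y1, 0.5 * y1, 1.0 + 0.5 * y1])
y2 = fuzzy_layer([x3], [(-1,), (0,), (1,)], consequents2)
print(y1, y2)
```

Because layer 2's antecedents never mention y1, the rule count grows linearly with the number of layers rather than combinatorially with the number of inputs, which is the claimed remedy for rule explosion.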
Funding: supported by the National Natural Science Foundation of China (No. 61673265), the National Key Research and Development Program (No. 2020YFC1512203), and the Shanghai Commercial Aircraft System Engineering Joint Research Fund (No. CASEF-2022-Z05).
Abstract: Based on the option-critic algorithm, a new adversarial algorithm named deterministic policy network with option architecture is proposed to improve an agent's performance against an opponent with a fixed offensive algorithm. An option network is introduced at the upper level to generate an activation signal that selects between defensive and offensive strategies according to the current situation. The lower-level executive layer then works out the interactive action under the guidance of the activation signal, and the values of both the activation signal and the interactive action are evaluated jointly by a critic structure. This method effectively relaxes the requirements of the semi-Markov decision process and simplifies the network structure by eliminating the termination-probability layer. Experimental results show that the new algorithm switches neatly between offensive and defensive strategy styles and acquires more reward from the environment than the classical deep deterministic policy gradient algorithm does.
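A minimal PyTorch sketch of this two-level structure, under assumed layer sizes and a soft activation signal: the upper option network emits the signal, the executive layer conditions its action on it, and a single critic scores state, signal, and action jointly; note there is no termination layer, matching the simplification described above:

```python
# Sketch: upper option network -> activation signal; lower executive layer ->
# action conditioned on the signal; one critic evaluates both jointly.
# Layer sizes and the softmax readout are illustrative assumptions.
import torch
import torch.nn as nn

class OptionActorCritic(nn.Module):
    def __init__(self, state_dim=8, action_dim=2, n_options=2):
        super().__init__()
        self.option_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                        nn.Linear(64, n_options))
        self.executive = nn.Sequential(nn.Linear(state_dim + n_options, 64),
                                       nn.ReLU(), nn.Linear(64, action_dim),
                                       nn.Tanh())
        self.critic = nn.Sequential(nn.Linear(state_dim + n_options + action_dim, 64),
                                    nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state):
        signal = torch.softmax(self.option_net(state), dim=-1)  # activation signal
        action = self.executive(torch.cat([state, signal], dim=-1))
        value = self.critic(torch.cat([state, signal, action], dim=-1))
        return signal, action, value

model = OptionActorCritic()
signal, action, value = model(torch.randn(1, 8))
print(signal.shape, action.shape, value.shape)
```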
Funding: The authors gratefully acknowledge the financial support of the National Natural Science Foundation of China (Grant No. 61703041) and the technological innovation program of Beijing Institute of Technology (2021CX11006).
Abstract: Because intelligent vehicles usually face a complex overtaking process, a safe and efficient automated overtaking system (AOS) is vital for avoiding accidents caused by driver error. Existing AOSs rarely consider the longitudinal reactions of the overtaken vehicle (OV) during overtaking. This paper proposes a novel AOS based on hierarchical reinforcement learning, in which the longitudinal reaction is given by a data-driven social preference estimation. The AOS incorporates two modules that operate in different overtaking phases. The first module, based on a semi-Markov decision process and motion primitives, handles motion planning and control. The second module, based on a Markov decision process, enables the vehicle to make proper decisions according to the social preference of the OV. The proposed AOS and its modules are verified experimentally on realistic overtaking data. The tests show that the proposed AOS realizes safe and effective overtaking in scenes built from realistic data, and can flexibly adjust its lateral driving behavior and lane-changing position when the OVs have different social preferences.
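A toy sketch of the two-module flow, with an assumed preference model and assumed parameters: the upper-level decision classifies the OV's social preference from its longitudinal reaction and selects overtaking parameters, which the lower-level motion-primitive module would then execute. The thresholds and parameter table are purely illustrative stand-ins for the paper's data-driven estimator:

```python
# Sketch: estimate the OV's social preference from its longitudinal reaction,
# then choose overtaking parameters. Thresholds and values are illustrative.

def estimate_social_preference(ov_accel_history):
    """Crude stand-in: yielding OVs decelerate while being overtaken."""
    mean_accel = sum(ov_accel_history) / len(ov_accel_history)
    if mean_accel < -0.2:
        return "yielding"
    if mean_accel > 0.2:
        return "aggressive"
    return "neutral"

def plan_overtake(preference):
    """Upper-level decision: lateral margin (m) and cut-in gap (m) by type."""
    params = {"yielding":   {"lateral_margin": 1.0, "cut_in_gap": 10.0},
              "neutral":    {"lateral_margin": 1.5, "cut_in_gap": 15.0},
              "aggressive": {"lateral_margin": 2.0, "cut_in_gap": 25.0}}
    return params[preference]

pref = estimate_social_preference([-0.4, -0.3, -0.1])
print(pref, plan_overtake(pref))  # the lower module would execute motion primitives
```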
Funding: supported by the National Basic Research Program of China (2013CB329603), the National Natural Science Foundation of China (61375058, 71231002), the China Mobile Research Fund of the Ministry of Education of China (MCM 20130351), and the Special Co-Construction Project of Beijing Municipal Commission of Education.
Abstract: Options are a promising method for discovering hierarchical structure in reinforcement learning (RL) and accelerating learning. The key to option discovery is how an agent can autonomously find useful subgoals among its past trails. By analyzing the agent's actions along these trails, useful heuristics can be found: not only does the agent pass through subgoals more frequently, but its effective actions are also restricted at subgoals. Consequently, subgoals can be identified as the most action-restricted states along the paths. In a grid-world environment, the unique-direction value, which reflects this action-restricted property, is introduced to find the most action-restricted states. The unique-direction-value (UDV) approach forms options autonomously, both offline and online. Experiments show that the approach finds subgoals correctly, and that Q-learning with options found both offline and online accelerates learning significantly.
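A small sketch of an action-restriction statistic in this spirit: states that are visited often and almost always left by the same action score highest and become subgoal candidates. The exact UDV scoring rule in the paper may differ; this version is an illustrative assumption:

```python
# Sketch: score each state by how concentrated the agent's actions were there.
# Frequently visited, action-restricted states are subgoal candidates.
from collections import Counter, defaultdict

def unique_direction_values(trajectories, min_visits=3):
    """trajectories: lists of (state, action) pairs; returns {state: score}."""
    actions_at = defaultdict(Counter)
    for traj in trajectories:
        for state, action in traj:
            actions_at[state][action] += 1
    scores = {}
    for state, counts in actions_at.items():
        visits = sum(counts.values())
        if visits >= min_visits:
            # fraction of visits that used the single dominant action
            scores[state] = counts.most_common(1)[0][1] / visits
    return scores

trajs = [[((1, 1), "right"), ((2, 1), "up")],
         [((0, 1), "right"), ((2, 1), "up")],
         [((2, 0), "down"), ((2, 1), "up")]]
scores = unique_direction_values(trajs)
print(max(scores, key=scores.get))  # (2, 1): a doorway-like, action-restricted state
```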
Abstract: Temporal abstraction, an important research topic in hierarchical reinforcement learning, allows a hierarchical agent to learn policies at different time scales and can effectively address the sparse-reward problem that deep reinforcement learning struggles with. Learning good temporal abstractions end to end has long been a challenge for hierarchical reinforcement learning research. Building on the Option framework, the Option-Critic (OC) framework solves this problem effectively via the policy gradient theorem. However, during policy learning, the OC framework suffers from a degeneration problem in which the action distributions of the intra-option policies become very similar. This degeneration hurts the experimental performance of the OC framework and makes the learned options less interpretable. To address this problem, mutual information is introduced as an intrinsic reward, and an Option-Critic algorithm with mutual information optimization (MIOOC) is proposed. MIOOC is combined with the proximal policy Option-Critic (PPOC) algorithm to guarantee the diversity of the lower-level policies. To verify the effectiveness of the algorithm, MIOOC is compared with several common reinforcement learning methods in continuous-control environments. The experimental results show that MIOOC speeds up model learning, performs better, and learns more distinguishable intra-option policies.
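A sketch of a mutual-information intrinsic reward of the kind MIOOC adds, under the common formulation r_int = log q(o | s') - log p(o) with a learned option discriminator q and a uniform option prior; both are assumptions here rather than the paper's exact construction:

```python
# Sketch: an MI bonus that rewards options whose visited states make the
# active option identifiable, pushing intra-option policies apart.
# The discriminator and the uniform prior are illustrative assumptions.
import torch
import torch.nn as nn

n_options, state_dim = 4, 8
discriminator = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                              nn.Linear(64, n_options))  # q(o | s')

def mi_intrinsic_reward(next_state, option, beta=0.1):
    log_q = torch.log_softmax(discriminator(next_state), dim=-1)[option]
    log_p = torch.log(torch.tensor(1.0 / n_options))  # uniform option prior
    return beta * (log_q - log_p)

r_int = mi_intrinsic_reward(torch.randn(state_dim), option=2)
print(float(r_int))  # added to the environment reward during option learning
```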
Abstract: This paper develops a novel hierarchical control strategy for improving the trajectory tracking capability of aerial robots under parameter uncertainties. The hierarchical control strategy is composed of an adaptive sliding mode controller and a model-free iterative sliding mode controller (MFISMC). A position controller is designed based on adaptive sliding mode control (SMC) to safely drive the aerial robot and ensure fast state convergence under external disturbances. Additionally, the MFISMC acts as an attitude controller that estimates the unmodeled dynamics without detailed knowledge of the aerial robot. The adaptation laws are derived with Lyapunov theory to guarantee asymptotic tracking of the system state. Finally, to demonstrate the performance and robustness of the proposed control strategy, numerical simulations are carried out and compared with conventional strategies such as proportional-integral-derivative (PID) control, backstepping (BS), and SMC. The simulation results indicate that the proposed hierarchical control strategy achieves zero steady-state error and faster convergence than the conventional strategies.
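A one-dimensional sketch of an adaptive sliding mode loop of this general type, with an assumed double-integrator plant, boundary-layer saturation instead of a pure sign function, and the adaptation law k_dot = gamma*|s|; all gains are illustrative and this is not the paper's controller:

```python
# Sketch: sliding surface s = de + lam*e, control u = -lam*de - k*sat(s/phi),
# adaptive gain k_dot = gamma*|s|. Plant: double integrator with unknown
# damping 0.3*v standing in for parameter uncertainty. Gains are illustrative.

def simulate(steps=5000, dt=0.001, lam=4.0, gamma=20.0, phi=0.1):
    x, v, k = 0.0, 0.0, 1.0                   # position, velocity, adaptive gain
    x_ref = 1.0                                # step reference
    for _ in range(steps):
        e, de = x - x_ref, v
        s = de + lam * e                       # sliding surface
        sat = max(-1.0, min(1.0, s / phi))     # boundary layer cuts chattering
        u = -lam * de - k * sat                # equivalent term + switching term
        k += gamma * abs(s) * dt               # Lyapunov-style adaptation law
        a = u + 0.3 * v                        # plant with unmodeled damping
        v += a * dt
        x += v * dt
    return round(x, 4), round(k, 2)

print(simulate())  # position should settle near the 1.0 reference
```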
Abstract: Modern flight procedure design is affected by terrain, obstacles, airspace, flight performance, and other factors, and the design process requires extensive evaluation of the validity of design details. Once designed, a flight procedure must still be flight-tested by professional test pilots in simulators and real aircraft, at great human and economic cost. Without targeted analysis and evaluation before flight testing, test costs increase and safety hazards remain in the real-aircraft test phase. To address these problems, this work uses deep reinforcement learning to propose an automatic departure-track generation method for verifying the validity and feasibility of flight procedures while satisfying flight procedure design specifications. First, aerodynamic principles are used to build a basic flight dynamics model that accounts for flight performance and obstacle clearance, and a three-dimensional visual training platform is built with the Unity3D engine. Second, within the PyTorch deep learning framework, the ML-Agents reinforcement learning platform is used to construct flight-test training models for each flight phase, with scenes and reward functions designed for four objectives: takeoff, turning, cruise, and landing. Taking a departure-procedure flight test as an example, a PBN (Performance Based Navigation) departure procedure at Xiamen Gaoqi Airport is used for training and validation, and the dynamic time warping (DTW) distance is used to quantify the deviation between the generated track and the nominal track. The experimental results show that the deviation satisfies the constraints of the flight procedure's obstacle-protection areas. Results of the trained model on other departure procedures also verify its good generalization ability.
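The DTW distance used above to score how far a generated track deviates from the nominal track is the classic dynamic program; a minimal sketch over 2-D track points follows, where the Euclidean local cost is the usual choice and an assumption here:

```python
# Sketch: classic O(n*m) dynamic-time-warping distance between two tracks.
import math

def dtw_distance(track_a, track_b):
    """track_a, track_b: lists of (x, y) points; returns accumulated DTW cost."""
    n, m = len(track_a), len(track_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(track_a[i - 1], track_b[j - 1])  # Euclidean local cost
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

nominal = [(0, 0), (1, 1), (2, 2), (3, 3)]
flown = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 3.0)]
print(dtw_distance(nominal, flown))  # compare against the protection-area bound
```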