Funding: Supported by the National Natural Science Foundation of China (No. 41971356, 41671400, 41701446), the National Key Research and Development Program of China (No. 2017YFB0503600, 2018YFB0505500), and the Hubei Province Natural Science Foundation of China (No. 2017CFB277).
Abstract: With the increasing complexity of the composition process and the rapid growth of candidate services, realizing optimal or near-optimal service composition is an urgent problem. Static service composition chains are rigid and cannot easily adapt to the dynamic Web environment. To address these challenges, the geographic information service composition (GISC) problem is modeled as a sequential decision-making task, and the Markov decision process (MDP), a universal model for agent planning problems, is used to describe it. Then, to achieve self-adaptivity and optimization in a dynamic environment, a novel approach that integrates Monte Carlo tree search (MCTS) with temporal-difference (TD) learning is proposed. The concrete services realizing each abstract service are determined at runtime, with optimal policies and adaptive capability, based on the environment and the status of the component services. Simulation experiments demonstrate the effectiveness and efficiency of the approach in terms of learning quality and performance.
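As a rough illustration of the TD side of such an approach (the MCTS component is omitted here), the sketch below learns which concrete service to bind to each abstract service from simulated QoS feedback. This is a minimal sketch, not the paper's method: CANDIDATES, qos_reward, and the reward scale are hypothetical placeholders.

```python
# Hedged sketch: TD(0) learning of a service-composition policy.
# State = index of the current abstract service; action = one candidate concrete service.
import random
from collections import defaultdict

CANDIDATES = {
    0: ["geocode_A", "geocode_B"],          # abstract service 0 (hypothetical names)
    1: ["route_A", "route_B", "route_C"],   # abstract service 1 (hypothetical names)
}

def qos_reward(state, service):
    """Hypothetical QoS feedback (e.g. scaled response quality) observed at runtime."""
    return random.uniform(0.0, 1.0)

def td0_compose(episodes=500, alpha=0.1, gamma=0.9, eps=0.2):
    V = defaultdict(float)                  # state values (state = abstract-service index)
    Q = defaultdict(float)                  # preferences for (state, concrete service)
    for _ in range(episodes):
        for s in sorted(CANDIDATES):
            # epsilon-greedy selection among candidate concrete services
            if random.random() < eps:
                a = random.choice(CANDIDATES[s])
            else:
                a = max(CANDIDATES[s], key=lambda c: Q[(s, c)])
            r = qos_reward(s, a)
            # TD(0) backup toward the bootstrapped target r + gamma * V(s')
            target = r + gamma * V[s + 1]
            V[s] += alpha * (target - V[s])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    # return the greedy composition plan learned so far
    return {s: max(CANDIDATES[s], key=lambda c: Q[(s, c)]) for s in CANDIDATES}

print(td0_compose())
```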
Abstract: A promising approach to learning to play board games is to use reinforcement learning algorithms that learn a game-position evaluation function. In this paper we examine and compare three methods for generating training games: 1) learning by self-play, 2) learning by playing against an expert program, and 3) learning by viewing experts play against each other. Although the third method generates high-quality games from the start, compared with the initially random games generated by self-play, its drawback is that the learning program is never allowed to test the moves it prefers. Since our expert program uses an evaluation function similar to that of the learning program, we also examine whether it is helpful to learn directly from the board evaluations given by the expert. We compare these methods using temporal-difference learning with neural networks on the game of backgammon.
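A minimal sketch of TD(0) training of a position-evaluation function, under stated assumptions: a toy linear model stands in for the neural network, and the 64-feature position encoding is a placeholder rather than a real backgammon encoding. The three training regimes described above differ only in which policy produced the sequence of positions.

```python
# Hedged sketch: TD(0) update of a toy position evaluator toward game outcomes.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=64)          # toy linear "network" over 64 position features

def evaluate(x):
    """Predicted probability that the side to move eventually wins."""
    return 1.0 / (1.0 + np.exp(-W @ x))

def td_update(positions, outcome, alpha=0.05):
    """positions: encoded positions from one game, in order; outcome: 1 win / 0 loss."""
    global W
    for t in range(len(positions)):
        x = positions[t]
        v = evaluate(x)
        # target is the next prediction, or the true outcome at the end of the game
        target = outcome if t == len(positions) - 1 else evaluate(positions[t + 1])
        W += alpha * (target - v) * v * (1.0 - v) * x   # sigmoid-gradient TD(0) step

# Self-play, play against an expert, and observed expert-vs-expert games would all
# feed into td_update; only the move-selection policy that generated `positions` changes.
game = [rng.random(64) for _ in range(10)]   # stand-in for encoded backgammon positions
td_update(game, outcome=1.0)
```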
Abstract: To meet the needs of practical large-scale Markov systems, this paper studies simulation-based learning optimization for Markov decision processes (MDPs). Starting from the defining equation, a unified temporal-difference formula for the performance potential under both the average-reward and discounted-reward criteria is established. A neural network is used to represent the estimate of the performance potential, and a parametric TD(0) learning formula and algorithm are derived for approximate policy evaluation. Based on the approximated performance potential, approximate policy iteration then yields a neuro-dynamic programming (NDP) optimization method that is unified across the two criteria. The results also apply to semi-Markov decision processes. A numerical example shows that the proposed neural policy iteration algorithm works under both criteria and verifies that the average-reward problem is the limiting case of the discounted problem as the discount factor tends to zero.
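A minimal sketch of one plausible rendering of such a unified TD(0) update for the performance potential; the paper's exact formula and its neural parameterization may differ, and the simulator, states, and tabular estimates below are assumptions used only to make the idea concrete.

```python
# Hedged sketch: unified TD(0) update for the performance potential.
# beta > 0 plays the role of a discount rate; beta -> 0 recovers the average-reward case.
import random

STATES = range(3)

def step(s):
    """Hypothetical single-step simulator: returns (next state, reward)."""
    return random.choice(STATES), random.uniform(0.0, 1.0)

def td0_potential(beta=0.0, steps=20000, alpha=0.01):
    g = {s: 0.0 for s in STATES}       # tabular stand-in for the neural approximator
    eta = 0.0                          # running estimate of the average reward
    s = 0
    for _ in range(steps):
        s_next, r = step(s)
        eta += alpha * (r - eta)
        # one plausible unified temporal difference: with discount factor 1/(1+beta),
        # letting beta -> 0 gives the standard average-reward potential update
        delta = r - eta + g[s_next] / (1.0 + beta) - g[s]
        g[s] += alpha * delta
        s = s_next
    return g, eta

print(td0_potential(beta=0.05))
```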
Abstract: To address the frequent communication and excessive energy consumption of existing cooperative learning algorithms, a task model is built around target tracking and a sensor-node task-scheduling algorithm based on Q-learning and TD error (QT) is proposed. Specifically, the sensor-node task-scheduling problem is mapped onto a learning problem solvable by Q-learning, a cooperation mechanism among neighboring nodes is established, and basic learning elements such as the delayed reward and the state space are defined. Within the cooperation mechanism, QT lets each sensor node use both its individual TD error and the group TD error to dynamically adjust its own learning rate, balancing its own interest against the group interest. In addition, QT raises the exploration probability in the early learning phase according to the Metropolis criterion, improving task selection. Experimental results show that QT can dynamically schedule tasks according to the current environment and that, compared with other task-scheduling algorithms, QT consumes a reasonable amount of energy while improving per-unit performance by 17.26%.
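A minimal sketch of the two mechanisms named above: a Q-learning step whose learning rate is modulated by the node's own TD error relative to a neighbourhood TD error, and Metropolis-style acceptance of a worse task early in learning. The action set, the specific learning-rate rule, and all names here are assumptions, not the paper's definitions.

```python
# Hedged sketch: QT-style Q-learning step with TD-error-scaled learning rate
# and Metropolis-criterion exploration.
import math
import random

ACTIONS = ["sense", "transmit", "sleep"]     # hypothetical node tasks

def metropolis_accept(q_best, q_candidate, temperature):
    """Accept a lower-valued task with probability exp((q_candidate - q_best) / T)."""
    if q_candidate >= q_best:
        return True
    return random.random() < math.exp((q_candidate - q_best) / max(temperature, 1e-6))

def qt_update(Q, state, action, reward, next_state, own_td, neighbour_td,
              base_alpha=0.1, gamma=0.9):
    """One Q-learning step; the learning rate shrinks when the neighbourhood's TD error
    dominates the node's own, one way to trade off self interest against group interest."""
    td_error = reward + gamma * max(Q[next_state].values()) - Q[state][action]
    alpha = base_alpha * abs(td_error) / (abs(td_error) + abs(neighbour_td) + 1e-6)
    Q[state][action] += alpha * td_error
    own_td.append(td_error)
    return td_error

# tiny usage example
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(2)}
history = []
qt_update(Q, 0, "sense", reward=1.0, next_state=1, own_td=history, neighbour_td=0.5)
```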
Funding: Supported by the National Natural Science Foundation of China (No. 61773170, 62173151) and the Natural Science Foundation of Guangdong Province (No. 2023A1515010949, 2021A1515011850).
Abstract: Differential signals are key in control engineering because they anticipate the future behavior of process variables and are therefore critical in formulating control laws such as proportional-integral-derivative (PID) control. The practical challenge, however, is to extract such signals from noisy measurements; this difficulty was first addressed by J. Han in the form of linear and nonlinear tracking differentiators (TD). While improvements have been made, the TD does not completely resolve the conflict between noise sensitivity and the accuracy and timeliness of the differentiation. The two approaches proposed in this paper start from the basic linear TD but apply an iterative learning mechanism to the historical data in a moving window (MW), forming two new iterative learning tracking differentiators (IL-TD): a parallel IL-TD that uses an iterative ladder network structure implementable in analog circuits, and a serial IL-TD that is implementable digitally on any computer platform. Both algorithms are validated in simulations, which show that the proposed IL-TDs achieve better tracking-differentiation and de-noising performance than the existing linear TD.
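A minimal sketch of the basic linear tracking differentiator that the paper starts from, with illustrative gain and step values; the IL-TDs' iterative learning pass over a moving window of past samples is not reproduced here.

```python
# Hedged sketch: discrete-time linear tracking differentiator (Euler-integrated).
import numpy as np

def linear_td(v, h=0.01, r=50.0):
    """x1 tracks the measured signal v; x2 is the estimated derivative dv/dt."""
    x1, x2 = float(v[0]), 0.0
    track, deriv = [], []
    for vk in v:
        # second-order linear dynamics driven by the tracking error (x1 - vk)
        a = -r * r * (x1 - vk) - 2.0 * r * x2
        x1, x2 = x1 + h * x2, x2 + h * a
        track.append(x1)
        deriv.append(x2)
    return np.array(track), np.array(deriv)

# demo on a noisy sine wave; the true derivative is cos(t)
t = np.arange(0.0, 5.0, 0.01)
noisy = np.sin(t) + 0.01 * np.random.randn(t.size)
_, d_est = linear_td(noisy)
```

Larger r tracks the derivative more quickly but amplifies measurement noise; the moving-window iterative learning described in the abstract is aimed at easing exactly this trade-off.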