Journal Articles — 3,508 articles found
1. Team-based fixed-time containment control for multi-agent systems with disturbances
Authors: 赵小文, 王进月, 赖强, 刘源. Chinese Physics B (SCIE, EI, CAS, CSCD), 2023, Issue 12, pp. 281-292 (12 pages).
We investigate the fixed-time containment control (FCC) problem of multi-agent systems (MASs) under discontinuous communication. A saturation function is used in the controller to achieve the containment control in MASs. One difference from using a symbolic function is that it avoids the differential calculation process for discontinuous functions, which further ensures the continuity of the control input. Considering the discontinuous communication, a dynamic variable is constructed, which is always non-negative between any two communications of the agent. Based on the designed variable, the dynamic event-triggered algorithm is proposed to achieve FCC, which can effectively reduce controller updating. In addition, we further design a new event-triggered algorithm to achieve FCC, called the team-trigger mechanism, which combines the self-triggering technique with the proposed dynamic event-trigger mechanism. It has faster convergence than the proposed dynamic event-triggering technique and achieves the tradeoff between communication cost, convergence time and number of triggers in MASs. Finally, Zeno behavior is excluded and the validity of the proposed theory is confirmed by simulation.
Keywords: fixed-time containment control; dynamic event-triggered strategy; team-based triggered strategy; multi-agent systems
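Where a sign function would make the control input discontinuous, the saturation-function idea the abstract describes can be sketched as follows (a minimal illustration; `eps` is an assumed boundary-layer width, not a value from the paper):

```python
import numpy as np

def sat(x, eps=0.05):
    """Continuous replacement for sign(x): linear inside a boundary layer of
    width eps, saturating at +/-1 outside it, so the control input stays
    continuous instead of switching discontinuously near zero error."""
    return np.clip(x / eps, -1.0, 1.0)
```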
2. Evolutionary dynamics of tax-based strong altruistic reward and punishment in a public goods game
Authors: Zhi-Hao Yang, Yan-Long Yang. Chinese Physics B (SCIE, EI, CAS, CSCD), 2024, Issue 9, pp. 247-257 (11 pages).
In public goods games, punishments and rewards have been shown to be effective mechanisms for maintaining individual cooperation. However, punishments and rewards are costly ways to incentivize cooperation, so generating these costly penalties and rewards has been a difficult problem in promoting the development of cooperation. In real societies, specialized institutions exist that punish wrongdoers or reward good actors by collecting taxes. Motivated by this phenomenon, we propose a tax-based strong altruistic punishment (reward) strategy in the public goods game. Through theoretical analysis and numerical calculation, we find that tax-based strong altruistic punishment (reward) has more evolutionary advantages than traditional strong altruistic punishment (reward) in maintaining cooperation, and that tax-based strong altruistic reward leads to a higher level of cooperation than tax-based strong altruistic punishment.
Keywords: evolutionary game theory; strong altruism; punishment; reward
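The tax-funded institution the abstract describes can be illustrated with a one-round payoff calculation (a hedged sketch under assumed parameter values, not the paper's model):

```python
def pgg_payoffs(n_coop, n_total, r=3.0, c=1.0, tax=0.1, fine=0.6, bonus=0.5):
    """One round of a public goods game with a tax-funded institution:
    every player pays `tax`, cooperators contribute `c`, the pot is
    multiplied by `r` and shared equally; the institution fines defectors
    and rewards cooperators out of the tax revenue."""
    share = r * c * n_coop / n_total          # equal share of the public good
    payoff_coop = share - c - tax + bonus     # cooperator: contributes, gets bonus
    payoff_defect = share - tax - fine        # defector: free-rides, gets fined
    return payoff_coop, payoff_defect

print(pgg_payoffs(n_coop=3, n_total=5))  # with these values, cooperation pays more
```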
3. Evolutionary analysis of green credit and automobile enterprises under a dynamic reward and punishment mechanism based on government regulation
Authors: Yu Dong, Xiaoyu Huang, Hongan Gan, Xuyang Liu. 《中国科学技术大学学报》 (CAS, CSCD, PKU Core), 2024, Issue 5, pp. 49-62, I0007 (15 pages).
To explore the green development of automobile enterprises and promote the achievement of the "dual carbon" target, based on bounded rationality assumptions, this study constructed a tripartite evolutionary game model of government, commercial banks, and automobile enterprises; introduced a dynamic reward and punishment mechanism; and analyzed the development process of the three parties' strategic behavior under static and dynamic reward and punishment mechanisms. Vensim PLE was used for numerical simulation analysis. Our results indicate that the system could not reach a stable state under the static reward and punishment mechanism. A dynamic reward and punishment mechanism can effectively improve system stability and better fit real situations. Under the dynamic reward and punishment mechanism, an increase in the initial probabilities of the three parties can promote system stability, and the government can implement effective supervision by adjusting the upper limit of the reward and punishment intensity. Finally, the implementation of green credit by commercial banks plays a significant role in promoting the green development of automobile enterprises.
Keywords: automobile enterprises; green credit; system dynamics; reward and punishment mechanism
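The tripartite replicator dynamics such a model evolves under can be sketched generically (a minimal Euler-integration sketch; `payoff_diffs` stands in for the paper's model-specific expected-payoff expressions):

```python
import numpy as np

def replicator_step(p, payoff_diffs, dt=0.01):
    """One Euler step of three coupled two-strategy replicator equations:
    p[i] is the probability that player i (government, bank, enterprise)
    plays its first strategy; payoff_diffs(p) returns that strategy's
    expected-payoff advantage for each player."""
    p = np.asarray(p, dtype=float)
    d = np.asarray(payoff_diffs(p), dtype=float)
    return np.clip(p + dt * p * (1.0 - p) * d, 0.0, 1.0)

# Toy payoff differences just to exercise the integrator (not the paper's model)
p = np.array([0.4, 0.5, 0.6])
for _ in range(1000):
    p = replicator_step(p, lambda q: np.array([0.5 - q[2], 0.2, q[0] - 0.3]))
print(p)
```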
4. Improved Double Deep Q Network Algorithm Based on Average Q-Value Estimation and Reward Redistribution for Robot Path Planning
Authors: Yameng Yin, Lieping Zhang, Xiaoxu Shi, Yilin Wang, Jiansheng Peng, Jianchu Zou. Computers, Materials & Continua (SCIE, EI), 2024, Issue 11, pp. 2769-2790 (22 pages).
By integrating deep neural networks with reinforcement learning, the Double Deep Q Network (DDQN) algorithm overcomes the limitations of Q-learning in handling continuous spaces and is widely applied in the path planning of mobile robots. However, the traditional DDQN algorithm suffers from sparse rewards and inefficient utilization of high-quality data. Targeting these problems, an improved DDQN algorithm based on average Q-value estimation and reward redistribution is proposed. First, to enhance the precision of the target Q-value, the average of multiple previously learned Q-values from the target Q network is used to replace the single Q-value from the current target Q network. Next, a reward redistribution mechanism is designed to overcome the sparse-reward problem by adjusting the final reward of each action using the round reward from trajectory information. Additionally, a reward-prioritized experience selection method is introduced, which ranks experience samples by reward value to ensure frequent utilization of high-quality data. Finally, simulation experiments are conducted to verify the effectiveness of the proposed algorithm in a fixed-position scenario and in random environments. The experimental results show that, compared with the traditional DDQN algorithm, the proposed algorithm achieves shorter average running time, higher average return and fewer average steps, with performance improved by 11.43% in the fixed scenario and 8.33% in random environments. It not only plans economical and safe paths but also significantly improves efficiency and generalization in path planning, making it suitable for widespread application in autonomous navigation and industrial automation.
Keywords: Double Deep Q Network; path planning; average Q-value estimation; reward redistribution mechanism; reward-prioritized experience selection method
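The average Q-value estimation step the abstract describes, replacing the single target-network estimate with the mean over several previously saved target networks, might look like the following sketch (network snapshots are assumed to be callables returning a Q-vector for a state):

```python
import numpy as np

def averaged_target(reward, next_state, done, online_q, target_snapshots, gamma=0.99):
    """DDQN-style target with averaged Q-value estimation: the online network
    selects the greedy action, and that action's value is averaged over K
    historical target-network snapshots instead of read from a single target."""
    a_star = int(np.argmax(online_q(next_state)))                        # action selection
    q_avg = np.mean([q(next_state)[a_star] for q in target_snapshots])  # averaged evaluation
    return reward + (1.0 - float(done)) * gamma * q_avg
```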
5. Efficient Optimal Routing Algorithm Based on Reward and Penalty for Mobile Adhoc Networks
Authors: Anubha, Ravneet Preet Singh Bedi, Arfat Ahmad Khan, Mohd Anul Haq, Ahmad Alhussen, Zamil S. Alzamil. Computers, Materials & Continua (SCIE, EI), 2023, Issue 4, pp. 1331-1351 (21 pages).
Mobile adhoc networks have grown in prominence in recent years and are now utilized in a broader range of applications. The main challenges relate to the routing techniques generally employed in them, and mobile adhoc network management requires further testing and improvement in terms of security. Traditional routing protocols, such as Adhoc On-Demand Distance Vector (AODV) and Dynamic Source Routing (DSR), employ the hop count to calculate the distance between two nodes. The main aim of this research is to determine the optimal method for sending packets while also extending the lifetime of the network, which is achieved by accounting for the residual energy of each network node. Several algorithms for optimal routing based on parameters such as energy, distance, mobility, and pheromone value are proposed, together with an approach based on a reward and penalty system for evaluating the efficiency of the proposed algorithms under the impact of these parameters. The simulation results unveil that the reward-penalty-based approach is quite effective for selecting an optimal routing path, helping to achieve fewer packet drops and lower node energy consumption while enhancing network efficiency.
Keywords: routing; optimization; reward; penalty; mobility; energy; throughput; pheromone
6. Magnetic Field-Based Reward Shaping for Goal-Conditioned Reinforcement Learning
Authors: Hongyu Ding, Yuanze Tang, Qing Wu, Bo Wang, Chunlin Chen, Zhi Wang. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2023, Issue 12, pp. 2233-2247 (15 pages).
Goal-conditioned reinforcement learning (RL) is an interesting extension of the traditional RL framework, where the dynamic environment and reward sparsity can cause conventional learning algorithms to fail. Reward shaping is a practical approach to improving sample efficiency by embedding human domain knowledge into the learning process. Existing reward shaping methods for goal-conditioned RL are typically built on distance metrics with a linear and isotropic distribution, which may fail to provide sufficient information about the ever-changing environment with high complexity. This paper proposes a novel magnetic field-based reward shaping (MFRS) method for goal-conditioned RL tasks with dynamic targets and obstacles. Inspired by the physical properties of magnets, we consider the target and obstacles as permanent magnets and establish the reward function according to the intensity values of the magnetic field generated by these magnets. The nonlinear and anisotropic distribution of the magnetic field intensity can provide more accessible and conducive information about the optimization landscape, thus introducing a more sophisticated magnetic reward compared to the distance-based setting. Further, we transform our magnetic reward into the form of potential-based reward shaping by learning a secondary potential function concurrently, to ensure the optimal policy invariance of our method. Experimental results in both simulated and real-world robotic manipulation tasks demonstrate that MFRS outperforms relevant existing methods and effectively improves the sample efficiency of RL algorithms in goal-conditioned tasks with various dynamics of the target and obstacles.
Keywords: dynamic environments; goal-conditioned reinforcement learning; magnetic field; reward shaping
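The final transformation the abstract mentions, casting the magnetic reward as potential-based shaping so the optimal policy is unchanged, reduces to the standard shaping form (a minimal sketch; `phi` stands for the concurrently learned potential function):

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based reward shaping: adding F(s, s') = gamma*phi(s') - phi(s)
    to the environment reward provably preserves the optimal policy, so the
    magnetic reward can be used without biasing what the agent converges to."""
    return r + gamma * phi(s_next) - phi(s)
```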
7. An optimized guidance strategy for electric vehicle charging considering a reward mechanism (考虑奖励机制的电动汽车充电优化引导策略). Cited 2 times.
Authors: 张建宏, 赵兴勇, 王秀丽. 《电网与清洁能源》 (CSCD, PKU Core), 2024, Issue 1, pp. 102-108, 118 (8 pages).
With the large-scale adoption of electric vehicles (EVs), their uncoordinated charging seriously threatens the safe and stable operation of the power grid, so actively guiding EV users to participate in charging-optimization strategies is of great significance for grid security and stability. Based on the idea of optimized charging management and dispatch, this paper proposes an EV charging guidance strategy that considers a reward mechanism: on top of time-of-use electricity prices, it incorporates a reward for users who help reduce grid load fluctuation, takes into account the travel demands of users with fixed and with uncertain charging locations, and determines each EV's charging time and location so as to maximize user satisfaction. A real-time optimization algorithm with dynamic EV response is used to solve the proposed dispatch model. Simulation results verify the effectiveness and feasibility of the strategy: it not only alleviates the new load peak created by concentrated charging during load valleys, but also markedly reduces users' charging costs and grid load fluctuation.
Keywords: electric vehicles; charging control; load fluctuation; reward mechanism; optimized guidance strategy
8. Low-carbon economic dispatch of a park integrated energy system under a seasonal carbon trading mechanism (基于季节性碳交易机制的园区综合能源系统低碳经济调度). Cited 9 times.
Authors: 颜宁, 马广超, 李相俊, 李洋, 马少华. 《中国电机工程学报》 (EI, CSCD, PKU Core), 2024, Issue 3, pp. 918-931, I0006 (15 pages).
To allocate carbon emission quotas more reasonably and avoid the aggravated environmental pollution that results when annual settlement reveals emissions exceeding quota, this paper proposes a seasonal carbon trading mechanism based on reward and penalty factors and applies it to the low-carbon economic dispatch of a park integrated energy system (PIES). First, an integrated energy system (IES) operating framework comprising an energy layer, a carbon-flow layer, and a management layer is built, and a dynamic consistency model of multi-energy (electricity, gas, heat) supply and demand is established. Second, the system's daily, seasonal, and annual carbon emission characteristics are analyzed; departing from the traditional indicator-based allocation method, a quota allocation model is built using grey relational analysis, and a seasonal carbon trading mechanism is designed on a reward-penalty ladder carbon price. Finally, with the objective of minimizing whole-life-cycle operating cost plus carbon trading cost, low-carbon economic dispatch is performed for the PIES under the seasonal mechanism, and the emission reduction contributed by seasonal energy storage over long time scales is analyzed. A PIES coupling an IEEE 33-node power network, a 5-node gas network, and a 7-node heat network is built, and multi-scenario case studies verify that the dispatch method achieves zero-carbon economic operation while maintaining supply reliability, laying a theoretical foundation for zero-carbon parks.
Keywords: park integrated energy system; seasonal carbon trading mechanism; reward-penalty ladder carbon price; grey relational analysis
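A reward-penalty ladder carbon price of the kind the paper builds its seasonal mechanism on can be sketched as follows (tier width, base price, and growth rate are illustrative assumptions, not the paper's parameters):

```python
def ladder_carbon_cost(emission, quota, base_price=0.25, tier_width=1000.0, growth=0.25):
    """Reward-penalty ladder carbon trading cost: emissions above quota are
    charged tier by tier at prices rising by `growth` per tier; emissions
    below quota earn a symmetric, tiered reward (returned as negative cost)."""
    gap = emission - quota
    sign = 1.0 if gap > 0 else -1.0
    remaining, tier, cost = abs(gap), 0, 0.0
    while remaining > 0:
        amount = min(remaining, tier_width)
        cost += base_price * (1.0 + growth * tier) * amount
        remaining -= amount
        tier += 1
    return sign * cost  # positive: purchase cost; negative: reward for surplus quota

print(ladder_carbon_cost(emission=4500.0, quota=2000.0))
```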
9. Effectiveness of Reward System on Assessment Outcomes in Mathematics
Author: May Semira Inandan. Journal of Contemporary Educational Research, 2023, Issue 9, pp. 52-58 (7 pages).
As assessment outcomes provide students with a sense of accomplishment that is boosted by a reward system, learning becomes more effective. This research aims to determine the effects of a reward system applied prior to assessment in Mathematics. A quasi-experimental research design was used to examine whether there was a significant difference between the use of a reward system and students' level of performance in Mathematics. Through purposive sampling, the respondents of the study comprised 80 Grade 9 students from two sections of Gaudencio B. Lontok Memorial Integrated School. Based on similar demographics and pre-test results, a control group and a study group participated in the study. Data were treated and analyzed using statistical treatments such as the mean and the t-test for independent samples. A significant finding revealed the advantage of the reward system over a non-reward system in increasing students' level of performance in Mathematics. It is concluded that the use of a reward system is effective in improving assessment outcomes in Mathematics, and its use prior to assessment is recommended so that outcomes consistently reflect the intended learning outcomes in Mathematics.
Keywords: Mathematics; reward system; assessment outcomes
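The statistical treatment the study reports, a t-test for independent samples on the two groups' scores, is a one-liner in common tooling (the scores below are made-up placeholders, not the study's data):

```python
from scipy import stats

reward_group = [88, 92, 85, 90, 87, 91, 84, 89]   # placeholder post-test scores
control_group = [80, 78, 83, 76, 81, 79, 82, 77]

t, p = stats.ttest_ind(reward_group, control_group, equal_var=True)
print(f"t = {t:.2f}, p = {p:.4f}")  # p < 0.05 -> significant difference in means
```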
10. Reward and punishment mechanisms for collusion by livestream e-commerce platforms under government regulation (政府监管下直播带货平台合谋行为的奖惩机制研究). Cited 5 times.
Authors: 李国昊, 梅婷, 梁永滔. 《江苏大学学报(社会科学版)》 (CSSCI), 2024, Issue 2, pp. 100-112 (13 pages).
The "livestream + e-commerce" retail model is growing rapidly, but many problems accompany it. Considering the phenomenon of livestream platforms colluding with platform merchants to earn excess profits, this paper builds and analyzes evolutionary game models of livestream platforms and government regulators under different reward and punishment mechanisms, reaching the following conclusions. Under the static reward-static punishment and the dynamic reward-static punishment mechanisms, the system has no stable equilibrium point; under the static reward-dynamic punishment and the dynamic reward-dynamic punishment mechanisms, a stable equilibrium exists, with the probability of collusion lower under the fully dynamic mechanism. Under the dynamic mechanism, collusion between platforms and merchants depends on the intensity of rewards and punishments: as punishment intensity increases, the probability of collusion falls and government regulatory costs decrease; as reward intensity increases, the probability of strict government regulation falls while the collusion probability declines only slightly. A scientifically designed dynamic reward and punishment mechanism therefore supports the healthy development of the livestream e-commerce industry.
Keywords: livestream e-commerce platforms; reward and punishment mechanism; evolutionary game; collusion
11. Mechanism design and evolutionary game analysis of collaborative governance of trans-boundary watershed water pollution from the perspective of ecological compensation (生态补偿视角下流域跨界水污染协同治理机制设计及演化博弈分析). Cited 1 time.
Authors: 杨霞, 何刚, 吴传良, 张世玉. 《安全与环境学报》 (CAS, CSCD, PKU Core), 2024, Issue 5, pp. 2033-2042 (10 pages).
For the tripartite game among two neighboring regions in a river basin and the basin management authority, a two-way ecological compensation and reward-punishment mechanism is introduced, and a tripartite evolutionary game model of trans-boundary water pollution is constructed. Stability analysis yields the conditions under which collaborative governance reaches its ideal stable state, and a simulation is carried out on the Xin'an River basin ecological compensation pilot. The results show that: (1) introducing the two-way ecological compensation and reward-punishment mechanism effectively pushes the two neighboring regions toward compliant discharge, driving the system to the (1, 1, 0) stable state; (2) combined use of dynamic reward and punishment mechanisms aids system evolution, and considering the players' initial willingness, implementation efficacy, and support tendencies, the dynamic reward with static punishment strategy gives the best regulatory effect, followed by dynamic reward with dynamic punishment; (3) achieving collaborative governance is closely tied to the two regions' costs and benefits of compliant discharge, the amount of two-way ecological compensation, the compensation bonus issued by the basin management authority, and the costs and benefits of active regulation.
Keywords: environmental science; water pollution; evolutionary game; ecological compensation; dynamic reward and punishment; collaborative governance
12. Path planning research with an improved ant colony algorithm (一种改进蚁群算法的路径规划研究). Cited 2 times.
Authors: 刘海鹏, 念紫帅. 《小型微型计算机系统》 (CSCD, PKU Core), 2024, Issue 4, pp. 853-858 (6 pages).
For robot path planning in complex environments, this paper proposes an improved ant colony algorithm. First, an adaptively adjusted amplification factor is introduced into the heuristic function to enlarge the difference in heuristic information between neighboring nodes, steering ants toward the optimal path. Second, a reward and punishment mechanism is used to update the pheromone on paths, effectively accelerating convergence. Third, the pheromone volatilization factor is adjusted dynamically to speed up the colony's search and make the algorithm converge quickly. Finally, on the obtained optimal path, a corner-optimization algorithm combined with piecewise B-spline curves is applied, effectively improving path smoothness. Simulation results show that the proposed method has better convergence and search capability and better matches the practical requirements of robot motion.
Keywords: heuristic function; reward and punishment mechanism; pheromone volatilization factor; path optimization
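The reward-punishment pheromone update the abstract describes can be sketched roughly as below (the choice to reinforce the iteration-best path while weakening the iteration-worst, and all parameter names, are assumptions about the general scheme, not the paper's exact rule):

```python
def update_pheromone(tau, best_path, worst_path, best_len, worst_len,
                     rho=0.1, q_reward=1.0, q_penalty=0.5, tau_min=1e-6):
    """Evaporate all pheromone, then reward edges on the iteration-best path
    and punish edges on the iteration-worst path; tau maps edge -> pheromone."""
    for edge in tau:
        tau[edge] *= (1.0 - rho)                                # evaporation
    for edge in best_path:
        tau[edge] = tau.get(edge, 0.0) + q_reward / best_len    # reward deposit
    for edge in worst_path:
        tau[edge] = max(tau.get(edge, tau_min) - q_penalty / worst_len, tau_min)  # punishment
    return tau
```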
13. A multi-agent reinforcement learning algorithm based on state-space exploration for sparse-reward scenarios (稀疏奖励场景下基于状态空间探索的多智能体强化学习算法)
Authors: 方宝富, 余婷婷, 王浩, 王在俊. 《模式识别与人工智能》 (EI, CSCD, PKU Core), 2024, Issue 5, pp. 435-446 (12 pages).
Multi-agent task scenarios usually involve huge, diverse state spaces, and in some cases the reward information provided by the environment is very limited, exhibiting sparse-reward characteristics. Most existing multi-agent reinforcement learning algorithms perform poorly in such settings, because they rely only on reward sequences discovered by chance, which makes learning slow and inefficient. To address this, this paper proposes a multi-agent reinforcement learning algorithm based on state-space exploration: it builds a state-subset space and maps a state from it as an intrinsic goal, enabling agents to exploit the state space more fully and reduce unnecessary exploration. Each agent's state is decomposed into its own state and the environment state, and these two, together with the intrinsic goal, generate a mutual-information-based intrinsic reward. By constructing the state-subset space and the mutual-information-based intrinsic reward, appropriate rewards are given both to states approaching the goal state and to states that improve understanding of the environment, incentivizing agents to move toward the goal while deepening their grasp of the environment, and guiding them to adapt flexibly to sparse-reward scenarios. Experiments on multi-agent cooperation scenarios with different degrees of sparsity verify the algorithm's superior performance.
Keywords: reinforcement learning; sparse reward; mutual information; intrinsic reward
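The shape of the intrinsic bonus, rewarding progress toward a goal mapped from the state-subset space, can be sketched as follows (the goal-selection policy and the mutual-information estimator are separate assumed components and are not shown):

```python
import numpy as np

def intrinsic_bonus(own_state, env_state, intrinsic_goal, beta=0.1):
    """Dense bonus added to the sparse environment reward: the agent's own
    state and the environment state are concatenated and compared with the
    intrinsic goal sampled from the state-subset space; closer -> larger bonus."""
    s = np.concatenate([own_state, env_state])
    return beta * np.exp(-np.linalg.norm(s - intrinsic_goal))
```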
14. Collaborative control of an SCR denitration system based on deep reinforcement learning (基于深度强化学习的SCR脱硝系统协同控制策略研究). Cited 3 times.
Authors: 赵征, 刘子涵. 《动力工程学报》 (CAS, CSCD, PKU Core), 2024, Issue 5, pp. 802-809 (8 pages).
Targeting the large inertia and frequent disturbances of selective catalytic reduction (SCR) denitration systems, this paper proposes a control strategy in which a deep deterministic policy gradient (DDPG) agent, improved with multi-dimensional state information and a piecewise reward function, works in concert with a proportional-integral-derivative (PID) controller. Because the SCR denitration system contains a partially observable Markov decision process (POMDP), which lowers the policy-learning efficiency of the DDPG algorithm, multi-dimensional state information for the system is designed first; next, a piecewise reward function is designed; finally, the DDPG-PID collaborative control strategy is constructed to control the SCR denitration system. Results show that the strategy improves DDPG's policy-learning efficiency and the control performance of PID, while providing strong setpoint tracking, disturbance rejection, and robustness.
Keywords: DDPG; reinforcement learning; SCR denitration system; collaborative control; multi-dimensional state; piecewise reward function
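A piecewise (segmented) reward of the kind the paper pairs with DDPG might look like the following (the tolerance band and slope are illustrative; the paper's segment boundaries are not given in the abstract):

```python
def piecewise_reward(nox_error, tol=1.0, slope=0.1):
    """Segmented reward on the NOx tracking error: a positive reward that
    shrinks linearly inside the tolerance band, a growing penalty outside it."""
    e = abs(nox_error)
    if e <= tol:
        return 1.0 - e / tol        # inside the band: reward progress to setpoint
    return -slope * (e - tol)       # outside the band: penalize the excess error
```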
15. Evolutionary game analysis of the regulation of spent power battery recycling from the perspective of prospect theory (前景理论视角下废旧动力电池回收监管演化博弈分析). Cited 2 times.
Authors: 许礼刚, 刘荣福, 陈磊, 倪俊. 《重庆理工大学学报(自然科学)》 (CAS, PKU Core), 2024, Issue 1, pp. 290-297 (8 pages).
Spent power batteries have strong negative externalities that run counter to the original intent of new energy vehicle design. To promote effective recycling, prospect theory is coupled with evolutionary game theory, and a tripartite game model is built that weighs the interests of government, (vehicle-manufacturing) enterprises, and the public, so that government and the public jointly supervise enterprises. Numerical simulations are run for different initial willingness levels, fine compositions, risk-attitude coefficients, and loss-aversion coefficients, and the results are analyzed against real-world awareness of spent power batteries, reward-punishment mechanisms, and profit confidence. The study shows that raising the initial supervision willingness of the public or the government encourages enterprises to recycle spent batteries; when the recycling strategy is loss-making for enterprises, raising the compensation enterprises pay the public and lowering their risk-attitude and loss-aversion coefficients promote active recycling; and in the recycling process, joint supervision outperforms supervision by either party alone.
Keywords: power batteries; evolutionary game; prospect theory; joint supervision; reward and punishment mechanism
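Coupling prospect theory into the game means replacing objective payoffs with perceived values; the classic Kahneman-Tversky value function with risk-attitude and loss-aversion coefficients is the usual form (the defaults below are the standard literature estimates, not the paper's fitted values):

```python
def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Prospect-theory value of an outcome x relative to the reference point:
    concave for gains (risk attitude alpha), convex and steeper for losses
    (risk attitude beta, loss aversion lam > 1)."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

print(prospect_value(100.0), prospect_value(-100.0))  # losses loom larger than gains
```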
16. Low-carbon economic dispatch of an integrated energy system considering a concentrating solar power station with a heat recycling device and integrated demand response (考虑含HRD的光热电站和综合需求响应的综合能源系统低碳经济调度). Cited 4 times.
Authors: 王义军, 孙健淳, 高敏, 秦烨嵘, 张希栋. 《东北电力大学学报》, 2024, Issue 1, pp. 72-82 (11 pages).
Against the "dual carbon" background, to further improve the economy and environmental benefits of integrated energy systems (IES), this paper proposes a low-carbon economic dispatch method, under a reward-penalty ladder carbon trading mechanism, for an IES that includes a concentrating solar power station with a heat recycling device (HRD) and integrated demand response. First, on the source side, an IES architecture is built in which the HRD-equipped solar-thermal station operates jointly with a combined heat and power unit retrofitted with carbon capture; the two-stage operating principle of power-to-gas is analyzed, and a power-to-gas equipment model accounting for waste-heat recovery is established. Second, given the flexibility of electric, heat, and gas loads, an integrated electricity-heat-gas demand response model is built on the load side. Finally, the reward-penalty ladder carbon trading mechanism is introduced to further reduce system carbon emissions, and a low-carbon optimal dispatch model is formulated that minimizes total operating cost over the dispatch horizon, including energy purchase cost, operation and maintenance cost, and carbon trading cost. Case studies show that the proposed method not only raises the units' operating potential but also effectively reduces the system's total operating cost and carbon emissions.
Keywords: concentrating solar power station; heat recycling device; carbon capture technology; reward-penalty ladder carbon trading; integrated demand response
17. Multi-agent self-organizing cooperative pursuit in non-convex environments with an improved MADDPG algorithm (改进MADDPG算法的非凸环境下多智能体自组织协同围捕)
Authors: 张红强, 石佳航, 吴亮红, 王汐, 左词立, 陈祖国, 刘朝华, 陈磊. 《计算机科学与探索》 (CSCD, PKU Core), 2024, Issue 8, pp. 2080-2090 (11 pages).
To improve multi-agent pursuit efficiency in non-convex environments, a multi-agent reinforcement learning algorithm with improved experience replay is proposed. A residual network (ResNet) is used to alleviate network degradation and is combined with the multi-agent deep deterministic policy gradient algorithm (MADDPG), yielding the RW-MADDPG algorithm. To address the low utilization of experience-pool data during multi-agent training, two methods for improving replay-buffer utilization are proposed. To keep agents from getting trapped inside non-convex obstacles (for example, when the target becomes unreachable), a well-designed pursuit reward function enables agents to complete the pursuit task in non-convex obstacle environments. Simulation experiments based on this algorithm show that its reward rises faster during training and it completes the pursuit task sooner: compared with MADDPG, training time is reduced by 18.5% in the static pursuit environment and by 49.5% in the dynamic one, and the pursuit agents it trains obtain a higher global average reward in non-convex obstacle environments.
Keywords: deep reinforcement learning; RW-MADDPG; residual network; experience pool; pursuit reward function
18. An improvement of the POW-based blockchain consensus mechanism (基于POW的区块链共识机制的改进)
Authors: 谭敏生, 徐国庆, 丁琳, 夏石莹. 《计算机应用与软件》 (PKU Core), 2024, Issue 11, pp. 117-122 (6 pages).
In Proof of Work (POW) consensus, hash power dominates the competition for accounting rights during the search for the nonce, wasting computing resources and memory and exposing the system to the danger of a 51% attack. To remedy this defect, an improved POW-based blockchain consensus mechanism, IPOW (Improved Proof of Work), is proposed: it introduces a control weight, an incentive threshold, a valid time, and a reward factor, gives the corresponding algorithms, and derives the final accounting right R from the control weight and the other quantities. Experimental results show that, compared with POW, IPOW weakens the dominance of raw hash power in winning the accounting right: the larger a node's control weight, the easier it is to win the accounting right; the probability of node misbehavior is reduced, and the rich-get-richer phenomenon is lessened.
Keywords: blockchain; proof of work; incentive threshold; control weight; reward factor
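The abstract does not give IPOW's formula for the final accounting right R, so the following is only a guess at how the four introduced quantities could combine (every name and the functional form are assumptions, not the paper's specification):

```python
def accounting_right(hash_power, control_weight, reward_factor, rounds_active,
                     incentive_threshold=10, valid_time=100, elapsed=50):
    """Hypothetical combination: the control weight scales raw hash power, and
    a node earns the reward factor only if it clears the incentive threshold
    within the valid time window -- an assumed reading of IPOW, not its spec."""
    base = hash_power * control_weight
    qualifies = rounds_active >= incentive_threshold and elapsed <= valid_time
    return base + (reward_factor if qualifies else 0.0)
```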
19. Trajectory planning for multi-segment continuum robots based on reinforcement learning (基于强化学习的多段连续体机器人轨迹规划)
Authors: 刘宜成, 杨迦凌, 梁斌, 陈章. 《电子测量技术》 (PKU Core), 2024, Issue 5, pp. 61-69 (9 pages).
For the trajectory-planning problem of multi-segment continuum robots, a planning algorithm based on deep deterministic policy gradient reinforcement learning is proposed. First, under the piecewise constant curvature assumption, a forward kinematics model from the robot's joint angular velocities to its end-effector pose is established. Then, a reinforcement learning agent takes information such as the arm's current pose and target pose as state input, outputs the joint angular velocities as actions, and is guided by a well-designed reward function to move the robot from its initial pose toward the target pose. Finally, a simulation system is built in MATLAB; the results show that the reinforcement learning algorithm successfully plans trajectories for the multi-segment continuum robot, steering its end-effector smoothly to the target pose.
Keywords: continuum robot; trajectory planning; reinforcement learning; pose control; reward guidance
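The piecewise-constant-curvature kinematics the planner is built on has a standard closed form; one segment's homogeneous transform can be written as below (the standard PCC formula, with curvature kappa, bending-plane angle phi, and arc length):

```python
import numpy as np

def pcc_transform(kappa, phi, length):
    """Homogeneous transform of one continuum segment under the piecewise
    constant curvature (PCC) assumption; multiply segment transforms to get
    the end-effector pose of a multi-segment robot."""
    if abs(kappa) < 1e-9:  # straight-segment limit
        return np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                         [0, 0, 1, length], [0, 0, 0, 1.0]])
    th = kappa * length
    cp, sp, ct, st = np.cos(phi), np.sin(phi), np.cos(th), np.sin(th)
    return np.array([
        [cp * cp * (ct - 1) + 1, sp * cp * (ct - 1), cp * st, cp * (1 - ct) / kappa],
        [sp * cp * (ct - 1), cp * cp * (1 - ct) + ct, sp * st, sp * (1 - ct) / kappa],
        [-cp * st, -sp * st, ct, st / kappa],
        [0.0, 0.0, 0.0, 1.0],
    ])
```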
20. Autonomous air-combat decision-making based on improved proximal policy optimization (基于改进近端策略优化的空战自主决策研究)
Authors: 钱殿伟, 齐红敏, 刘振, 周志明, 易建强. 《系统仿真学报》 (CAS, CSCD, PKU Core), 2024, Issue 9, pp. 2208-2218 (11 pages).
To address the high information redundancy and slow convergence of traditional reinforcement learning in autonomous air-combat decision-making, a proximal policy optimization algorithm based on dual observations and a composite reward is proposed. Dual observation information, dominated by interaction information and supplemented by individual feature information, is designed to reduce the impact of highly redundant battlefield information on training efficiency; a composite reward function combining outcome rewards and process rewards is designed to speed up training convergence; and generalized advantage estimation is adopted to improve the proximal policy optimization algorithm, raising the accuracy of advantage estimation. Simulation results show that, in experiments against fixed-program and matrix-game opponents, the decision model makes accurate autonomous decisions according to the battlefield situation and completes the air-combat tasks.
Keywords: reinforcement learning; autonomous air-combat decision-making; dual observation; composite reward; generalized advantage estimation
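The generalized advantage estimation the paper adopts to sharpen PPO's advantage estimates is computed by a standard backward recursion (a self-contained sketch with the usual gamma/lambda parameters):

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: values has length T+1 (the last entry
    is the bootstrap value of the final state); returns per-step advantages."""
    T = len(rewards)
    adv, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        mask = 1.0 - float(dones[t])                     # zero out across episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        running = delta + gamma * lam * mask * running   # backward recursion
        adv[t] = running
    return adv
```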