Abstract
Visual reinforcement learning takes raw images as input and therefore faces challenges such as high-dimensional observation spaces, abundant redundant information, and low sample efficiency. Most existing studies construct a self-supervised auxiliary task to extract effective representations from high-dimensional observations. However, these methods focus only on state features and neglect the rich semantic information in the action space. To address this issue, we propose a self-supervised learning algorithm based on joint state-action masking. By jointly masking and reconstructing states and actions, the algorithm learns representations that are genuinely relevant to the task, thereby improving sample efficiency. Furthermore, to improve the robustness of the model, we introduce a test-time adaptation method: when the environment changes, the reinforcement learning policy network is frozen and only the joint state-action masking module receives a small number of updates, so that self-supervised signals help the agent adapt quickly to the new environment. Experimental results show that, compared with existing algorithms, the proposed method improves the average return by 4.5% on DMControl and by 20.2% on DMControl-GB, effectively improving the performance of the model.
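The abstract describes the mechanism only at a high level: states and actions are masked jointly, the masked entries are reconstructed as a self-supervised signal, and at test time only the masking module is updated while the policy stays frozen. The following is a minimal sketch of that loop under simplifying assumptions that are not taken from the paper: the "masking module" is a hypothetical linear reconstructor `W` over a flattened sequence of interleaved state and action tokens, masked tokens are replaced with zeros, the loss is mean squared error on masked entries only, and the frozen policy is a hypothetical linear map `P`. All function names (`make_tokens`, `joint_mask`, `adapt_at_test_time`, etc.) are illustrative and are not the authors' code.

```python
import random

# Minimal sketch, NOT the paper's architecture: a hypothetical linear
# reconstructor W stands in for the joint state-action masking module.

def policy_act(P, s):
    """Hypothetical frozen linear policy: a = P @ s."""
    return [sum(p * x for p, x in zip(row, s)) for row in P]

def make_tokens(states, actions):
    """Interleave per-step state and action vectors: s_1, a_1, s_2, a_2, ..."""
    tokens = []
    for s, a in zip(states, actions):
        tokens.append(list(s))
        tokens.append(list(a))
    return tokens

def joint_mask(tokens, mask_ratio, rng):
    """Mask whole tokens; state and action tokens are equally eligible.

    Returns the flattened masked input, the flattened target, and a per-entry
    flag marking hidden entries. At least one token is masked and at least
    one token stays visible.
    """
    n_mask = min(len(tokens) - 1, max(1, round(mask_ratio * len(tokens))))
    hidden_tokens = set(rng.sample(range(len(tokens)), n_mask))
    x_in, target, is_masked = [], [], []
    for k, vec in enumerate(tokens):
        hidden = k in hidden_tokens
        for v in vec:
            x_in.append(0.0 if hidden else v)  # assumption: zero-masking
            target.append(v)
            is_masked.append(hidden)
    return x_in, target, is_masked

def masked_loss_and_grad(W, x_in, target, is_masked):
    """MSE over masked entries only, plus its gradient with respect to W."""
    pred = [sum(w * x for w, x in zip(row, x_in)) for row in W]
    idx = [i for i, m in enumerate(is_masked) if m]
    n = len(idx)
    loss = sum((pred[i] - target[i]) ** 2 for i in idx) / n
    grad = [[0.0] * len(x_in) for _ in W]  # rows of visible entries: zero gradient
    for i in idx:
        coef = 2.0 * (pred[i] - target[i]) / n
        grad[i] = [coef * x for x in x_in]
    return loss, grad

def adapt_at_test_time(W, tokens, steps, lr, mask_ratio, seed):
    """A few gradient steps on W only; the policy parameters are never updated."""
    rng = random.Random(seed)
    x_in, target, is_masked = joint_mask(tokens, mask_ratio, rng)
    history = []
    for _ in range(steps):
        loss, grad = masked_loss_and_grad(W, x_in, target, is_masked)
        history.append(loss)
        for i, g_row in enumerate(grad):
            for j, g in enumerate(g_row):
                W[i][j] -= lr * g
    history.append(masked_loss_and_grad(W, x_in, target, is_masked)[0])
    return history

# Usage on hypothetical data standing in for a shifted test environment.
rng = random.Random(0)
T, ds, da = 4, 3, 1
P = [[0.5, -0.2, 0.1]]                        # frozen policy weights (hypothetical)
P_before = [row[:] for row in P]
states = [[rng.uniform(-1.0, 1.0) for _ in range(ds)] for _ in range(T)]
actions = [policy_act(P, s) for s in states]  # the frozen policy still acts
tokens = make_tokens(states, actions)
D = T * (ds + da)
W = [[0.0] * D for _ in range(D)]             # only these weights are adapted
history = adapt_at_test_time(W, tokens, steps=5, lr=0.05, mask_ratio=0.25, seed=1)
print([round(h, 4) for h in history])
```

When an action token is masked, `W` has to reconstruct it from the visible tokens, including the state that produced it, so the self-supervised signal carries information about the state-action relationship; this loosely mirrors the abstract's motivation for masking actions as well as states. A real implementation would presumably use a learned image encoder and a neural reconstruction head rather than a single linear map.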
Authors
LIU Yu-xin
XIANG Liu-yu
HE Zhao-feng
WEI Yun
WU Hui-jia
WANG Yong-gang
(School of Science, Beijing University of Posts and Telecommunications, Beijing 100876, China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China; Beijing Subway Operation Co., Limited, Beijing 100044, China)
Source
Computer Technology and Development
2024, No. 11, pp. 125-132 (8 pages)
Funding
National Key Research and Development Program of China (2022YFB4501600)
National Natural Science Foundation of China (62176025)
Beijing Nova Program (20220484161)
Keywords
visual reinforcement learning
self-supervised learning
masked modeling
test-time adaptation
robustness