

Research on Off-Policy Evaluation in Reinforcement Learning: A Survey
Abstract  In reinforcement learning (RL) applications, off-policy evaluation (OPE) is employed to avoid unexpected risks before a policy is actually deployed, which gives it promising applications in fields such as robotics and autonomous driving. OPE estimates the state value of a target policy from trajectory data collected by a behavior policy, without actually executing the target policy, so that the policy can be evaluated and, if necessary, controlled in advance of deployment. The usual learning goal is to minimize the mean squared error (MSE) between the estimated state value of the target policy and the state value obtained when the target policy is actually executed. In recent years OPE has attracted growing interest from both researchers and engineers; at the same time, the discrepancy between the behavior policy and the target policy, together with the reward sparsity of behavior policies in emerging applications, continues to pose challenges. This paper systematically reviews the main OPE methods of the past twenty years, which fall into four categories: direct-model-based, importance-sampling-based, hybrid-model-based, and PU-learning-based (Positive-Unlabeled) methods. Specifically, the survey (1) describes the theoretical background of OPE; (2) explains the mechanism of each category and the detailed differences among its models; and (3) compares the methods and models in depth, reproducing the mainstream OPE estimators and comparing their performance experimentally. For direct-model-based methods, we describe the value-iteration estimator and the direct model (DM). Among importance-sampling-based methods, we present Importance Sampling (IS), Step Importance Sampling (step-IS), Weighted Importance Sampling (WIS), Step Weighted Importance Sampling (step-WIS), Regression Importance Sampling (RIS), Marginalized Importance Sampling (MIS), and Incremental Importance Sampling (INCRIS). We then compare hybrid-model-based methods, which combine direct models with importance sampling, including Doubly Robust (DR), Weighted Doubly Robust (WDR), Model and Guided Importance Sampling Combined (MAGIC), the Longitudinal Targeted Maximum Likelihood Estimator (LTMLE), and TMLE for RL (RLTMLE). We also introduce PU-learning-based methods, namely Off-Policy Classification (OPC) and Soft Off-Policy Classification (SoftOPC). We implement the mainstream OPE models and compare their performance in five typical RL environments: ModelWin, ModelFail, GridWorld, FlappyBird, and SpaceInvaders-v0. Our analysis finds that no single OPE method is consistently the best performer, although hybrid-model-based methods generally outperform importance-sampling-based methods. Most OPE estimators rely on strict assumptions and do not scale well with horizon length. Compared with the DM estimator, the hybrid WDR, MAGIC, and RLTMLE estimators are not always optimal in the ModelWin environment. MAGIC and RLTMLE perform well in most cases, but when the historical trajectory data are relatively scarce, the MSE between the estimated state value of the target policy and the state value of its actual execution remains relatively large. Finally, we discuss the remaining technical challenges and possible future directions of OPE: the behavior-target policy discrepancy and the reward sparsity of behavior policies in emerging applications still leave many open problems, while Google's PU-learning-based approach, which relaxes some of OPE's strict constraints, offers inspiration for future OPE research and for grounding OPE applications.
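For readers unfamiliar with the estimators named above, the OPE objective and the basic importance-sampling and doubly robust estimators can be written compactly as follows. This is the standard textbook formulation with notation chosen here for illustration (it is not copied from the paper): $\pi_e$ is the target (evaluation) policy, $\pi_b$ the behavior policy, $H$ the horizon, $\gamma$ the discount factor, $n$ the number of logged trajectories, and $\hat{Q}$, $\hat{V}$ the value estimates of a learned model.

\[
V(\pi_e) = \mathbb{E}_{\tau \sim \pi_e}\Big[\sum_{t=0}^{H-1} \gamma^{t} r_t\Big],
\qquad
\mathrm{MSE}(\hat{V}) = \mathbb{E}\big[(\hat{V}(\pi_e) - V(\pi_e))^{2}\big]
\]

\[
\rho^{(i)}_{0:t} = \prod_{k=0}^{t} \frac{\pi_e\big(a^{(i)}_k \mid s^{(i)}_k\big)}{\pi_b\big(a^{(i)}_k \mid s^{(i)}_k\big)},
\qquad
\hat{V}_{\mathrm{IS}} = \frac{1}{n}\sum_{i=1}^{n} \rho^{(i)}_{0:H-1} \sum_{t=0}^{H-1} \gamma^{t} r^{(i)}_t
\]

\[
\hat{V}_{\mathrm{step\text{-}IS}} = \frac{1}{n}\sum_{i=1}^{n} \sum_{t=0}^{H-1} \gamma^{t}\, \rho^{(i)}_{0:t}\, r^{(i)}_t,
\qquad
\hat{V}_{\mathrm{WIS}} = \frac{\sum_{i=1}^{n} \rho^{(i)}_{0:H-1} \sum_{t=0}^{H-1} \gamma^{t} r^{(i)}_t}{\sum_{i=1}^{n} \rho^{(i)}_{0:H-1}}
\]

\[
\hat{V}_{\mathrm{DR}} = \frac{1}{n}\sum_{i=1}^{n} \sum_{t=0}^{H-1} \gamma^{t}\Big(\rho^{(i)}_{0:t}\, r^{(i)}_t - \rho^{(i)}_{0:t}\, \hat{Q}\big(s^{(i)}_t, a^{(i)}_t\big) + \rho^{(i)}_{0:t-1}\, \hat{V}\big(s^{(i)}_t\big)\Big),
\qquad \rho^{(i)}_{0:-1} := 1
\]

Here $\hat{V}(s) = \sum_{a} \pi_e(a \mid s)\, \hat{Q}(s, a)$. WIS trades a small bias for much lower variance than IS by normalizing the importance weights, and DR remains unbiased as long as either the value model or the importance weights are accurate, which helps explain why the hybrid family (DR, WDR, MAGIC, RLTMLE) tends to outperform plain importance sampling in the experiments summarized above.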
Authors  WANG Shuo-Ru; NIU Wen-Jia; TONG En-Dong; CHEN Tong; LI He; TIAN Yun-Zhe; LIU Ji-Qiang; HAN Zhen; LI Yi-Dong (Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing 100044, China)
Source  Chinese Journal of Computers (《计算机学报》; indexed by EI, CAS, CSCD, Peking University Core), 2022, Issue 9, pp. 1926-1945 (20 pages)
Funding  Supported by the National Natural Science Foundation of China (61972025, 61802389, 61672092, U1811264, 61966009) and the National Key Research and Development Program of China (2020YFB1005604, 2020YFB2103802).
Keywords  artificial intelligence; reinforcement learning; off-policy evaluation; importance sampling; PU learning