The online 3D packing problem has received increasing attention in recent years due to its practical value. However, the problem itself possesses some peculiar properties, such as sequential decision-making and the la...The online 3D packing problem has received increasing attention in recent years due to its practical value. However, the problem itself possesses some peculiar properties, such as sequential decision-making and the large size of the state space, which have made the use of reinforcement learning with Markov decision processes a popular approach for solving this problem. In this paper, we focus on the problem of high variance in value estimation caused by reward uncertainty in the presence of highly uncertain dynamics. To address this, proposed a solution based on auxiliary tasks and intrinsic rewards for the online 3D bin packing problem, guided by a binary-valued network, to assist the agent in learning the policy within the framework of actor-critic deep reinforcement learning. Specifically, the maintenance of two-valued networks and the utilization of multi-valued network estimates are employed to replace the original value estimates, aiming to provide better guidance for the learning of policy networks. Experimentally, it has been demonstrated that our model can achieve more robust learning and outperform previous works in terms of performance.展开更多
文摘The online 3D packing problem has received increasing attention in recent years due to its practical value. However, the problem itself possesses some peculiar properties, such as sequential decision-making and the large size of the state space, which have made the use of reinforcement learning with Markov decision processes a popular approach for solving this problem. In this paper, we focus on the problem of high variance in value estimation caused by reward uncertainty in the presence of highly uncertain dynamics. To address this, proposed a solution based on auxiliary tasks and intrinsic rewards for the online 3D bin packing problem, guided by a binary-valued network, to assist the agent in learning the policy within the framework of actor-critic deep reinforcement learning. Specifically, the maintenance of two-valued networks and the utilization of multi-valued network estimates are employed to replace the original value estimates, aiming to provide better guidance for the learning of policy networks. Experimentally, it has been demonstrated that our model can achieve more robust learning and outperform previous works in terms of performance.