Abstract: Human-Object Interaction (HOI) detection aims to localize the humans and objects in an image and to classify the interactions between them. Practical HOI detection systems perform human-centric scene understanding and therefore have substantial potential impact on applications such as surveillance event detection and robot imitation learning. Following the recent success of Transformer networks in object detection, Transformer-based HOI detection methods have been actively developed and have driven recent progress in HOI detection research. These methods use the Transformer's self-attention mechanism to extract contextual semantic information and to learn embeddings that represent HOI instances, and they have become the new trend for the HOI detection task. This paper reviews the latest research progress of existing methods and groups them into four categories: early end-to-end models, models that use DETR variants and improved backbone networks, language-image pre-trained models, and DETR-based two-stage models. It systematically describes the current state of Transformer-based HOI detection, analyzes the advantages and disadvantages of each family of approaches, traces how methods in the field have developed, and concludes with an outlook on future research directions.