Abstract
Visual question answering (VQA) has become an active research topic in computer vision in recent years. Most existing work addresses question answering over 2D images, which suffer from spatial ambiguity introduced by viewpoint changes, occlusion, and reprojection. Real-world human-computer interaction often takes place in 3D scenes, so 3D question answering has greater practical value. Existing 3D question answering algorithms can perceive 3D objects and their spatial relationships and can answer semantically complex questions. However, a 3D scene represented as a point cloud and a natural-language question belong to two different modalities. The gap between them makes the two difficult to align, and their latent correlated features are easily overlooked. To address this problem, this paper proposes a self-supervised question answering method for real-world 3D scenes, called 3D self-supervised question answering (3DSSQA). The method introduces contrastive learning into a 3D question answering model for the first time: a 3D cross-modal contrastive learning (3DCMCL) model aligns point-cloud data with question data globally, narrowing the heterogeneity gap between the two modalities and mining their correlated features. In addition, a deep interactive attention (DIA) network is used to process the 3D scene and the question, enabling fine-grained and thorough interaction between objects in the scene and keywords in the question. Extensive experiments on the ScanQA dataset show that 3DSSQA reaches an accuracy of 24.3% on the main EM@1 metric, surpassing state-of-the-art models.
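The global alignment step described in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss used by 3DCMCL, so the code below assumes a symmetric InfoNCE-style objective, a common choice for cross-modal contrastive learning. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical and are not taken from the paper. Matched scene/question pairs share a row index, and every other pair in the batch is treated as a negative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(scene_embs, question_embs, temperature=0.07):
    """Assumed InfoNCE-style loss; not necessarily the loss used in the paper.

    scene_embs[i] and question_embs[i] form a positive pair; all other
    combinations in the batch act as negatives.
    """
    scenes = [l2_normalize(v) for v in scene_embs]
    questions = [l2_normalize(v) for v in question_embs]
    # Cosine similarity between every scene and every question, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(s, q)) / temperature for q in questions]
        for s in scenes
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the scene-to-question and question-to-scene directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy 2-D embeddings: aligned pairs should give a much lower loss than swapped pairs.
aligned_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls each scene embedding toward its own question and away from the other questions in the batch, which is one way to narrow the heterogeneity gap the abstract refers to.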
Authors
LI Xiang (李祥)
FAN Zhiguang (范志广)
LIN Nan (林楠)
CAO Yangjie (曹仰杰)
LI Xuexiang (李学相)
(School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450000, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China)
Source
Computer Science (《计算机科学》)
CSCD
PKU Core (北大核心)
2023, Issue 9, pp. 220-226 (7 pages)
Funding
General Program of the National Natural Science Foundation of China (61972092)
Zhengzhou Collaborative Innovation Major Project (20XTZX06013)
Keywords
3D question answering
Self-supervised learning
Contrastive learning
Point clouds
Deep interactive attention