
Self-supervised Learning for 3D Real-scenes Question Answering
Abstract: Visual question answering (VQA) has gradually become one of the research hotspots in computer vision in recent years. Most current question-answering research is 2D-image-based and often suffers from spatial ambiguity introduced by viewpoint changes, occlusion, and reprojection. In practice, human-computer interaction scenarios are often three-dimensional, so 3D-scene-based question answering has greater practical value. Existing 3D question answering algorithms can perceive 3D objects and their spatial relationships and can answer semantically complex questions. However, 3D scenes represented by point clouds and the target questions belong to two different modalities; the pronounced gap between them makes alignment difficult, and their latent related features are easily overlooked. To address this problem, this paper proposes a self-supervised question answering method for real 3D scenes, called 3D self-supervised question answering (3DSSQA), which is the first to introduce contrastive learning into a 3D question answering model. Within 3DSSQA, a 3D cross-modal contrastive learning (3DCMCL) module first aligns point-cloud data with question data globally, narrowing the heterogeneity gap between the two modalities before mining their related features. In addition, a deep interactive attention (DIA) network is adopted to align 3D objects with question keywords at a finer granularity, enabling sufficient interaction between them. Extensive experiments on the ScanQA dataset demonstrate that 3DSSQA achieves an accuracy of 24.3% on the main EM@1 metric, surpassing state-of-the-art models.
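The abstract describes two components: a global cross-modal contrastive objective that aligns scene and question features, and a cross-attention network that lets 3D objects and question keywords interact. The following is a minimal PyTorch sketch of these two ideas; the class names, feature dimensions, and temperature value are illustrative assumptions and do not reproduce the paper's actual 3DCMCL or DIA architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalContrastiveLoss(nn.Module):
    """Symmetric InfoNCE loss: pulls matched (scene, question) pairs together
    and pushes mismatched pairs in the batch apart."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, scene_feat: torch.Tensor, question_feat: torch.Tensor) -> torch.Tensor:
        # scene_feat, question_feat: (B, D) globally pooled features of each modality
        scene_feat = F.normalize(scene_feat, dim=-1)
        question_feat = F.normalize(question_feat, dim=-1)
        logits = scene_feat @ question_feat.t() / self.temperature  # (B, B) cosine similarities
        targets = torch.arange(scene_feat.size(0), device=scene_feat.device)
        # Matched pairs sit on the diagonal; average scene->question and question->scene.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


class DeepInteractiveAttentionSketch(nn.Module):
    """Object tokens attend to question words and vice versa via standard
    multi-head cross-attention (a stand-in for the DIA network)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.obj_to_word = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.word_to_obj = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj_tokens: torch.Tensor, word_tokens: torch.Tensor):
        # obj_tokens: (B, N_obj, D) object proposal features from the point-cloud scene
        # word_tokens: (B, N_word, D) question word features
        obj_ctx, _ = self.obj_to_word(obj_tokens, word_tokens, word_tokens)
        word_ctx, _ = self.word_to_obj(word_tokens, obj_tokens, obj_tokens)
        return obj_ctx, word_ctx


if __name__ == "__main__":
    loss_fn = CrossModalContrastiveLoss()
    dia = DeepInteractiveAttentionSketch()
    loss = loss_fn(torch.randn(8, 256), torch.randn(8, 256))
    obj_ctx, word_ctx = dia(torch.randn(8, 32, 256), torch.randn(8, 14, 256))
    print(loss.item(), obj_ctx.shape, word_ctx.shape)
```

The symmetric InfoNCE form is a common way to reduce a modality gap, since every mismatched pair in the batch serves as a negative in both the scene-to-question and question-to-scene directions; how 3DSSQA constructs its positives and negatives is specified in the paper itself.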
Authors: LI Xiang (李祥); FAN Zhiguang (范志广); LIN Nan (林楠); CAO Yangjie (曹仰杰); LI Xuexiang (李学相) (School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450000, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2023, No. 9, pp. 220-226 (7 pages)
Funding: General Program of the National Natural Science Foundation of China (61972092); Zhengzhou Collaborative Innovation Major Project (20XTZX06013).
Keywords: 3D question answering; Self-supervised learning; Contrastive learning; Point clouds; Deep interactive attention
