Research on a multimodal model based on cross-modal multi-dimensional relationship enhancement

Abstract: Current multimodal models fail to fully exploit the spatial relationships of non-salient image regions and the semantic relationships between contexts, which leads to poor multimodal relational reasoning. To address this problem, this paper proposes a multimodal model based on cross-modal multi-dimensional relationship enhancement (multi-dimensional relationship enhancement model, MRE), which extracts spatial relationship information between image elements under the latent structure and infers the semantic correlation between vision and language. First, a feature diversity module mines sub-salient region features associated with the salient regions in an image, thereby enhancing the representation of image spatial relationship features. Second, a context-guided attention module guides the model to learn how the language context relates to the image, achieving cross-modal relationship alignment. Experiments on the MSCOCO dataset show that the proposed model achieves better performance, with BLEU-4 and CIDEr scores improved by 0.5% and 1.3%, respectively. Applied to the visual question answering task, the approach improves performance by 0.62% on the VQA 2.0 dataset, which demonstrates its broad applicability to multimodal tasks.
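The abstract does not give the equations of the context-guided attention module, so the following is only a rough illustration of the general idea of letting a language context steer attention over image regions. It uses generic scaled dot-product attention as a stand-in and is not the paper's module; the function name `context_guided_attention`, the inputs `context` and `regions`, and the toy values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_guided_attention(context, regions):
    """Attend over image region features using a language context vector.

    context: list[float] of length d (hypothetical language-context embedding)
    regions: list of list[float], each of length d (hypothetical region features)
    Returns (weights, attended), where attended is the weighted sum of regions.
    """
    d = len(context)
    # Scaled dot-product score between the context and each region.
    scores = [
        sum(c * r for c, r in zip(context, region)) / math.sqrt(d)
        for region in regions
    ]
    weights = softmax(scores)
    # Weighted sum of region features, one value per feature dimension.
    attended = [
        sum(w * region[i] for w, region in zip(weights, regions))
        for i in range(d)
    ]
    return weights, attended


# Toy example (assumed values): three 2-dimensional region features.
context = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, attended = context_guided_attention(context, regions)
```

In this toy setup the region most aligned with the context vector receives the largest weight, which is the basic mechanism by which a textual context could emphasize relevant image regions. A learned model would add projection matrices and multiple heads, which are omitted here.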

Authors: Cheng Xi; Yang Guan; Liu Xiaoming; Liu Yang (School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China; Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhongyuan University of Technology, Zhengzhou 450007, China; School of Telecommunications Engineering, Xidian University, Xi'an 710071, China)

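Affiliations: School of Computer Science, Zhongyuan University of Technology; Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhongyuan University of Technology; School of Telecommunications Engineering, Xidian University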

Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2023, No. 8, pp. 2367-2374 (8 pages)

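Funding: Youth Program of the National Natural Science Foundation of China (61906141); Key Scientific Research Project of Higher Education Institutions of Henan Province (23A520022); Project of the Key Laboratory of Applied Statistics of the Ministry of Education, Northeast Normal University (135131007).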

Keywords: image description; visual question answering; feature diversification; spatial relationship; contextual semantic relationship; feature fusion; multimodal encoding

Classification code: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]