Research on a multi-modal model based on cross-modal multi-dimensional relationship enhancement
Abstract Current multi-modal models do not fully mine the spatial relationships among non-salient image regions or the semantic relationships between contexts, which weakens multi-modal relational reasoning. To address this, the paper proposes a multi-modal model based on cross-modal multi-dimensional relationship enhancement (multi-dimensional relationship enhancement model, MRE), which extracts spatial relationship information between image elements under the latent structure and infers the semantic correlation between vision and language. First, a feature diversity module mines sub-salient region features associated with the salient regions of the image, enhancing the representation of image spatial relationships. Second, a context-guided attention module guides the model to learn how the language context relates to the image, achieving cross-modal relationship alignment. Experiments on the MSCOCO dataset show that the proposed model achieves better performance, with BLEU-4 and CIDEr scores improved by 0.5% and 1.3%, respectively. Applied to the visual question answering task, the approach improves performance by 0.62% on the VQA 2.0 dataset, demonstrating its broad applicability to multi-modal tasks.
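The abstract does not give the internals of the context-guided attention module, so the following is only a rough sketch of the general idea it names: a language context vector scoring image region features so that relevant regions receive more weight. It uses generic scaled dot-product attention, which is an assumption and not necessarily the formulation used in MRE. The function name `context_guided_attention`, the vector shapes, and the toy data are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_guided_attention(context, regions):
    """Attend over image region features using a language context vector.

    context: list[float], a hypothetical language context embedding (length d)
    regions: list[list[float]], hypothetical region features (each length d)
    Returns (weights, attended): weights sum to 1 over the regions, and
    attended is the weighted sum of the region features.
    """
    d = len(context)
    # Scaled dot-product score between the context and each region.
    scores = [sum(c * r for c, r in zip(context, region)) / math.sqrt(d)
              for region in regions]
    weights = softmax(scores)
    # Weighted sum of region features, one value per feature dimension.
    attended = [sum(w * region[i] for w, region in zip(weights, regions))
                for i in range(d)]
    return weights, attended

# Toy data (assumed): three regions with 2-D features; the context lies on axis 0.
context = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, attended = context_guided_attention(context, regions)
```

In this toy setup the region most aligned with the context vector receives the largest weight and the orthogonal region the smallest. A real model would learn projections for the context and region features rather than comparing raw vectors.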
Authors Cheng Xi (成曦); Yang Guan (杨关); Liu Xiaoming (刘小明); Liu Yang (刘阳) (School of Computer Science, Zhengzhou 450007, China; Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhongyuan University of Technology, Zhengzhou 450007, China; School of Telecommunications Engineering, Xidian University, Xi'an 710071, China)
Source Application Research of Computers (《计算机应用研究》), indexed in CSCD and the PKU Core list, 2023, No. 8, pp. 2367-2374 (8 pages)
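Funding National Natural Science Foundation of China, Young Scientists Fund (61906141); Key Scientific Research Project of Higher Education Institutions of Henan Province (23A520022); Key Laboratory of Applied Statistics of the Ministry of Education, Northeast Normal University (135131007).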
Keywords image description; visual question answering; feature diversification; spatial relationship; contextual semantic relationship; feature fusion; multimodal encoding