Image captioning model based on divergence-based and spatial consistency constraints
Abstract The multi-head attention mechanism is widely used in image captioning. Its multi-branch structure jointly attends to information from different representation subspaces, which makes the learned features more discriminative. However, because each head models the input independently, diversity among the heads' representations is not guaranteed, and the modeling is redundant. In addition, most existing attention models suffer from "attention defocus": they attend to unimportant image regions when generating the target words, so the generated sentences are not accurate enough. To address these problems, we propose a training objective that serves as an auxiliary regularization term to improve both the diversity and the accuracy of the multi-head attention mechanism. First, we present a divergence-based regularization that encourages each head to attend to a different part of the described target, which simplifies the modeling task of each head. The partial representations are then aggregated into a complete and more discriminative visual representation. Second, we introduce a spatial consistency regularization that models the spatial relationship among the attended regions. By encouraging the attended regions to be as concentrated as possible, it suppresses the influence of background regions and improves the accuracy of attention. The two regularizations are applied jointly. We evaluate the proposed method on the MS COCO dataset and compare it with several representative methods. The experimental results show that the proposed method significantly improves captioning accuracy.
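The abstract does not give the form of either regularizer, so the following is only an illustrative sketch of how two such penalties could be added to a captioning loss. Everything in it is an assumption rather than the authors' formulation: the function names, the use of mean pairwise cosine similarity as the divergence penalty, the attention-weighted spatial variance as the concentration penalty, and the weights `lam_div` and `lam_sp`.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two attention vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def divergence_penalty(heads):
    """Assumed penalty: mean pairwise cosine similarity between heads.

    Lower values mean the heads attend to more distinct regions.
    """
    pairs = list(combinations(heads, 2))
    if not pairs:  # a single head has nothing to diverge from
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def spatial_penalty(heads, coords):
    """Assumed penalty: attention-weighted spatial variance, averaged over heads.

    Lower values mean each head's attention is more concentrated.
    """
    total = 0.0
    for weights in heads:
        # Attention-weighted centroid of the attended regions.
        cx = sum(a * x for a, (x, _) in zip(weights, coords))
        cy = sum(a * y for a, (_, y) in zip(weights, coords))
        # Weighted squared distance of each region from that centroid.
        total += sum(
            a * ((x - cx) ** 2 + (y - cy) ** 2)
            for a, (x, y) in zip(weights, coords)
        )
    return total / len(heads)


def regularized_loss(caption_loss, heads, coords, lam_div=0.1, lam_sp=0.1):
    """Captioning loss plus the two hypothetical regularization terms."""
    return (
        caption_loss
        + lam_div * divergence_penalty(heads)
        + lam_sp * spatial_penalty(heads, coords)
    )
```

Here `heads` is a list of per-head attention distributions over image regions, and `coords` gives the (x, y) position of each region. Under these assumptions, two heads that each put all their weight on a different region incur no penalty, while identical or spatially spread-out attention maps raise the loss.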
Authors JIANG Wenhui (姜文晖); CHEN Zhiliang (陈志亮); CHENG Yibo (程一波); FANG Yuming (方玉明); ZUO Yifan (左一帆) (School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330032, China)
Source Journal of Beijing University of Aeronautics and Astronautics (《北京航空航天大学学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2024, Issue 2, pp. 456-465 (10 pages)
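Funding National Natural Science Foundation of China (62161013, 62162029); Key Research and Development Program of Jiangxi Province (20203BBE53033); Natural Science Foundation of Jiangxi Province (20224BAB212010, 20212BAB202011, 20224BAB212012, 20232BAB202001).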
Keywords multi-head attention mechanism; image captioning; diversity; spatial consistency; model fusion