
Multifaceted Feature Coding Image Caption Generation Algorithm Based on Transformer (Cited by: 4)
Abstract: Object features extracted by object detection algorithms play an important role in image caption generation. However, using only object-detection features as the input of an image captioning task loses all information other than the key object information, and the generated captions lack an accurate expression of the relationships between objects in the image. To address these shortcomings, two encoders are proposed: an object Transformer encoder for encoding object features within an image, and a shift-window Transformer encoder for encoding relational features within an image, which jointly encode different aspects of the image from different perspectives. The object features produced by the object Transformer encoder are fused with the relational features produced by the shift-window Transformer encoder by concatenation, achieving the fusion of internal relational features and local object features. Finally, a Transformer decoder decodes the fused features to generate the corresponding image caption. Experiments on the Microsoft Common Objects in Context (MS-COCO) dataset show that the proposed model significantly outperforms the baseline model, reaching 38.6% BLEU-4 (BiLingual Evaluation Understudy, 4-gram), 28.7% METEOR (Metric for Evaluation of Translation with Explicit ORdering), 58.2% ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest common subsequence), and 127.4% CIDEr (Consensus-based Image Description Evaluation), better than traditional image captioning models, and generates more detailed and accurate image captions.
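The fusion step the abstract describes — splicing (concatenating) the object encoder's output with the shift-window encoder's relational features to form the memory the decoder attends over — can be sketched as follows. The feature counts and dimension (36 object regions, a 7×7 relational grid, 512-d features) are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

# Hypothetical shapes: 36 detected object regions and a 7x7 = 49-patch
# relational grid, both projected into a shared 512-d feature space.
obj_feats = np.random.rand(36, 512)   # object Transformer encoder output
rel_feats = np.random.rand(49, 512)   # shift-window Transformer encoder output

# Concatenation along the sequence axis yields the fused encoding that
# the Transformer decoder consumes when generating the caption.
fused = np.concatenate([obj_feats, rel_feats], axis=0)
print(fused.shape)  # (85, 512)
```

In a full model, the decoder would cross-attend over all 85 fused positions, so each generated word can draw on both local object features and image-internal relational features.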
Authors: HENG Hongjun; FAN Yuchen; WANG Jialiang (School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China)
Source: Computer Engineering (《计算机工程》; indexed in CAS, CSCD, PKU Core), 2023, No. 2, pp. 199-205 (7 pages)
Funding: National Natural Science Foundation of China (U1333109).
Keywords: image caption; shift window; multi-headed attention mechanism; multimodal task; Transformer encoder