
Multi-Scale Image Caption Generation Based on an Improved Transformer
Abstract  The Transformer model is widely used in image captioning, but it has the following problems: (1) it relies on complex neural networks for image preprocessing; (2) self-attention has quadratic computational complexity; (3) masked self-attention lacks image guidance information. To address these issues, a multi-scale image captioning model based on an improved Transformer is proposed. First, the image is divided into multi-scale patches to obtain multi-level image features, which are linearly projected and fed to the Transformer as input; this avoids complex neural-network preprocessing and speeds up training and inference. Second, memory attention with linear complexity is used in the encoder, where learnable shared memory units capture prior knowledge of the whole dataset and mine potential correlations between samples. Finally, visual-guided attention is introduced into the decoder, using visual features as auxiliary information to guide the decoder toward semantic descriptions that better match the image content. Test results on the COCO 2014 dataset show that, compared with the base model, the improved model raises the CIDEr, METEOR, ROUGE, and SPICE scores by 2.6, 0.7, 0.4, and 0.7, respectively. The multi-scale image captioning model based on the improved Transformer generates more accurate language descriptions.
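The abstract does not give implementation details for the encoder's linear-complexity memory attention. As a rough illustration only (not the authors' code), the sketch below shows one common way such a mechanism can be built: image tokens attend to a fixed bank of learnable shared memory slots rather than to each other, so the attention map is N x M and the cost grows linearly with the number of tokens N. All names here (MemoryAttention, num_memory, d_model) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAttention(nn.Module):
    """Attention over a fixed bank of learnable memory slots (linear in token count)."""
    def __init__(self, d_model=512, num_heads=8, num_memory=40):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Shared memory keys/values, learned over the whole dataset (prior knowledge).
        self.mem_k = nn.Parameter(torch.randn(1, num_memory, d_model) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(1, num_memory, d_model) * 0.02)

    def forward(self, x):                                    # x: (B, N, d_model)
        B, N, _ = x.shape
        q = self.q_proj(x).view(B, N, self.num_heads, self.d_head).transpose(1, 2)
        # Memory keys/values are shared across the batch and broadcast in matmul.
        k = self.mem_k.view(1, -1, self.num_heads, self.d_head).transpose(1, 2)
        v = self.mem_v.view(1, -1, self.num_heads, self.d_head).transpose(1, 2)
        # Attention map is (N x M): linear in N for a fixed memory size M.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out_proj(out)

# Example: 196 image-patch tokens with 512-dim features.
tokens = torch.randn(2, 196, 512)
print(MemoryAttention()(tokens).shape)                       # torch.Size([2, 196, 512])

Whether the paper follows exactly this scheme (as opposed to appending memory slots to the regular keys and values) cannot be determined from the abstract alone.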
Authors  CUI Heng (崔衡); ZHANG Haitao (张海涛); YANG Jian (杨剑); DU Baochang (杜宝昌) (Software College, Liaoning Technical University, Huludao 125105, China; Computer Department, Shantou Polytechnic, Shantou 515071, China; School of Geospatial Information, Information Engineering University, Zhengzhou 450052, China)
Source  Software Guide (《软件导刊》), 2024, No. 7, pp. 160-166 (7 pages)
Funding  National Natural Science Foundation of China (42130112); National Key R&D Program of China (2017YFB0503500); KartoBit Research Network Open Fund (KRN2201CA)
Keywords  image captioning; Transformer model; memory attention; multi-scale image; self-attention