
Multi-Scale Image Caption Generation Based on an Improved Transformer
Abstract  The Transformer model is widely used in image captioning, but it has the following problems: (1) it relies on complex neural networks for image preprocessing; (2) self-attention has quadratic computational complexity; (3) masked self-attention lacks image guidance information. To address these issues, a multi-scale image captioning model based on an improved Transformer is proposed. First, the image is divided into multi-scale patches to obtain multi-level image features, which are linearly projected and fed to the Transformer as input; this avoids complex neural-network preprocessing and speeds up training and inference. Second, memory attention with linear complexity is used in the encoder, where learnable shared memory units capture prior knowledge of the whole dataset and mine potential correlations between samples. Finally, visual-guided attention is introduced into the decoder, using visual features as auxiliary information to guide the decoder toward semantic descriptions that better match the image content. Test results on the COCO 2014 dataset show that, compared with the base model, the improved model raises the CIDEr, METEOR, ROUGE, and SPICE scores by 2.6, 0.7, 0.4, and 0.7, respectively. The multi-scale image captioning model based on the improved Transformer generates more accurate language descriptions.
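The abstract does not give implementation details for the encoder's linear-complexity memory attention. As a rough illustration only (not the authors' code), the sketch below shows one common way such a mechanism can be built: image tokens attend to a fixed bank of learnable shared memory slots rather than to each other, so the attention map is N x M and the cost grows linearly with the number of tokens N. All names here (MemoryAttention, num_memory, d_model) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAttention(nn.Module):
    """Attention over a fixed bank of learnable memory slots (linear in token count)."""
    def __init__(self, d_model=512, num_heads=8, num_memory=40):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Shared memory keys/values, learned over the whole dataset (prior knowledge).
        self.mem_k = nn.Parameter(torch.randn(1, num_memory, d_model) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(1, num_memory, d_model) * 0.02)

    def forward(self, x):                                    # x: (B, N, d_model)
        B, N, _ = x.shape
        q = self.q_proj(x).view(B, N, self.num_heads, self.d_head).transpose(1, 2)
        # Memory keys/values are shared across the batch and broadcast in matmul.
        k = self.mem_k.view(1, -1, self.num_heads, self.d_head).transpose(1, 2)
        v = self.mem_v.view(1, -1, self.num_heads, self.d_head).transpose(1, 2)
        # Attention map is (N x M): linear in N for a fixed memory size M.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out_proj(out)

# Example: 196 image-patch tokens with 512-dim features.
tokens = torch.randn(2, 196, 512)
print(MemoryAttention()(tokens).shape)                       # torch.Size([2, 196, 512])

Whether the paper follows exactly this scheme (as opposed to appending memory slots to the regular keys and values) cannot be determined from the abstract alone.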
Authors  CUI Heng (崔衡); ZHANG Haitao (张海涛); YANG Jian (杨剑); DU Baochang (杜宝昌) (Software College, Liaoning Technical University, Huludao 125105, China; Computer Department, Shantou Polytechnic, Shantou 515071, China; School of Geospatial Information, Information Engineering University, Zhengzhou 450052, China)
Source  Software Guide (《软件导刊》), 2024, No. 7, pp. 160-166 (7 pages)
Funding  National Natural Science Foundation of China (42130112); National Key R&D Program of China (2017YFB0503500); KartoBit Research Network Open Fund (KRN2201CA)
Keywords  image captioning; Transformer model; memory attention; multi-scale image; self-attention