Information Fusion in Multiple Time Dimensions for Image Captioning (多时间维度信息融合的图像描述方法)

Cited by: 1
Abstract: The mainstream architecture for image captioning is currently the Encoder-Decoder architecture based on deep neural networks. Most work focuses on image feature extraction and attention mechanisms, such as the hard attention model and the top-down attention model. These methods use only the information from the previous time step to predict the output at the current time step, so the decoder's input covers a single time dimension. In addition, the decoder produces only a single output, which also limits the accuracy of the prediction. This paper proposes an image captioning model that fuses information across multiple time dimensions, with a horizontal and a vertical structure. The horizontal structure uses semantic information from both past and present time steps to enrich the decoder's input. The vertical structure generates prediction vectors for the present and future time steps simultaneously to enrich the decoder's output. The decoders of these two independent structures each generate multiple outputs, which are then combined by weighted fusion to form the final output of each structure. Experimental results on the Flickr30k and MSCOCO datasets show that both structures score higher than other mainstream models on multiple evaluation metrics and generate more accurate image descriptions.
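The weighted-fusion step described in the abstract can be illustrated with a small sketch. The abstract does not give the fusion formula, the weights, or how they are obtained, so everything below is an assumption made for illustration: the function names `softmax` and `weighted_fusion`, the fixed weights `0.7` and `0.3`, the toy vocabulary, and the choice to fuse probability distributions rather than raw vectors are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(prediction_logits, weights):
    """Fuse several prediction vectors for the same time step.

    Each vector is turned into a distribution with softmax, then the
    distributions are averaged using the normalized weights.
    This is an illustrative assumption, not the paper's exact formula.
    """
    if len(prediction_logits) != len(weights):
        raise ValueError("need exactly one weight per prediction vector")
    if len({len(v) for v in prediction_logits}) != 1:
        raise ValueError("all prediction vectors must share one vocabulary size")
    total_w = sum(weights)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive value")
    norm_w = [w / total_w for w in weights]
    probs = [softmax(v) for v in prediction_logits]
    vocab_size = len(probs[0])
    return [sum(w * p[i] for w, p in zip(norm_w, probs)) for i in range(vocab_size)]


# Hypothetical toy data: two predictions for the word at time step t.
vocab = ["a", "dog", "runs"]
present_logits = [0.2, 2.0, 0.1]  # predicted at step t (the "present" output)
earlier_logits = [0.1, 1.0, 1.5]  # predicted at step t-1 (its "future" output)

fused = weighted_fusion([present_logits, earlier_logits], [0.7, 0.3])
best_word = vocab[max(range(len(fused)), key=fused.__getitem__)]
print(best_word, [round(p, 3) for p in fused])
```

Because the normalized weights sum to one and each input is a distribution, the fused vector is itself a valid distribution. In a trained model the weights would more plausibly be learned than fixed by hand.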
Authors: LI Kun (李坤); ZHOU Shi-bin (周世斌); ZHU Jia-ming (朱佳明); ZHANG Guo-peng (张国鹏) (School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2022, No. 1, pp. 103-110 (8 pages)
Funding: Supported by the National Natural Science Foundation of China (61971421, 62071470).
Keywords: image captioning; decoder; multiple time dimensions; attention mechanism