Abstract
Image captioning is currently dominated by Encoder-Decoder architectures built on deep neural networks. Most work focuses on image feature extraction and attention mechanisms, such as the hard attention model and the top-down attention model. These methods use only information from the previous time step to predict the output at the current time step, so the decoder's input covers a single time dimension; in addition, having the decoder produce only a single output limits prediction accuracy. This paper proposes an image captioning model that fuses information across multiple time dimensions, with a horizontal structure and a vertical structure. The horizontal structure enriches the decoder's input with semantic information from past and present time steps, and the vertical structure enriches the decoder's output by simultaneously generating prediction vectors for the present and future time steps. The decoder in each of these two independent structures generates multiple outputs, which are fused by weighting to form that structure's final output. Experimental results on the Flickr30k and MSCOCO datasets show that both structures score higher than other mainstream models on multiple evaluation metrics and describe images more accurately.
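The abstract does not give the fusion formula, so the following is only a minimal sketch of what weighting several decoder outputs for one time step could look like. The function names, the toy score vectors, the 0.7/0.3 weights, and the pairing of a "present" prediction with an earlier step's "future" prediction are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(outputs, weights):
    """Fuse several score vectors for one time step into a single vector.

    outputs: list of equal-length score vectors (hypothetical decoder outputs)
    weights: one non-negative weight per vector; normalized here to sum to 1
    """
    if len(outputs) != len(weights):
        raise ValueError("need exactly one weight per output vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    size = len(outputs[0])
    if any(len(vec) != size for vec in outputs):
        raise ValueError("all output vectors must have the same length")
    norm = [w / total for w in weights]
    return [sum(w * vec[i] for w, vec in zip(norm, outputs)) for i in range(size)]


# Toy example over a 3-word vocabulary (all values are made up).
present_scores = [2.0, 0.5, 0.1]  # assumed: step t's prediction for word t
future_scores = [1.0, 1.5, 0.2]   # assumed: step t-1's prediction for word t

fused = weighted_fusion([present_scores, future_scores], [0.7, 0.3])
probs = softmax(fused)
predicted_index = max(range(len(probs)), key=probs.__getitem__)
```

In this sketch the weights are fixed constants; in a trained model they could equally be learned parameters, which the abstract leaves unspecified.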
Authors
李坤
周世斌
朱佳明
张国鹏
LI Kun; ZHOU Shi-bin; ZHU Jia-ming; ZHANG Guo-peng (School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal
2022, Issue 1, pp. 103-110 (8 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (61971421, 62071470).
Keywords
image captioning
decoder
multiple time dimensions
attention mechanism