
Multi-layer encoding and decoding model for image captioning based on attention mechanism (Cited by: 3)
Abstract  Image captioning is an important branch of image understanding: it requires not only correctly recognizing the content of an image, but also generating grammatically and semantically correct sentences. Traditional encoder-decoder models cannot make full use of image features and rely on a single decoding method. To address these problems, a multi-layer encoding and decoding model for image captioning based on the attention mechanism, named MLED, was proposed. First, Faster Region-based Convolutional Neural Network (Faster R-CNN) was used to extract image features. Then, a Transformer was employed to extract three kinds of high-level features of the image, and a pyramid fusion method was used to fuse these features effectively. Finally, three Long Short-Term Memory (LSTM) networks were constructed to decode the features of the different layers hierarchically. In the decoding part, a soft attention mechanism enables the model to attend to the important information required at the current step. The proposed model was tested on the MSCOCO dataset and evaluated with BLEU, METEOR, ROUGE-L, and CIDEr. Experimental results show that on BLEU-4, METEOR, and CIDEr, the model improves on the Recall what you see (Recall) model by 2.5, 2.6, and 8.8 percentage points respectively, and on the Hierarchical Attention-based Fusion (HAF) model by 1.2, 0.5, and 3.5 percentage points respectively. In addition, visualization of the generated descriptions shows that the sentences produced by the proposed model accurately reflect the image content.
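The abstract does not give the attention equations, but the soft attention it describes is conventionally the additive (Bahdanau-style) form: the decoder's hidden state scores each encoded image region, a softmax turns the scores into weights, and the context vector is the weighted sum of region features. A minimal NumPy sketch under that assumption (the shapes and the projection names `W_f`, `W_h`, `v` are hypothetical, not taken from the paper):

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """Soft attention over image region features (hypothetical shapes).

    features: (k, d)  k region features from the encoder
    hidden:   (h,)    current decoder LSTM hidden state
    W_f: (d, a), W_h: (h, a), v: (a,)  learned projections
    Returns the context vector (d,) and the attention weights (k,).
    """
    # Additive (Bahdanau-style) scoring of each region against the hidden state
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (k,)
    # Softmax (numerically stabilized) turns scores into a distribution
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context is the attention-weighted sum of region features
    context = weights @ features                          # (d,)
    return context, weights

rng = np.random.default_rng(0)
k, d, h, a = 5, 8, 6, 4
ctx, w = soft_attention(rng.normal(size=(k, d)), rng.normal(size=h),
                        rng.normal(size=(d, a)), rng.normal(size=(h, a)),
                        rng.normal(size=a))
print(ctx.shape, w)  # weights are non-negative and sum to 1
```

In the paper's hierarchical design, a step like this would run inside each of the three LSTM decoders at every time step, letting each decoder focus on the regions relevant to the word it is about to emit.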
Authors  LI Kangkang; ZHANG Jing (School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)
Source  Journal of Computer Applications (《计算机应用》, CSCD, Peking University Core Journal), 2021, Issue 9, pp. 2504-2509 (6 pages)
Funding  Supported by the National Natural Science Foundation of China (61402174).
Keywords  image captioning; Convolutional Neural Network (CNN); Long Short-Term Memory (LSTM) network; multi-layer encoding; multi-layer decoding; attention mechanism