Abstract
Automatically generating a natural-language description of an image has attracted widespread attention in deep learning in recent years, both because of the importance of image description in practical applications and because it bridges two major areas of artificial intelligence: computer vision and natural language processing. Most previous models use template-based or simple encoder-decoder methods; the text they generate is structurally monotonous and cannot express the deeper meaning of an image through the relationships among the objects it contains. This paper proposes an image description method based on an attention mechanism and multimodal fusion. The attention mechanism is improved on top of an LSTM (Long Short-Term Memory) network, and a multimodal layer is added after the attention structure to fuse the contextual feature information of the image with the hidden state of the LSTM. The method is validated on two public datasets, MS COCO and Flickr 30K; the experimental results show that it is effective and makes the generated description sentences richer.
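The architecture described above can be illustrated with a minimal sketch: soft attention scores each image region feature against the LSTM hidden state, the weighted sum gives a context vector, and a multimodal layer fuses that context with the hidden state. This is an illustrative reading of the abstract, not the authors' implementation; the learned weight matrices are replaced by identity mappings (assumed for brevity), and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend_and_fuse(regions, hidden):
    """Soft attention over image region features, followed by a
    multimodal fusion of the context vector and the LSTM hidden state.
    Learned projection matrices are omitted (identity) in this sketch."""
    # Attention score: dot product of each region feature with the hidden state
    scores = [sum(r_i * h_i for r_i, h_i in zip(r, hidden)) for r in regions]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of region features
    context = [sum(w * r[d] for w, r in zip(weights, regions))
               for d in range(len(hidden))]
    # Multimodal layer: fuse context and hidden state through a tanh nonlinearity
    fused = [math.tanh(c + h) for c, h in zip(context, hidden)]
    return weights, fused
```

In a full captioning model, `fused` would feed a softmax over the vocabulary to predict the next word at each decoding step.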
Authors
NIU Bin
LI Jin-ze
FANG Chao
MA Li
XU He-ran
JI Xing-hai
NIU Bin; LI Jin-ze; FANG Chao; MA Li; XU He-ran; JI Xing-hai (College of Information, Liaoning University, Shenyang 110036, China; College of Information, Bohai University, Jinzhou 121001, China; 65735 Troops of PLA, Shenyang 118005, China)
Source
《辽宁大学学报(自然科学版)》
CAS
2019, No. 1, pp. 38-45 (8 pages)
Journal of Liaoning University: Natural Sciences Edition
Funding
Doctoral Scientific Research Start-up Foundation Guidance Program of the Liaoning Provincial Department of Science and Technology (20170520276)