Abstract
In recent years, deep learning has shown clear advantages in image captioning research. In deep learning models, the relationships between objects in an image play an important role in image representation. To better detect the visual relationships in an image, this paper constructs an image caption generation model (YOLOv4-GCN-GRU, YGG) based on a graph neural network and a guidance vector. The model builds a graph from the spatial and semantic information of the objects detected in the image and uses a graph convolutional network (GCN) as the encoder to represent each region of the graph. During decoding, an additional guidance neural network is trained to produce a guidance vector that assists the decoder in generating sentences automatically. Comparative experiments on the MSCOCO image dataset show that the YGG model performs better, improving the CIDEr-D score from 138.9% to 142.1%.
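The graph-encoder idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how edges are defined, so linking two detected boxes when their IoU exceeds a threshold is an assumption made here purely for illustration, and the names `iou`, `build_adjacency`, `normalize`, and `gcn_layer` are hypothetical. The propagation step is the standard GCN rule, ReLU(D^-1/2 A D^-1/2 H W), with self-loops included in A.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_adjacency(boxes, threshold=0.1):
    """Adjacency matrix with self-loops.

    Assumption: two regions are connected when their IoU exceeds `threshold`.
    The paper's actual spatial/semantic edge rule is not given in the abstract.
    """
    n = len(boxes)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if iou(boxes[i], boxes[j]) > threshold:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def normalize(adj):
    """Symmetric normalization D^-1/2 A D^-1/2 (degrees >= 1 due to self-loops)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj_norm, features, weight):
    """One GCN propagation step: ReLU(adj_norm @ features @ weight)."""
    out = matmul(matmul(adj_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in out]


if __name__ == "__main__":
    # Three hypothetical detections: two overlapping boxes and one isolated box.
    boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy region features
    weight = [[1.0, 0.0], [0.0, 1.0]]                # identity weights
    print(gcn_layer(normalize(build_adjacency(boxes)), features, weight))
```

In this toy setup the two overlapping regions exchange features while the isolated region keeps its own, which is the sense in which a GCN encoder makes each region's representation depend on its neighbours in the graph.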
Authors
TONG Guoxiang; LI Yueyang (College of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source
Journal of Data Acquisition and Processing (《数据采集与处理》)
CSCD
Peking University Core Journal
2023, Issue 1, pp. 209-219 (11 pages)
Funding
National Key Research and Development Program of China (2018YFB1700902).
Keywords
image caption
spatial semantic map
graph convolution neural network
guidance vector
generation model