Abstract
Most traditional video captioning models adopt an encoder-decoder framework: in the encoding stage, convolutional neural networks process the video, and in the decoding stage, a long short-term memory (LSTM) network generates the corresponding caption. Exploiting the temporal correlation and multi-modality of video, this paper proposes a hybrid model: a multi-modal video captioning model based on hard attention. In the encoding stage, the model uses fusion schemes to associate the video and audio modalities and produce the final feature outputs; in the decoding stage, it adds a hard attention mechanism on top of the LSTM to generate the video description. On the MSR-VTT (Microsoft Research Video to Text) dataset, this hybrid model improves machine-translation metrics by 0.2% to 3.8% over the baseline models. The experimental results show that the multi-modal hybrid model with hard attention can generate accurate descriptive captions for video.
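The abstract contrasts hard attention with the usual soft attention: instead of taking a probability-weighted average over all time steps, the decoder samples a single time step from the attention distribution at each step. A minimal NumPy sketch of this idea, with a naive concatenation of video and audio features standing in for the paper's fusion model (all names, dimensions, and the concatenation scheme here are illustrative assumptions, not the authors' exact method):

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(video_feats, audio_feats):
    # Placeholder fusion: concatenate per-time-step video and audio features.
    # (The paper evaluates several fusion models; this is the simplest stand-in.)
    return np.concatenate([video_feats, audio_feats], axis=-1)

def hard_attention_step(features, query, rng):
    # Attention scores: dot product between the decoder query and each time step.
    scores = features @ query
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Hard attention: SAMPLE one time step (soft attention would instead
    # return the probability-weighted average of all time steps).
    idx = rng.choice(len(features), p=probs)
    return features[idx], idx

T, dv, da = 8, 4, 3                     # time steps, video dim, audio dim
video = rng.normal(size=(T, dv))        # stand-in CNN video features
audio = rng.normal(size=(T, da))        # stand-in audio features
fused = fuse(video, audio)              # shape (T, dv + da)
query = rng.normal(size=dv + da)        # stand-in for the LSTM hidden state
context, idx = hard_attention_step(fused, query, rng)
print(fused.shape, context.shape, int(idx))
```

Because the sampling step is non-differentiable, hard-attention models are typically trained with a REINFORCE-style estimator rather than plain backpropagation; the sketch above only illustrates the forward selection.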
Authors
Guo Ningning
Jiang Linhua
(School of Optical-Electrical & Computer Engineering, University of Shanghai for Science & Technology, Shanghai 200093, China)
Source
《计算机应用研究》 (Application Research of Computers)
Indexed in: CSCD; Peking University Core Journals
2021, No. 3, pp. 956-960 (5 pages)