
Video captioning algorithm based on mixed training and semantic association
Abstract: Current mainstream methods model the dependencies among sequence words with Transformer self-attention units or long short-term memory (LSTM) units, but they ignore the semantic relationships between words within a sentence and suffer from exposure bias between the training and testing phases. To address these problems, a video captioning algorithm based on mixed training and semantic association (DC-RL) was proposed. In the encoder, a bidirectional LSTM recurrent neural network (LSTM1) fuses the appearance features and action features obtained from pre-trained models. In the decoder, an attention mechanism dynamically extracts, for both the global semantic decoder and the self-learning decoder, the visual features corresponding to the currently generated word, alleviating the exposure bias caused by the discrepancy between training and testing in the traditional global semantic decoder. The global semantic decoder uses the ground-truth word from the previous time step to drive the generation of the current word, and a global semantic extractor supplies the global semantic information corresponding to the current word to assist its generation. The self-learning decoder, in contrast, uses the semantic information of the word it generated at the previous time step to drive the generation of the current word. The mixed-training fusion network is then optimized directly by reinforcement learning and exploits the semantic information of preceding words, enabling it to generate more accurate video captions. Results show that on the MSR-VTT dataset, the fusion network model improves over the baseline by 2.3%, 0.3%, 1.0%, and 1.9% on the B4, M, R, and C metrics, respectively, and the fusion network model optimized with reinforcement learning improves by 2.0%, 0.5%, 1.9%, and 6.1%, respectively.
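As a reading aid, the following is a minimal PyTorch-style sketch of the two core components the abstract describes: a bidirectional LSTM encoder that fuses pre-extracted appearance and action features, and a single attention-driven decoding step of the kind the global semantic decoder and the self-learning decoder would share. All module names, dimensions, and the additive-attention form are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the DC-RL encoder/decoder pipeline from the abstract.
# All names, dimensions, and the attention form are assumptions for
# illustration; they are not taken from the paper itself.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Fuse frame-level appearance and action features with a Bi-LSTM."""
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.proj = nn.Linear(2 * feat_dim, hidden)   # concat -> shared space
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, appearance, action):
        # appearance, action: (B, T, feat_dim) features from pre-trained models
        fused = self.proj(torch.cat([appearance, action], dim=-1))
        states, _ = self.lstm(fused)                  # (B, T, 2 * hidden)
        return states

class AttnDecoderStep(nn.Module):
    """One decoding step: attend over encoder states, predict the next word."""
    def __init__(self, enc_dim=1024, hidden=1024, embed_dim=300, vocab=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.attn = nn.Linear(enc_dim + hidden, 1)    # additive attention score
        self.cell = nn.LSTMCell(embed_dim + enc_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, prev_word, enc_states, state):
        h, c = state                                  # (B, hidden) each
        # Score every frame against the current decoder hidden state.
        expanded = h.unsqueeze(1).expand(-1, enc_states.size(1), -1)
        weights = self.attn(
            torch.cat([enc_states, expanded], dim=-1)).softmax(dim=1)
        ctx = (weights * enc_states).sum(dim=1)       # (B, enc_dim) visual context
        h, c = self.cell(torch.cat([self.embed(prev_word), ctx], dim=-1), (h, c))
        return self.out(h), (h, c)                    # vocabulary logits, new state

# Smoke test with random features (B=2 clips, T=26 frames).
enc, dec = BiLSTMEncoder(), AttnDecoderStep()
states = enc(torch.randn(2, 26, 2048), torch.randn(2, 26, 2048))
h = c = torch.zeros(2, 1024)
logits, (h, c) = dec(torch.ones(2, dtype=torch.long), states, (h, c))
```

Under this sketch, the global semantic decoder would feed prev_word from the ground-truth caption (teacher forcing), while the self-learning decoder feeds back its own previous prediction; the fused model would then be fine-tuned with a policy-gradient objective (for example, a self-critical CIDEr reward, a common choice that the abstract does not specify).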
Authors: CHEN Shuqin; ZHONG Xian; HUANG Wenxin; LU Yansheng (School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China; School of Computer Science, Hubei University of Education, Wuhan 430205, China; School of Information Science and Technology, Peking University, Beijing 100091, China; School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China)
Source: Journal of Huazhong University of Science and Technology (Natural Science Edition), 2023, No. 11, pp. 67-74 (8 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (62271361); Natural Science Foundation of Hubei Province (2023AFB206, 2021CFB513, 2021CFB281); Key Scientific Research Project of the Hubei Provincial Department of Education (D20213002).
Keywords: video captioning; contextual semantics; dual-stream decoder; mixed training; exposure bias