Abstract
A multi-feature video captioning algorithm is proposed for the video captioning task. Visual and audio features of video clips are analyzed, extracted, and fused to form a rich semantic representation of the video. Encoding and decoding are performed in an encoder-decoder framework built on long short-term memory (LSTM) networks, which preserves the temporal structure captured from each video frame sequence. By embedding an attention mechanism, the model gains the ability to focus on the moments when important information appears, so the generated captions are more accurate and richer. The proposed method is evaluated on the public video captioning dataset MSR-VTT, measuring the caption accuracy of each feature model; compared with the baseline methods, better results are obtained.
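The attention mechanism described above lets the decoder weight the fused per-frame features at each decoding step. A minimal numpy sketch of one such soft-attention step is shown below; all dimensions, weight matrices, and the name `soft_attention` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def soft_attention(frame_feats, decoder_state, W, U, v):
    """One soft-attention step over fused (visual + audio) frame features.

    frame_feats   : (T, D) fused feature vector for each of T frames (assumed)
    decoder_state : (H,)   current LSTM decoder hidden state
    W, U, v       : illustrative attention parameters of an additive scorer
    """
    # Additive (Bahdanau-style) scores: one scalar per frame.
    scores = np.tanh(frame_feats @ U + decoder_state @ W) @ v      # (T,)
    # Softmax turns scores into attention weights over the frames.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                        # (T,), sums to 1
    # Context vector: attention-weighted sum of the frame features.
    context = weights @ frame_feats                                 # (D,)
    return context, weights

# Toy example with assumed dimensions.
rng = np.random.default_rng(0)
T, D, H, A = 5, 8, 6, 4          # frames, feature dim, hidden dim, attention dim
feats = rng.standard_normal((T, D))
state = rng.standard_normal(H)
W = rng.standard_normal((H, A))
U = rng.standard_normal((D, A))
v = rng.standard_normal(A)
ctx, w = soft_attention(feats, state, W, U, v)
```

At each word-generation step the decoder would feed `ctx` (together with the previous word embedding) into the LSTM cell, so frames carrying important visual or audio cues receive larger weights.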
Authors
Cao Lei; Wan Wanggen; Hou Li (School of Communication and Information Engineering, Shanghai University, Shanghai 200072, China; Institute of Smart City, Shanghai University, Shanghai 200072, China; School of Information Engineering, Huangshan University, Huangshan 245041, China)
Source
Electronic Measurement Technology, 2020, No. 16, pp. 99-103 (5 pages)
Funding
Supported by the Shanghai Science and Technology Committee Hong Kong, Macao and Taiwan Science and Technology Cooperation Project (18510760300), the Natural Science Foundation of Anhui Province (1908085MF178), and the Anhui Provincial Outstanding Young Talents Support Program (gxyqZD2019069).
Keywords
video captioning
multi-feature
long short-term memory
attention mechanism