Abstract
To address the problem of representing complex information in automatic video description, a method for extracting and fusing multi-dimensional and multi-modal visual features was proposed. First, multi-dimensional features of the video sequence, such as static and dynamic features, were extracted by transfer learning, and an image captioning algorithm was used to extract the semantic information of the video's key frames, completing the feature representation of the video. Then, a multi-layer long short-term memory (LSTM) network was used to fuse the multi-dimensional and multi-modal information and finally generate a natural-language description of the video content. Simulation experiments show that, compared with existing methods, the proposed method achieves better results on the automatic video description task.
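The abstract describes a pipeline in which pre-extracted static, dynamic, and key-frame semantic features are fused by a multi-layer LSTM decoder that emits the description word by word. The following is a minimal PyTorch sketch of that fusion-and-decoding stage only; all module names, dimensions, and the additive fusion scheme are illustrative assumptions and not the authors' actual implementation.

```python
# Minimal sketch (assumed PyTorch implementation, hypothetical dimensions):
# fuse static (2D-CNN), dynamic (3D-CNN/motion) and semantic (key-frame caption)
# features, then decode a word sequence with a multi-layer LSTM.
import torch
import torch.nn as nn


class MultiModalVideoCaptioner(nn.Module):
    def __init__(self, static_dim=2048, dynamic_dim=1024, semantic_dim=300,
                 hidden_dim=512, vocab_size=10000, num_layers=2):
        super().__init__()
        # Project each modality into a common hidden space before fusion.
        self.static_proj = nn.Linear(static_dim, hidden_dim)
        self.dynamic_proj = nn.Linear(dynamic_dim, hidden_dim)
        self.semantic_proj = nn.Linear(semantic_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        # Multi-layer LSTM consumes the fused visual context together with
        # the embedding of the previous word at every decoding step.
        self.lstm = nn.LSTM(hidden_dim * 2, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, static_feat, dynamic_feat, semantic_feat, captions):
        # Simple additive fusion of the projected modalities
        # (one possible choice; the paper's exact fusion may differ).
        fused = (self.static_proj(static_feat)
                 + self.dynamic_proj(dynamic_feat)
                 + self.semantic_proj(semantic_feat))            # (B, H)
        words = self.embed(captions)                              # (B, T, H)
        context = fused.unsqueeze(1).expand(-1, words.size(1), -1)
        out, _ = self.lstm(torch.cat([words, context], dim=-1))   # (B, T, H)
        return self.classifier(out)                               # (B, T, V)


# Toy forward pass with random pre-extracted features.
model = MultiModalVideoCaptioner()
logits = model(torch.randn(4, 2048), torch.randn(4, 1024),
               torch.randn(4, 300), torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 10000])
```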
Authors
DING Enjie, LIU Zhongyu, LIU Yafeng, YU Wanli
IoT/Perception Mine Research Center, China University of Mining & Technology, Xuzhou 221008, China; Institute of Electrodynamics and Microelectronics, University of Bremen, Bremen 28359, Germany
Source
Journal on Communications (《通信学报》)
Indexed in: EI, CSCD, Peking University Core Journals
2020, No. 2, pp. 36-43 (8 pages)
Funding
National Key Research and Development Program of China (No. 2017YFC0804400, No. 2017YFC0804401)
Keywords
video description
multimodal
transfer learning
long short-term memory network
recurrent neural network