Abstract
Video summarization extracts key frames or key shots from a long source video to produce a concise, compact summary that conveys the video's main content while greatly reducing the time users spend browsing. Existing video summarization algorithms generally ignore the motion information in the video, so their summaries lack logical coherence and narrative continuity. To address this problem, this paper proposes a dynamic video summarization algorithm based on multi-modal feature fusion (MFFSN), built on a supervised encoder-decoder framework. At the encoding end, deep neural networks extract multi-scale spatial features from the original video frames and multi-scale motion features from the optical flow images. A Motion Guided Attention (MGA) module then models spatio-temporal attention and fuses the spatial and motion features into multi-modal features. At the decoding end, a self-attention mechanism focuses on the salient features in the data, and a regression network produces frame importance scores. Finally, key shots are selected with the knapsack algorithm to generate the dynamic summary. Experimental results on the SumMe benchmark dataset show that the proposed MFFSN algorithm outperforms existing video summarization algorithms of the same type.
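The final step of the abstract, selecting key shots with the knapsack algorithm, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the averaging of frame scores into shot scores, and the frame budget are assumptions made for illustration. The abstract states only that a knapsack algorithm selects key shots from frame importance scores.

```python
from typing import List, Sequence, Tuple


def shot_scores_from_frames(
    frame_scores: Sequence[float],
    boundaries: Sequence[Tuple[int, int]],
) -> List[float]:
    """Average the frame importance scores inside each shot [start, end).

    Averaging is an assumption; the abstract does not say how frame scores
    are aggregated per shot.
    """
    return [sum(frame_scores[s:e]) / (e - s) for s, e in boundaries]


def select_key_shots(
    shot_scores: Sequence[float],
    shot_lengths: Sequence[int],
    budget: int,
) -> List[int]:
    """0/1 knapsack: pick shots maximizing total score within `budget` frames."""
    n = len(shot_scores)
    # best[i][c] = best total score using the first i shots and at most c frames
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = shot_lengths[i - 1], shot_scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip this shot
            if length <= c:
                candidate = best[i - 1][c - length] + score  # option 2: take it
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Walk back through the table to recover which shots were taken.
    selected: List[int] = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            selected.append(i - 1)
            c -= shot_lengths[i - 1]
    selected.reverse()
    return selected


if __name__ == "__main__":
    # Hypothetical shot scores and lengths (in frames); budget of 50 frames.
    print(select_key_shots([0.9, 0.2, 0.8, 0.4], [30, 20, 40, 10], 50))  # [0, 3]
```

In this toy input, shots 0 and 3 together use 40 frames for a total score of 1.3, which beats every other combination that fits in 50 frames. In practice the budget is usually a fixed fraction of the video length; the value used here is arbitrary.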
Source
《工业控制计算机》
2022, No. 10, pp. 81-84 (4 pages)
Industrial Control Computer
Keywords
video summarization
multi-modal feature fusion
optical flow
attention mechanism