
Video Summarization Based on Spacial-temporal Transform Network
Original Chinese title: 基于空时变换网络的视频摘要生成 (cited by: 2)
Abstract: Video summarization is a key task in computer vision. Its goal is to produce a concise yet complete summary of a video by selecting its most informative parts. The summary is typically either a set of representative frames (such as keyframes) or a shorter video formed by stitching key segments together in temporal order. Although research on video summarization has made considerable progress, existing methods suffer from insufficient temporal information and incomplete feature representation, both of which can degrade the correctness and completeness of the summary. To address these two problems, this study proposes a spatiotemporal transform network with three modules: an embedding layer, a feature transformation and fusion layer, and an output layer. The embedding layer embeds both spatial and temporal features; the feature transformation and fusion layer transforms and fuses the multi-modal features; and the output layer generates the summary through segment prediction and key shot selection. Embedding spatial and temporal features separately compensates for the weak temporal representation of existing models, and transforming and fusing multi-modal features addresses incomplete feature representation. Extensive experiments and analyses on two benchmark datasets verify the effectiveness of the proposed model.
Authors: LI Qun (李群); XIAO Fu (肖甫); ZHANG Zi-Yi (张子屹); ZHANG Feng (张锋); LI Yan-Chao (李延超). School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2022, Issue 9, pp. 3195-3209 (15 pages)
Funding: National Natural Science Foundation of China (61906099, 61906098).
Keywords: video summarization; spacial-temporal transform network; ViLBERT; feature fusion; multi-modal
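
The abstract states that the output layer builds the summary through segment prediction followed by key shot selection, but it does not say how the shots are chosen. As an illustration only, the sketch below assumes the selection is a 0/1 knapsack over per-shot importance scores under a total length budget, a formulation often used in video summarization; the function name `select_key_shots` and the parameters `scores`, `lengths`, and `budget` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def select_key_shots(
    scores: Sequence[float], lengths: Sequence[int], budget: int
) -> List[int]:
    """Pick shot indices that maximize total score with total length <= budget.

    Assumption: key shot selection is modeled as a 0/1 knapsack. The paper's
    abstract does not confirm this; it is a common choice, used here as a sketch.
    """
    n = len(scores)
    # best[i][c] = best total score using only the first i shots within capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip shot i-1
            if length <= c:
                candidate = best[i - 1][c - length] + score  # option 2: take it
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Walk back through the table to recover which shots were taken.
    chosen: List[int] = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


if __name__ == "__main__":
    # Four hypothetical shots with predicted scores and lengths in frames.
    picked = select_key_shots([0.9, 0.2, 0.8, 0.4], [40, 30, 50, 20], budget=60)
    print(picked)
```

In this toy input, shots 0 and 3 fill the 60-frame budget exactly and together score higher than any other feasible combination, so they are the ones returned. Sorting the result keeps the chosen shots in temporal order, matching the abstract's description of stitching key segments in time sequence.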