
Video caption model with scene factors
Abstract: A video content understanding model that fuses scene factors is proposed. First, global features are extracted with ResNet, and deep scene features are extracted with Places365-CNNs via transfer learning. Then, a multilayer perceptron generates the corresponding scene vector, which is fed as input to an LSTM network that performs encoding-decoding of video frames and their description sentences. Finally, after pre-training on the MSCOCO dataset, the model generates accurate and specific description sentences for the key frames of a video, so that the audience can understand its detailed content. The proposed model is trained and tested on the Flickr8K, Flickr30K and MSCOCO datasets and on the video 'Roof of the World' (《第三极》), and is validated with several evaluation metrics. The results show that the output sentences describe the video accurately, and that the proposed model's performance improves on existing models.
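To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch. All layer sizes, the fusion scheme (initializing the LSTM state from the summed feature projections), and names such as SceneFusionCaptioner and scene_mlp are assumptions, not the paper's exact architecture; an ImageNet-pretrained ResNet also stands in for the Places365-CNNs scene backbone, whose weights are distributed separately.

```python
# Minimal sketch of the fusion pipeline described in the abstract.
# Assumptions (not from the paper): layer sizes, the fusion scheme, and the
# use of an ImageNet ResNet as a stand-in for the Places365-CNNs backbone.
import torch
import torch.nn as nn
import torchvision.models as models


def truncated_resnet():
    """ResNet with its classification head removed -> 2048-d feature."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    return nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer


class SceneFusionCaptioner(nn.Module):
    """Global ResNet feature + MLP-projected scene feature -> LSTM decoder."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, scene_dim=512):
        super().__init__()
        self.global_cnn = truncated_resnet()   # global appearance feature
        self.scene_cnn = truncated_resnet()    # stand-in for Places365-CNNs
        # Multilayer perceptron that turns the raw scene feature into the
        # "scene vector" fed to the LSTM, as the abstract describes.
        self.scene_mlp = nn.Sequential(
            nn.Linear(2048, scene_dim), nn.ReLU(),
            nn.Linear(scene_dim, hidden_dim),
        )
        self.global_proj = nn.Linear(2048, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, 3, 224, 224) key frames; captions: (B, T) token ids
        g = self.global_cnn(frames).flatten(1)      # (B, 2048) global feature
        s = self.scene_cnn(frames).flatten(1)       # (B, 2048) scene feature
        # One plausible fusion: initialize the LSTM state from both features.
        h0 = torch.tanh(self.global_proj(g) + self.scene_mlp(s)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                  # (B, T, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                     # (B, T, vocab_size)
```

Under these assumptions, a training step would compute logits = model(frames, captions[:, :-1]) and take cross-entropy against captions[:, 1:]; at inference the decoder would instead be unrolled token by token from the fused initial state.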
Authors: PENG Yuqing, LIU Xuan, WANG Weihua, ZHAO Xiaosong, WEI Ming (School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China; Hebei Provincial Key Laboratory of Big Data Computing, Hebei University of Technology, Tianjin 300401, China)
Source: China Sciencepaper (《中国科技论文》), CAS, PKU Core Journal, 2018, Issue 14, pp. 1584-1589 (6 pages)
Funding: Youth Fund of the Hebei Education Department (QN2017314); Key Project of the Hebei Natural Science Foundation (F2016202144)
Keywords: video caption; deep neural network; semantic information; convolutional neural network; recurrent neural network