Research on Video Description Based on Adaptive Frame Sampling Algorithm and Bidirectional Long Short-Term Memory

Cited by: 1

Abstract: To address the modeling complexity and low accuracy of video-to-text conversion, this paper proposes a video-to-text method based on an adaptive frame sampling algorithm and a Bidirectional Long Short-Term Memory (BLSTM) model. The adaptive frame sampling algorithm dynamically adjusts the sampling rate so that as many features as possible are available for training. The BLSTM model effectively learns relevant information from both preceding and future frames of the video. The training features are extracted from a deep convolutional neural network, so this doubly deep network structure can learn spatiotemporal correlations between video frames as well as global dependency information. Fusing frame information further increases the variety of features and thereby improves the experimental results. On the M-VAD and MPII-MD datasets, the proposed method achieves average METEOR scores of 7.8% and 8.6%, respectively, improvements of 16.4% and 21.1% over the original S2VT model, and it also improves the linguistic quality of the generated descriptions.
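The abstract does not spell out how the sampling rate is adjusted, so the following is only a minimal sketch of one plausible rule, not the authors' algorithm. It assumes a hypothetical budget `max_steps` on the number of frames the sequence model can consume, and picks the smallest integer stride that keeps the sample within that budget. The function name, the parameter name, and the default of 80 are all illustrative assumptions.

```python
import math
from typing import List


def adaptive_sample_indices(num_frames: int, max_steps: int = 80) -> List[int]:
    """Return frame indices sampled with a stride that adapts to video length.

    Hypothetical rule (an assumption, not taken from the paper): use the
    smallest integer stride that keeps the number of sampled frames at or
    below max_steps, so short clips keep every frame and long clips are
    thinned only as much as needed.
    """
    if num_frames <= 0 or max_steps <= 0:
        raise ValueError("num_frames and max_steps must be positive")
    # Stride 1 for short videos; grows with length for long videos.
    stride = max(1, math.ceil(num_frames / max_steps))
    return list(range(0, num_frames, stride))


if __name__ == "__main__":
    for n in (50, 400, 401):
        idx = adaptive_sample_indices(n)
        print(n, "frames ->", len(idx), "sampled, stride", idx[1] - idx[0])
```

Under this assumed rule, a 50-frame clip keeps all of its frames, while a 400-frame clip is sampled every fifth frame. This matches the stated goal of supplying as many features as possible for training, whereas a fixed stride would discard frames from short clips unnecessarily.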

Authors: ZHANG Rongfeng; NING Peiyang; XIAO Huanhou; SHI Jinglun; QIU Wei (School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China)

Affiliation: School of Electronic and Information Engineering, South China University of Technology

Source: Journal of South China University of Technology (Natural Science Edition), 2018, No. 1, pp. 103-111 (9 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (61671213); Guangzhou Key Laboratory of Body Data Science (201605030011)

Keywords: video to text; adaptive frame sampling; bidirectional LSTM; deep convolutional neural networks; fusion of frame information

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]