Abstract
To address the modeling complexity and low accuracy of video-to-text conversion, this paper proposes a video-to-text method based on an adaptive frame sampling algorithm and a bidirectional long short-term memory (BLSTM) model. The adaptive frame sampling algorithm dynamically adjusts the sampling rate to provide as many features as possible for model training, and the BLSTM model effectively learns information from both preceding and future frames of a video. Because the training features are extracted by a deep convolutional neural network, this doubly deep network structure can learn the spatio-temporal correlations among video frames as well as their global dependency information. The fusion of frame information further enriches the feature variety and improves the experimental results. On the M-VAD and MPII-MD datasets, the proposed method achieves average METEOR scores of 7.8% and 8.6%, respectively, improvements of 16.4% and 21.1% over the original S2VT model, and it also improves the linguistic quality of the generated video descriptions.
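The abstract does not specify the adaptive sampling criterion or the network dimensions, so the following is only a minimal illustrative sketch: it pairs a simple motion-driven sampling rule (an assumption; the paper's actual adaptive criterion may differ) with a PyTorch bidirectional LSTM over stand-in CNN frame features. The names adaptive_sample and BLSTMEncoder, the difference threshold, and the 4096-d feature size (typical of VGG fc7, as used by S2VT) are all hypothetical.

import numpy as np
import torch
import torch.nn as nn

def adaptive_sample(frames, base_step=8, min_step=2, thresh=12.0):
    # Sample densely where consecutive frames differ strongly (a crude
    # motion cue), sparsely otherwise; returns the kept frame indices.
    # The threshold rule is an illustrative assumption, not the paper's.
    idx, t = [0], 0
    while t < len(frames) - 1:
        diff = np.abs(frames[t + 1].astype(np.float32) -
                      frames[t].astype(np.float32)).mean()
        step = min_step if diff > thresh else base_step
        t = min(t + step, len(frames) - 1)
        idx.append(t)
    return idx

class BLSTMEncoder(nn.Module):
    # Bidirectional LSTM over per-frame CNN features; each output step
    # carries both forward (past-frame) and backward (future-frame) state.
    def __init__(self, feat_dim=4096, hidden=500):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, feats):          # feats: (batch, T, feat_dim)
        out, _ = self.lstm(feats)      # out: (batch, T, 2*hidden)
        return out

# Toy usage: sample a 120-frame clip, then encode stand-in features that
# would, in the real pipeline, come from a deep CNN and feed an
# S2VT-style sentence decoder.
frames = np.random.randint(0, 256, (120, 224, 224, 3), dtype=np.uint8)
keep = adaptive_sample(frames)
feats = torch.randn(1, len(keep), 4096)   # placeholder for CNN features
encoded = BLSTMEncoder()(feats)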
Authors
ZHANG Rongfeng; NING Peiyang; XIAO Huanhou; SHI Jinglun; QIU Wei (School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China)
Source
Journal of South China University of Technology (Natural Science Edition)
Indexed in: EI, CAS, CSCD, PKU Core Journals
2018, No. 1, pp. 103-111 (9 pages)
Funding
National Natural Science Foundation of China (61671213)
Guangzhou Key Laboratory of Body Data Science (201605030011)