
Study on Question Processing Algorithms in Visual Question Answering

Abstract: Research on modeling Visual Question Answering (VQA) is diverse, but existing VQA models share a common drawback: training and inference are time-consuming. Studies show that the text-processing component of VQA models is mostly based on Long Short-Term Memory (LSTM) networks, and that the overall performance of a VQA model is constrained by this LSTM component. Because LSTM networks are recurrent, their complex data flow cannot make effective use of GPU parallelism to accelerate computation. To address this problem and speed up training, this paper proposes a new model, SCMP (Simple Conv1d MaxPool1d), which replaces the LSTM network in processing the natural-language questions fed to the model. Experiments on the VQA2.0 dataset show that SCMP trains 10 times faster than existing models without any loss of VQA accuracy. In addition, the paper proposes a novel data augmentation method for the question text in the VQA2.0 dataset. Experiments show that data augmentation improves the accuracy of VQA models and accelerates convergence; the SCMP model trained on the augmented data achieves an evaluation score of 63.46% on the validation set, outperforming existing VQA models.
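The abstract does not give SCMP's layer configuration. The following is a minimal pure-Python sketch, under stated assumptions, of the general Conv1d + MaxPool1d idea it names: a 1D convolution slides over the word embeddings of a question, and a global max-pool over time reduces the result to a fixed-length question vector. The filter count, kernel width, ReLU activation, toy embeddings and the names `conv1d`, `global_max_pool1d` and `encode_question` are hypothetical illustrations, not the authors' implementation.

```python
"""Hypothetical sketch of a Conv1d + MaxPool1d question encoder.

Assumptions (not from the paper): ReLU activation, 'valid' convolution,
global max-pooling over time, and the toy sizes used below.
"""
from typing import Dict, List

Matrix = List[List[float]]


def conv1d(x: Matrix, filters: List[Matrix], bias: List[float]) -> Matrix:
    """Valid 1D convolution over time followed by ReLU.

    x: T x D word embeddings; filters: F filters, each K x D.
    Returns a (T - K + 1) x F feature map.
    """
    k = len(filters[0])
    if len(x) < k:
        raise ValueError("question is shorter than the kernel width")
    dim = len(x[0])
    out = []
    # Each output position depends only on its own window of K words, so
    # positions are independent of each other (unlike LSTM time steps).
    for t in range(len(x) - k + 1):
        row = []
        for w, b in zip(filters, bias):
            s = b
            for i in range(k):
                for d in range(dim):
                    s += w[i][d] * x[t + i][d]
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out


def global_max_pool1d(feats: Matrix) -> List[float]:
    """Max over the time axis: one value per filter, for any question length."""
    return [max(col) for col in zip(*feats)]


def encode_question(tokens: List[str], embeddings: Dict[str, List[float]],
                    filters: List[Matrix], bias: List[float]) -> List[float]:
    """Embed tokens, convolve, then max-pool into a fixed-size vector."""
    x = [embeddings[tok] for tok in tokens]
    return global_max_pool1d(conv1d(x, filters, bias))


# Toy 2-dimensional embeddings and two filters of width 2 (illustrative only).
EMBEDDINGS = {"what": [1.0, 0.0], "color": [0.0, 1.0],
              "is": [0.5, 0.5], "it": [1.0, 1.0]}
FILTERS = [[[1.0, 0.0], [0.0, 1.0]],
           [[0.0, -1.0], [-1.0, 0.0]]]
BIAS = [0.0, 0.5]

print(encode_question(["what", "color", "is", "it"], EMBEDDINGS, FILTERS, BIAS))
# [2.0, 0.5]
```

Because every convolution window is computed without waiting for the previous one, a framework implementation (for example PyTorch's `nn.Conv1d` and `nn.MaxPool1d`) can evaluate all positions in parallel on a GPU, which is the property the abstract contrasts with the step-by-step recurrence of an LSTM.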
Authors: XU Sheng (徐胜), ZHU Yong-xin (祝永新) (Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China; University of Chinese Academy of Sciences, Beijing 100049, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University core journal, 2020, No. 11, pp. 226-230 (5 pages)
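Funding: National Natural Science Foundation of China (U1831118); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA19000000, XDA19090106); Research Program of the Science and Technology Commission of Shanghai Municipality (18511103502).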
Keywords: Visual question answering; Natural language processing; CNN; LSTM; Word embedding