Abstract
With the rise and continued development of deep learning, research on visual question answering (VQA) has made significant progress. Most current VQA models introduce attention mechanisms and related iterative operations to extract correlations between image regions and high-frequency question word pairs, but they are less effective at capturing the spatial-semantic associations between image and question, which limits answer accuracy. To address this, a VQA model based on the MobileNetV3 network and attentional feature fusion is proposed. First, to optimize the image feature extraction module, the MobileNetV3 network is introduced and a spatial pyramid pooling structure is added, reducing the model's computational complexity while preserving accuracy. In addition, the output classifier is improved: its feature fusion step is replaced with an attention-based feature fusion connection, raising question-answering accuracy. Finally, comparative experiments on the public VQA 2.0 dataset show that the proposed model outperforms current mainstream models.
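The abstract's key architectural idea is that a spatial pyramid pooling (SPP) layer lets the MobileNetV3 backbone emit a fixed-length image feature regardless of input resolution, which is what keeps the downstream fusion/classifier simple. The paper's exact layer configuration is not given here; the following is a minimal numpy sketch of generic SPP (pyramid levels and max-pooling are illustrative assumptions, not the authors' stated settings):

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map over an n x n grid for each
    pyramid level n, then concatenate the per-bin, per-channel maxima.
    Output length is C * sum(n*n for n in levels), independent of H and W.
    (Illustrative sketch only; levels are assumed, not from the paper.)"""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Split each spatial dimension into n roughly equal bins.
        h_edges = np.linspace(0, h, n + 1).astype(int)
        w_edges = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                bin_region = feature_map[:, h_edges[i]:h_edges[i + 1],
                                            w_edges[j]:w_edges[j + 1]]
                pooled.append(bin_region.max(axis=(1, 2)))  # per-channel max
    return np.concatenate(pooled)

# Two differently sized inputs yield vectors of the same fixed length:
v1 = spatial_pyramid_pool(np.random.rand(8, 13, 13))
v2 = spatial_pyramid_pool(np.random.rand(8, 21, 17))
assert v1.shape == v2.shape == (8 * (1 + 4 + 16),)
```

Because the pooled vector length depends only on the channel count and the pyramid levels, the classifier that fuses image and question features can use fixed-size weight matrices even when input images vary in resolution.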
Authors
LI Kuan; ZHANG Rongfen; LIU Yuhong; LU Xinxin
(College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, Guizhou, China)
Source
Microelectronics & Computer (《微电子学与计算机》), 2022, Issue 4, pp. 83-90 (8 pages)
Funding
Supported by the Science and Technology Foundation of Guizhou Province (黔科合基础-ZK[2021]重点001).