Abstract
Aiming at the problems of incomplete intra-modal information, poor inter-modal interaction, and difficulty of training in multimodal sentiment analysis, a Multimodal Sentiment analysis network with Self-supervision and Multi-layer cross attention fusion (MSSM) was proposed, applying a Visual-and-Language Pre-training (VLP) model to the field of multimodal sentiment analysis. The visual encoder module was enhanced through self-supervised learning, and multi-layer cross attention was added to better model textual and visual features, making the intra-modal information more abundant and complete, and the inter-modal information interaction more sufficient. Besides, FlashAttention, a fast and memory-efficient exact attention algorithm with IO-awareness, was adopted to address the high complexity of attention computation in Transformer. Experimental results show that, compared with the current mainstream Contrastive Language-Image Pre-training (CLIP) model, MSSM improves accuracy by 3.6 percentage points on the processed MVSA-S dataset and by 2.2 percentage points on the MVSA-M dataset, demonstrating that the proposed network can effectively improve the integrity of multimodal information fusion while reducing computational cost.
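The multi-layer cross-attention fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head, projection-free form, the residual stacking, and all dimensions here are illustrative assumptions. The core idea is that text tokens act as queries attending over image-patch keys/values, and several such layers are stacked so the two modalities interact repeatedly.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, image, d_k):
    # Queries come from one modality (text tokens), keys/values
    # from the other (image patches): scaled dot-product attention.
    scores = text @ image.T / np.sqrt(d_k)   # (n_text, n_img)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ image                   # (n_text, d_k)

rng = np.random.default_rng(0)
text_feats = rng.standard_normal((4, 8))  # 4 text tokens, dim 8 (assumed)
img_feats = rng.standard_normal((6, 8))   # 6 image patches, dim 8 (assumed)

fused = text_feats
for _ in range(3):  # "multi-layer": stack several cross-attention passes
    fused = fused + cross_attention(fused, img_feats, d_k=8)  # residual add

print(fused.shape)  # (4, 8): one fused vector per text token
```

In practice each layer would also apply learned query/key/value projections and feed-forward sublayers; the residual connection shown here is the standard Transformer convention that keeps stacked layers trainable.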
Authors
XUE Kaipeng; XU Tao; LIAO Chunjie (Institute of China National Information Technology, Northwest Minzu University, Lanzhou, Gansu 730030, China; Key Laboratory of Linguistic and Cultural Computing, Ministry of Education (Northwest Minzu University), Lanzhou, Gansu 730030, China)
Source
Journal of Computer Applications (《计算机应用》)
Indexed in CSCD; Peking University Core Journal
2024, No. 8, pp. 2387-2392 (6 pages)
Funding
Young Doctoral Fund of Higher Education Institutions of Gansu Province (2022QB-016)
Fundamental Research Funds for the Central Universities (31920230069)
Gansu Provincial Youth Science and Technology Program (21JR1RA21)
Science and Technology Project of the National Archives Administration of China (2021-X-56)
Keywords
multimodal
sentiment analysis
self-supervision
attention mechanism
Visual-and-Language Pre-training (VLP) model