Abstract
Most existing multi-modal sentiment analysis methods use separate models to extract features, so feature extraction proceeds independently for each modality; because of the naturally large gap between modalities, the resulting features are then difficult to fuse effectively. To make full use of multi-modal information and enable more effective modal interaction, this paper proposes SCA-CLIP, a multi-modal sentiment analysis network based on contrastive language-image pre-training (CLIP). CLIP-based encoders are used to extract strongly correlated deep representations from images and text; a designed stacked cross-attention mechanism then lets the cross-modal information interact and fuse fully, while a sequence of learnable vectors is maintained throughout the model and updated with BERT's multi-head attention mechanism to capture salient information. Extensive experiments are conducted on typical sentiment analysis datasets. The results show that the proposed framework is better at mining the key features for multi-modal sentiment analysis and outperforms previous work, improving the overall accuracy on MVSA-Single and MVSA-Multiple by 2.51% and 1.3%, respectively.
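The abstract names three components: CLIP encoders, a stacked cross-attention fusion module, and a learnable vector sequence updated by multi-head attention. The following PyTorch sketch illustrates one plausible form of such a fusion head; the class name StackedCrossAttentionFusion, the feature dimensions, the number of queries and layers, and the 3-class output are all assumptions for illustration, not the paper's actual SCA-CLIP implementation. It takes pre-extracted CLIP image and text token features as input.

```python
# Hypothetical sketch (not the authors' code): learnable query vectors repeatedly
# attend to concatenated CLIP image/text features, then a classifier predicts sentiment.
import torch
import torch.nn as nn


class StackedCrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_queries=8, num_heads=8, num_layers=3, num_classes=3):
        super().__init__()
        # Learnable vector sequence maintained across the model (assumed size).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "self_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm1": nn.LayerNorm(dim),
                "norm2": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
                "norm3": nn.LayerNorm(dim),
            })
            for _ in range(num_layers)
        ])
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, image_feats, text_feats):
        # image_feats: (B, Ni, dim) CLIP image token features
        # text_feats:  (B, Nt, dim) CLIP text token features
        context = torch.cat([image_feats, text_feats], dim=1)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        for layer in self.layers:
            # Queries gather cross-modal information from the joint image-text context.
            attn_out, _ = layer["cross_attn"](q, context, context)
            q = layer["norm1"](q + attn_out)
            # Multi-head self-attention over the learnable query sequence.
            self_out, _ = layer["self_attn"](q, q, q)
            q = layer["norm2"](q + self_out)
            q = layer["norm3"](q + layer["ffn"](q))
        # Pool the query sequence and predict sentiment logits.
        return self.classifier(q.mean(dim=1))


# Example with dummy CLIP-sized features (batch of 2, ViT patch tokens and 77 text tokens).
fusion = StackedCrossAttentionFusion()
logits = fusion(torch.randn(2, 50, 512), torch.randn(2, 77, 512))
print(logits.shape)  # torch.Size([2, 3])
```

In this reading, the small set of learnable queries acts as a bottleneck that both modalities write into at every layer, which is one common way to realize the "maintained learnable vector sequence" the abstract describes.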
Authors
Wang Zhaokai; Ye Yong; Wang Ziwen (School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China)
Source
Journal of Heilongjiang University of Technology (Comprehensive Edition), 2023, No. 11, pp. 97-104 (8 pages)
Funding
Quality Engineering Project of Anhui Provincial Higher Education Institutions, "Industry-Education Integration Model Innovation Project Serving Anhui's New-Generation Information Technology" (Project No. 2022sdxx012)
Key Project of University-Level Humanities and Social Sciences Research of the Anhui Provincial Department of Education (Project No. 2023AH050966).