Vision-enhanced Multimodal Named Entity Recognition Based on Contrastive Learning
Abstract: Multimodal named entity recognition (MNER) aims to detect entity spans in a given image-text pair and classify them into the corresponding entity types. Although existing MNER methods have achieved success, they all extract visual features with an image encoder and feed them directly into a cross-modal interaction mechanism, without any enhancement or filtering. Moreover, since the text and image representations come from different encoders, it is difficult to bridge the semantic gap between the two modalities. Therefore, a vision-enhanced multimodal named entity recognition model based on contrastive learning (MCLAug) is proposed. First, ResNet is used to collect image features; on this basis, a pyramid bidirectional fusion strategy is proposed that combines low-level high-resolution and high-level strongly semantic image information to enhance the visual features. Secondly, following the idea of multimodal contrastive learning in the CLIP model, a contrastive loss is computed and minimized to make the representations of the two modalities more consistent. Finally, the fused image and text representations are obtained with a cross-modal attention mechanism and a gated fusion mechanism, and a CRF decoder performs the MNER task. Comparative experiments, ablation studies and case studies on two public datasets demonstrate the effectiveness of the proposed model.
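As an illustration of two mechanisms highlighted in the abstract, the following minimal PyTorch sketch shows a CLIP-style symmetric contrastive loss between pooled text and image representations, and a cross-modal attention layer with gated fusion whose output would be passed to a CRF decoder. The module names, dimensions, pooling scheme and hyperparameters (e.g. the 0.07 temperature) are illustrative assumptions, not the authors' exact implementation.

# Minimal sketch, assuming pooled (B, D) embeddings for the contrastive loss
# and token/region sequences for the fusion layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_style_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of pooled text/image embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)     # (B, D)
    image_emb = F.normalize(image_emb, dim=-1)   # (B, D)
    logits = text_emb @ image_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Matched image-text pairs lie on the diagonal; average the
    # text-to-image and image-to-text cross-entropy terms.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2


class CrossModalGatedFusion(nn.Module):
    """Text tokens attend to visual features; a gate controls how much is merged."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_feats, visual_feats):
        # text_feats: (B, L, D) token representations from the text encoder
        # visual_feats: (B, R, D) region/grid features from the image encoder
        attended, _ = self.cross_attn(text_feats, visual_feats, visual_feats)
        g = torch.sigmoid(self.gate(torch.cat([text_feats, attended], dim=-1)))
        return g * attended + (1 - g) * text_feats   # fused token representations


# Usage with random tensors standing in for encoder outputs:
B, L, R, D = 4, 32, 49, 768
text_tokens, visual_grid = torch.randn(B, L, D), torch.randn(B, R, D)
loss = clip_style_contrastive_loss(text_tokens.mean(1), visual_grid.mean(1))
fused = CrossModalGatedFusion(D)(text_tokens, visual_grid)   # (B, L, D), fed to the CRF layer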
Authors: YU Bihui (于碧辉), TAN Shuyue (谭淑月), WEI Jingxuan (魏靖烜), SUN Linzhuang (孙林壮), BU Liping (卜立平), ZHAO Yiman (赵艺曼) (University of Chinese Academy of Sciences, Beijing 100049, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2024, Issue 6, pp. 198-205 (8 pages)
Funding: Applied Basic Research Program of Liaoning Province (2022JH2/101300258).
Keywords: Multimodal named entity recognition; CLIP; Multimodal contrastive learning; Feature pyramid; Transformer; Gated fusion mechanism