Vision-enhanced Multimodal Named Entity Recognition Based on Contrastive Learning
Abstract: Multimodal named entity recognition (MNER) aims to detect entity spans in a given image-text pair and classify them into the corresponding entity types. Although existing MNER methods have achieved success, they all extract visual features with an image encoder and feed them directly into a cross-modal interaction mechanism, without any enhancement or filtering. Moreover, since the text and image representations come from different encoders, it is difficult to bridge the semantic gap between the two modalities. Therefore, a vision-enhanced multimodal named entity recognition model based on contrastive learning (MCLAug) is proposed. First, ResNet is used to collect image features; on this basis, a pyramid bidirectional fusion strategy is proposed that combines low-level high-resolution and high-level strongly semantic image information to enhance the visual features. Secondly, following the idea of multimodal contrastive learning in the CLIP model, a contrastive loss is computed and minimized to make the representations of the two modalities more consistent. Finally, the fused image and text representations are obtained with a cross-modal attention mechanism and a gated fusion mechanism, and a CRF decoder performs the MNER task. Comparative experiments, ablation studies and case studies on two public datasets demonstrate the effectiveness of the proposed model.
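As an illustration of two mechanisms highlighted in the abstract, the following minimal PyTorch sketch shows a CLIP-style symmetric contrastive loss between pooled text and image representations, and a cross-modal attention layer with gated fusion whose output would be passed to a CRF decoder. The module names, dimensions, pooling scheme and hyperparameters (e.g. the 0.07 temperature) are illustrative assumptions, not the authors' exact implementation.

# Minimal sketch, assuming pooled (B, D) embeddings for the contrastive loss
# and token/region sequences for the fusion layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_style_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of pooled text/image embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)     # (B, D)
    image_emb = F.normalize(image_emb, dim=-1)   # (B, D)
    logits = text_emb @ image_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Matched image-text pairs lie on the diagonal; average the
    # text-to-image and image-to-text cross-entropy terms.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2


class CrossModalGatedFusion(nn.Module):
    """Text tokens attend to visual features; a gate controls how much is merged."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_feats, visual_feats):
        # text_feats: (B, L, D) token representations from the text encoder
        # visual_feats: (B, R, D) region/grid features from the image encoder
        attended, _ = self.cross_attn(text_feats, visual_feats, visual_feats)
        g = torch.sigmoid(self.gate(torch.cat([text_feats, attended], dim=-1)))
        return g * attended + (1 - g) * text_feats   # fused token representations


# Usage with random tensors standing in for encoder outputs:
B, L, R, D = 4, 32, 49, 768
text_tokens, visual_grid = torch.randn(B, L, D), torch.randn(B, R, D)
loss = clip_style_contrastive_loss(text_tokens.mean(1), visual_grid.mean(1))
fused = CrossModalGatedFusion(D)(text_tokens, visual_grid)   # (B, L, D), fed to the CRF layer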
Authors: YU Bihui (于碧辉), TAN Shuyue (谭淑月), WEI Jingxuan (魏靖烜), SUN Linzhuang (孙林壮), BU Liping (卜立平), ZHAO Yiman (赵艺曼) (University of Chinese Academy of Sciences, Beijing 100049, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China)
Source: Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2024, Issue 6, pp. 198-205 (8 pages)
Funding: Applied Basic Research Program of Liaoning Province (2022JH2/101300258).
Keywords: Multimodal named entity recognition; CLIP; Multimodal contrastive learning; Feature pyramid; Transformer; Gated fusion mechanism