Abstract
To address biases in visual semantic understanding and in multimodal semantics in multimodal named entity recognition, this paper proposes a confidence learning guided label fusion (CLGLF) method. The method uses the BLIP-2 pre-trained model to generate an image caption, concatenates the caption with the input text, and jointly encodes the pair to fuse multimodal features. Decoding the multimodal representation and the text representation yields candidate labels and text labels, respectively. After the two sets of labels are aligned with a KL divergence loss, a confidence score is computed to assess the quality of the multimodal representation, and a confidence threshold is used to identify biased candidate labels. Each biased candidate label is replaced by the text label at the same position, which fuses the two label sequences and produces the final recognition result. The method is evaluated on the Twitter-2015 and Twitter-2017 multimodal datasets and compared with 7 mainstream methods, including MSB and UMT; the experimental results demonstrate the effectiveness of CLGLF.
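The label fusion step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the confidence score is computed, so this sketch assumes it is the maximum softmax probability of the multimodal decoder at each token; the names `fuse_labels`, `kl_divergence`, and `threshold`, the tag set, and the logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions (alignment signal)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fuse_labels(mm_logits, text_logits, tags, threshold=0.5):
    """Keep the multimodal (candidate) label when its confidence reaches the
    threshold; otherwise fall back to the text label at the same position.

    Assumption: confidence = max softmax probability of the multimodal branch.
    """
    fused = []
    for mm, tx in zip(mm_logits, text_logits):
        mm_probs = softmax(mm)
        tx_probs = softmax(tx)
        confidence = max(mm_probs)
        if confidence >= threshold:
            idx = mm_probs.index(confidence)      # trust the candidate label
        else:
            idx = tx_probs.index(max(tx_probs))   # replace with the text label
        fused.append(tags[idx])
    return fused


# Toy example with three tokens and three tags (made-up numbers).
tags = ["O", "B-PER", "B-LOC"]
mm_logits = [[0.1, 3.0, 0.2], [0.5, 0.6, 0.4], [2.5, 0.1, 0.0]]
text_logits = [[0.0, 2.0, 0.1], [0.2, 0.1, 3.0], [2.0, 0.3, 0.1]]

fused = fuse_labels(mm_logits, text_logits, tags, threshold=0.6)
print(fused)  # ['B-PER', 'B-LOC', 'O']
```

In this toy input the second token has a nearly flat multimodal distribution, so its candidate label falls below the threshold and the text label is used instead. `kl_divergence` is included only to show the kind of quantity a KL alignment loss would minimize between the two branches during training; the paper's actual loss formulation may differ.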
Authors
WANG Hai-rong (王海荣)
WANG Tong (王彤)
XU Xi (徐玺)
JING Bo-xiang (荆博祥)
CHEN Fang-ping (陈芳萍)
School of Computer Science and Engineering, North Minzu University, Yinchuan, Ningxia 750021, China; Laboratory of Image & Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan, Ningxia 750021, China
Source
Acta Electronica Sinica (《电子学报》)
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2024, No. 7, pp. 2429-2437 (9 pages)
Funding
Natural Science Foundation of Ningxia (No. 2023AAC03316)
Graduate Innovation Project of North Minzu University (No. YCX23159)
Keywords
multimodal named entity recognition
image caption
confidence learning
multimodal semantic bias
information extraction