Domain-adaptive Entity Resolution Algorithm Based on Semi-supervised Learning
Abstract: Entity resolution aims to determine whether two data entities refer to the same entity, and it is a fundamental step in many natural language processing tasks. Existing deep learning-based solutions for entity resolution typically require a large amount of annotated data; even when a pre-trained language model is used for training, thousands of labels are still needed to reach satisfactory accuracy. In real-world scenarios, such annotated data is hard to obtain. To address this problem, a domain-adaptive entity resolution model based on semi-supervised learning is proposed. First, a classifier is trained on the source domain. Domain adaptation is then used to reduce the distribution difference between the source and target domains, while soft pseudo-labels generated from the augmented target-domain data are added to the source domain for iterative training, which transfers knowledge from the source domain to the target domain. Comparison and ablation experiments are conducted on 13 datasets from the same or different domains. The results show that, compared with unsupervised baseline models, the proposed model achieves average F1 improvements of 2.84%, 9.16%, and 7.1% across multiple datasets; compared with supervised baseline models, it reaches comparable performance with only 20% to 40% of the labels. The ablation experiments further demonstrate the effectiveness of the proposed model, which obtains better entity resolution results overall (the relevant code has been open-sourced; see footnote 1).
Authors: DAI Chaofan (戴超凡); DING Huahua (丁华华) (National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073, China)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2024, No. 9, pp. 214-222 (9 pages)
Keywords: Entity resolution; Domain adaptation; Pseudo-labels; Pre-trained language model; Data augmentation
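The abstract mentions soft pseudo-labels computed on augmented target-domain data but does not say how they are produced or filtered. The sketch below is a generic illustration of that idea, not the authors' implementation. It assumes the source-trained classifier outputs a match probability for each target pair and for an augmented copy of that pair, averages the two, sharpens the result with a temperature, and keeps only confident pairs. The function names, the averaging step, the `temperature` value, and the `threshold` value are all assumptions introduced here.

```python
from typing import List, Tuple


def sharpen(p_match: float, temperature: float = 0.5) -> float:
    """Sharpen a binary match probability.

    A temperature below 1 pushes the probability toward 0 or 1.
    (Hypothetical helper; not taken from the paper.)
    """
    a = p_match ** (1.0 / temperature)
    b = (1.0 - p_match) ** (1.0 / temperature)
    return a / (a + b)


def soft_pseudo_labels(
    probs_original: List[float],
    probs_augmented: List[float],
    threshold: float = 0.9,
    temperature: float = 0.5,
) -> List[Tuple[int, float]]:
    """Return (index, soft_label) for target pairs with a confident prediction.

    probs_original[i]  : classifier match probability for target pair i
    probs_augmented[i] : match probability for an augmented copy of pair i
    Both the threshold and the averaging scheme are illustrative assumptions.
    """
    selected = []
    for i, (p, q) in enumerate(zip(probs_original, probs_augmented)):
        avg = 0.5 * (p + q)                  # combine the two views of the pair
        soft = sharpen(avg, temperature)     # soft label in [0, 1]
        confidence = max(soft, 1.0 - soft)   # distance from the undecided point 0.5
        if confidence >= threshold:
            selected.append((i, soft))
    return selected


if __name__ == "__main__":
    # Toy probabilities standing in for classifier outputs on three target pairs.
    original = [0.95, 0.50, 0.10]
    augmented = [0.85, 0.60, 0.20]
    for index, label in soft_pseudo_labels(original, augmented):
        print(index, round(label, 3))
```

In an iterative scheme like the one the abstract describes, the selected pairs and their soft labels would be appended to the source training set before the next training round, while the undecided pairs stay unlabeled.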