Abstract
Fine-grained named entity recognition (NER) identifies entity information in poverty alleviation texts in the audit domain, which is crucial for improving the analysis and evaluation of poverty alleviation policy effectiveness. In recent years, deep learning has achieved notable results on fine-grained NER tasks, but specialised domains still face a shortage of corpora, aggravated incompatibility of fine-grained features under transfer learning, and data imbalance. To address these problems, we define a fine-grained label scheme for poverty alleviation audit entities and construct a fine-grained poverty alleviation audit corpus (FG-PAudit-Corpus) to ease the scarcity of datasets in the audit domain. We propose FGATSC, a fine-grained entity recognition model based on sample-contribution adversarial transfer. The model is trained with adversarial transfer and incorporates sample contribution weights into the transferred features to resolve the incompatibility of fine-grained features. In addition, to counter the imbalance between the high-resource source domain and the low-resource poverty alleviation audit domain, we propose a balanced resource adversarial discriminator (BRAD) that reduces its effect. Experimental results show that FGATSC achieves an F1 score of 75.83% on FG-PAudit-Corpus, 9.03% higher than the baseline model and 4.01% to 6.53% higher than other mainstream models. In a generalisation test on the Resume dataset, its F1 score reaches 95.77%, about 0.14% to 1.31% higher than mainstream models of recent years. These results verify the effectiveness and generalisability of the FGATSC model.
Authors
PANG Bowen (庞博文); CHEN Yifei (陈一飞); HUANG Jia (黄佳)
School of Computer Science, Nanjing Audit University, Nanjing 211815, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2024, Issue S02, pp. 136-143 (8 pages)
Keywords
Fine-grained entity recognition
Poverty alleviation auditing
Adversarial training
Sample contribution
Balanced resources