Abstract
Named entity recognition is an important and fundamental task in natural language processing. Chinese government documents are an influential data resource, and the named entities they contain differ from those in the general domain. Deep learning provides technical support for entity recognition in this specific domain, but existing approaches require large, costly annotated corpora, and most stop at coarse-grained recognition. To address these problems, this study redefines the categories of government-document entities for information processing and annotates a fine-grained corpus. Active learning and distant supervision are then used to optimize the entity recognition model. Experiments show that the method recognizes entities at a finer granularity with an F1 score above 87%, while lowering corpus requirements and reducing the annotation workload by about 60%.
Authors
俞敬松
吴聪
曹喜信
YU Jingsong; WU Cong; CAO Xixin (School of Software & Microelectronics, Peking University, Beijing 100871, China)
Source
《微纳电子与智能制造》
2020, Issue 3, pp. 23-29 (7 pages)
Micro/nano Electronics and Intelligent Manufacturing
Funding
Supported by the Brain-Inspired Visual Processing Technology Fund (YBN2018085207).
Keywords
named entity recognition
active learning
pre-trained language models
government documents
distant supervision