Abstract
Distribution network systems store a large volume of unused equipment defect texts, which can be mined and put to use with named entity recognition. To address the low efficiency of manually annotating power equipment defect texts and the difficulty of recognizing entities in this specialized domain, this paper proposes a new annotation strategy and a named entity recognition model based on Bert-CRF (Bidirectional Encoder Representations from Transformers-Conditional Random Fields). First, BIO (Begin, Internal, Other) annotation based on semi-supervised learning reduces the share of manual annotation and speeds up labeling. Next, the Bert pre-trained model produces dynamic word vectors that carry rich semantic information. Finally, a CRF layer constrains the predicted labels. The proposed model was evaluated in comparative experiments on a self-built dataset of defect texts for primary distribution network equipment, which contains 9186 text records spanning 12 major categories and 25 subcategories. The model reached a precision of 97.85%, a recall of 97.36%, and an F1 value of 97.34%, outperforming the other five models in the comparison.
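To make the labeling scheme concrete, the following is a minimal Python sketch of character-level BIO tagging and of the kind of label constraint a CRF layer is meant to enforce. It is not the authors' implementation. The sample phrase, the label names `EQUIP` and `DEFECT`, and all function names are hypothetical. A real CRF learns transition scores from data rather than applying a hard-coded rule as done here.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label); hypothetical format


def bio_tags(chars: List[str], spans: List[Span]) -> List[str]:
    """Turn entity spans into one BIO tag per character."""
    tags = ["O"] * len(chars)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags


def is_valid_transition(prev: str, curr: str) -> bool:
    """An I-X tag may only follow B-X or I-X; every other move is allowed."""
    if curr.startswith("I-"):
        label = curr[2:]
        return prev in (f"B-{label}", f"I-{label}")
    return True


def is_valid_sequence(tags: List[str]) -> bool:
    """Check a whole tag sequence, treating the start of the text as 'O'."""
    prev = "O"
    for tag in tags:
        if not is_valid_transition(prev, tag):
            return False
        prev = tag
    return True


# Hypothetical defect phrase: "变压器漏油" (transformer oil leak).
chars = list("变压器漏油")
spans = [(0, 3, "EQUIP"), (3, 5, "DEFECT")]
tags = bio_tags(chars, spans)
print(list(zip(chars, tags)))
print(is_valid_sequence(tags))
```

A sequence such as `["O", "I-EQUIP"]` fails the check because an inside tag appears without a preceding begin tag. An independent per-token classifier can emit that kind of output, which is why a CRF layer is commonly added on top of the encoder.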
Authors
刘雨可
周申培
石英
杜家宝
LIU Yu-ke; ZHOU Shen-pei; SHI Ying; DU Jia-bao (School of Automation, Wuhan University of Technology, Wuhan 430070, China)
Source
《武汉理工大学学报》
CAS
2022, No. 10, pp. 93-101 (9 pages)
Journal of Wuhan University of Technology
Funding
National Natural Science Foundation of China (52105528)