Abstract
The power industry accumulates a large volume of power business data. Automatically mining the information in this data can strengthen the business capabilities of the relevant departments and reduce the industry's substantial operation and maintenance costs. Most power business data, however, is unstructured, voluminous, and heterogeneous. To address the problem of extracting structured information from unstructured text in power business data, this paper designs a Transformer-based named entity recognition model that fuses lexicon and bigram features. In the model, a BERT pre-trained language model fused with multiple features produces the word embedding representation, and a Transformer model and a conditional random field (CRF) serve as the encoding layer and the decoding layer, respectively, to perform named entity recognition in the power grid domain. Experiments on power grid domain text show that the proposed model achieves an accuracy of 93.62% in entity type recognition, outperforming traditional named entity recognition methods. Ablation experiments further demonstrate the effectiveness of the proposed method.
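As a rough illustration of two ingredients named in the abstract, the sketch below extracts character bigram features and decodes a tag sequence with the Viterbi algorithm, the standard decoder for a linear-chain CRF. It is not the authors' implementation: the BERT embedding and the Transformer encoder are omitted, the emission and transition scores are hand-written stand-ins for what a trained model would produce, and the names `char_bigrams` and `viterbi`, the pad token, and the BIO tag set are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def char_bigrams(text: str, pad: str = "</s>") -> List[str]:
    """Pair each character with its successor (assumed scheme).

    The last character is paired with a pad token so that the output
    has one bigram per input character.
    """
    chars = list(text)
    return [
        chars[i] + (chars[i + 1] if i + 1 < len(chars) else pad)
        for i in range(len(chars))
    ]


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag path under a linear-chain score.

    emissions[i][t] is the score of tag t at position i;
    transitions[(p, t)] is the score of moving from tag p to tag t
    (missing pairs default to 0.0).
    """
    if not emissions:
        return []
    score = {t: emissions[0][t] for t in tags}
    backpointers: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointer: Dict[str, str] = {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: score[p] + transitions.get((p, cur), 0.0))
            new_score[cur] = score[prev] + transitions.get((prev, cur), 0.0) + emit[cur]
            pointer[cur] = prev
        score = new_score
        backpointers.append(pointer)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: score[t])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for a three-character input; a trained encoder
# would supply the emissions and a trained CRF the transitions.
TAGS = ["O", "B", "I"]
TRANSITIONS = {("O", "I"): -1e4}  # discourage an I tag right after O
EMISSIONS = [
    {"O": 0.5, "B": 0.4, "I": 0.0},
    {"O": 0.1, "B": 0.3, "I": 1.0},
    {"O": 1.0, "B": 0.0, "I": 0.2},
]

bigrams = char_bigrams("变压器")
best_path = viterbi(EMISSIONS, TRANSITIONS, TAGS)
```

In this toy input the second position prefers `I` on emissions alone, so the transition penalty is what steers the first position toward `B` rather than `O`. This is the kind of label-consistency constraint a CRF decoding layer contributes on top of per-token scores.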
Authors
LI Yan (李妍)
MENG Jie (孟洁)
HE Jin (何金)
ZHANG Xu (张旭)
WANG Zishuo (王梓蒴)
Affiliations: Information and Communication Company, State Grid Tianjin Electric Power Company, Tianjin 300010, China; Key Laboratory of Energy Big Data Simulation of Tianjin Enterprise, Tianjin 300010, China
Source
Electric Power Information and Communication Technology (《电力信息与通信技术》)
2022, No. 4, pp. 24-31 (8 pages)
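Funding
Supported by the Science and Technology Project of State Grid Corporation of China Headquarters, "Research and Application of Natural Language Understanding Modeling for Power Business" (KJ20-1-15).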