Abstract
To address the challenges that the specialized and interdisciplinary nature of electric power science and technology texts poses for knowledge acquisition, this paper proposes a domain language model for electric power science and technology that yields more accurate text representations. A Transformer-based language model is pre-trained on a large corpus of electric power science and technology papers, patents, projects, and other texts. Two knowledge extraction tasks are designed to validate the model: electric power science and technology term classification and distantly supervised entity relation extraction. Experimental results show that the proposed domain language model improves the F1 score on the term classification task by more than 10% over the word2vec baseline model, and improves the AUC score on the entity relation extraction task by about 2% over the BERT baseline model. The proposed model thus helps provide higher-quality feature representations for downstream knowledge acquisition tasks.
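For readers unfamiliar with the F1 score reported for the term classification task, the following is a minimal sketch of how a multi-class F1 score can be computed. The abstract does not say whether the score is micro- or macro-averaged, so macro averaging is an assumption here, and the function name `macro_f1` and the toy labels are hypothetical illustrations rather than the authors' evaluation code or data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores.

    Assumption: macro averaging is used only for illustration; the
    abstract does not state which averaging the paper applies.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this class
        else:
            fp[pred] += 1       # predicted class gains a false positive
            fn[truth] += 1      # true class gains a false negative

    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels for term categories (not data from the paper).
gold = ["equipment", "equipment", "method", "method"]
predicted = ["equipment", "method", "method", "method"]
score = macro_f1(gold, predicted)
```

Comparing this score for two feature extractors on the same labeled term set is one way to obtain the kind of baseline comparison the abstract describes.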
Authors
徐翀
王其清
XU Chong; WANG Qiqing (State Grid Energy Research Institute Co., Ltd., Changping District, Beijing 102209, China)
Source
《电力信息与通信技术》
2023, No. 4, pp. 31-36 (6 pages)
Electric Power Information and Communication Technology
Funding
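Supported by the State Grid Corporation of China Headquarters Science and Technology Project "Research and Development of Knowledge Graph-Based Intelligent Selection Technology for Science and Technology Consulting Experts" (1400-202057269A-0-0-00).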
Keywords
electric power technology
knowledge acquisition
language model
natural language processing