Abstract
Technology entities in patents are the terms in patent documents that carry dense scientific and technical information. Extracting them accurately is crucial both for researchers seeking to improve research efficiency and for enterprises planning their patent portfolios. This paper proposes a technology entity extraction method that combines a semi-supervised learning framework with a named entity recognition model. Semi-supervised learning exploits unlabeled data to compensate for the scarcity of annotated data. In addition, the general-domain BERT model is further pre-trained on a large patent corpus to obtain BERT-Patent, a BERT model adapted to the patent domain, which effectively improves the extraction of technology entities from patents. Experimental results show that the proposed method improves accuracy, recall, and F1 score by 6.37%, 2.99%, and 4.63%, respectively, on the patent dataset, and by 2.87%, 1.24%, and 2.07%, respectively, on the People's Daily dataset.
Authors
YUAN Zhi'an (原之安)
PENG Furong (彭甫镕)
GU Bo (谷波)
QIAN Yuhua (钱宇华)
Affiliations: Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan 030006, China; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China; School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
Source
Journal of Zhengzhou University: Natural Science Edition (《郑州大学学报(理学版)》)
Peking University Core Journal (北大核心)
2021, Issue 4, pp. 61-68 (8 pages in total)
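Funding
National Natural Science Foundation of China (61672332)
Key Research and Development Program of Shanxi Province (201903D421003)
Scientific and Technological Achievements Transformation and Cultivation Project of the Education Department of Shanxi Province (2020CG001)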
Keywords
technology entity
patent mining
data scarcity
BERT
semi-supervised learning