
Technology Entity Extraction of Patent Literature with Limited Annotated Data (面向标注数据稀缺专利文献的科技实体抽取) Cited by: 4
Abstract: Technology entities in patents are the terms in patent documents that carry rich scientific and technical information. Extracting them accurately is important both for researchers seeking to work more efficiently and for enterprises planning their patent portfolios. This paper proposes a technology entity extraction method that combines a semi-supervised learning framework with a named entity recognition model. Semi-supervised learning exploits unlabeled data to compensate for the scarcity of annotated data. In addition, a generic-domain BERT model is further pre-trained on a large patent corpus to obtain BERT-Patent, a BERT model adapted to the patent domain, which improves the extraction of technology entities from patents. Experiments show that the proposed method improves accuracy, recall, and F1 by 6.37%, 2.99%, and 4.63%, respectively, on the patent dataset, and by 2.87%, 1.24%, and 2.07%, respectively, on the People's Daily dataset.
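The abstract does not say which semi-supervised framework the authors use. As an illustration only, the sketch below shows one common variant, self-training with pseudo-labels: a tagger trained on the small labeled set labels the unlabeled sentences, and only confident predictions are added back as training data. Everything here is an assumption made for the example. The names `self_train` and `train_lexicon_tagger` are hypothetical, the toy lexicon tagger merely stands in for a BERT-based NER model, and the `B-TECH`/`I-TECH` tag names and the 0.75 threshold are illustrative.

```python
from collections import Counter, defaultdict


def train_lexicon_tagger(data):
    """Toy stand-in for an NER model (hypothetical, not the paper's model).

    Memorizes the most frequent tag of each token. Confidence is the share
    of tokens in a sentence that were seen during training.
    """
    counts = defaultdict(Counter)
    for sentence, tags in data:
        for token, tag in zip(sentence, tags):
            counts[token][tag] += 1
    lexicon = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

    def predict(sentence):
        tags = [lexicon.get(tok, "O") for tok in sentence]
        known = sum(tok in lexicon for tok in sentence)
        confidence = known / len(sentence) if sentence else 0.0
        return tags, confidence

    return predict


def self_train(train_fn, labeled, unlabeled, threshold=0.75, rounds=3):
    """Generic self-training loop: pseudo-label, filter by confidence, retrain."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train_fn(labeled)
    for _ in range(rounds):
        confident, remaining = [], []
        for sentence in pool:
            tags, confidence = model(sentence)
            if confidence >= threshold:
                confident.append((sentence, tags))
            else:
                remaining.append(sentence)
        if not confident:
            break  # nothing new to learn from, stop early
        labeled.extend(confident)
        pool = remaining
        model = train_fn(labeled)
    return model, labeled, pool


# Made-up example data: one annotated sentence, two unannotated ones.
labeled = [(["the", "lithium", "battery", "is", "charged"],
            ["O", "B-TECH", "I-TECH", "O", "O"])]
unlabeled = [["the", "lithium", "battery", "is", "sealed"],
             ["a", "graphene", "anode", "is", "coated"]]

model, grown, leftover = self_train(train_lexicon_tagger, labeled, unlabeled)
```

In this toy run the first unlabeled sentence shares four of its five tokens with the labeled data, so it is pseudo-labeled and added to the training set. The second sentence stays in the unlabeled pool because the tagger is not confident about it. A real system would replace the lexicon tagger with a trained sequence labeler and use a model-based confidence score.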
Authors: YUAN Zhi'an (原之安); PENG Furong (彭甫镕); GU Bo (谷波); QIAN Yuhua (钱宇华). Affiliations: Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan 030006, China; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China; School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
Source: Journal of Zhengzhou University: Natural Science Edition (《郑州大学学报(理学版)》), PKU Core Journal (北大核心), 2021, No. 4, pp. 61-68 (8 pages)
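Funding: National Natural Science Foundation of China (61672332); Key Research and Development Program of Shanxi Province (201903D421003); Scientific and Technological Achievements Transformation and Cultivation Project of the Education Department of Shanxi Province (2020CG001).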
Keywords: technology entity; patent mining; data scarcity; BERT; semi-supervised learning
