Abstract
Term extraction is a key technology for knowledge mining in academic literature, and research on it aims to improve the efficiency of extracting domain terms from such literature. Current term extraction methods fall into three categories: rule-based, statistical, and supervised learning methods. First, this paper presents an experimental comparison of representative term extraction methods, including linguistic methods, statistical methods (TF-IDF, C-value, KL divergence-based methods, etc.), CRF, and Bi-LSTM. Second, to address the lack of large manually annotated corpora for term extraction from academic literature, it proposes an improved model for this task. Finally, it summarizes the experimental findings and proposes an approach to term extraction and corpus annotation for academic literature at the current stage.
Author
蒋婷
Jiang Ting (School of Information Engineering, Nanjing University of Finance and Economics, Nanjing, 210046)
Source
《信息资源管理学报》
CSSCI
2021, Issue 1, pp. 112-122 (11 pages)
Journal of Information Resources Management
Funding
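National Natural Science Foundation of China, Young Scientists Fund (71904078)
Natural Science Foundation of Jiangsu Province (BK20190793)
Philosophy and Social Science Research Fund of Jiangsu Higher Education Institutions (2018SJA0263); this paper is one of the research outcomes of these projects.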
Keywords
Semantic web
Research article
Term extraction
Knowledge graph
Corpus annotation
Concept learning