
A Study on Chinese Terminology Recognition of Theory and Method from Information Science: Based on Deep Learning

Original title: 基于深度学习的情报学理论及方法术语识别研究 (cited by: 20)
Abstract: Research on theories and methods drives the continued development of a discipline, so understanding how theories and methods are currently applied and how they are evolving in a field is important work. This paper uses terminology recognition, a branch of the named entity recognition task, to study the theories and methods of information science. Approximately 20,000 papers from the field of information science in China over the past 20 years were collected, and the deep learning model Bi-LSTM-CRFs was trained and tested on this large-scale corpus. The experiments verify the feasibility of the approach and examine how each experimental variable affects model performance, with the aim of maximizing recognition performance. The results show that, for complex entities such as theory and method terms, corpora segmented into words yield better recognition than corpora segmented into characters. Term length also affects recognition: performance drops markedly for long terms (6 or more characters). Recognition performance is positively correlated with the amount of training corpus; the larger the corpus, the better the recognition. The type and number of entities directly affect the results, and entities with distinctive word-formation features are recognized better. In the feature introduction experiments, all features except pinyin (part of speech, word length, and word vectors) improved the F1 value, with word vectors and part of speech giving the most pronounced gains.
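The contrast between character-segmented and word-segmented corpora, and the F1 value used to compare them, can be illustrated with a small sketch. This is not the authors' code: the example sentence, its word segmentation, the `TERM` label, and the helper names `bio_tags`, `extract_spans`, and `entity_f1` are all assumptions chosen for illustration, and the BIO scheme and exact-match entity scoring are common conventions rather than details confirmed by the abstract.

```python
def bio_tags(tokens, term_spans):
    """Assign B/I/O tags to tokens, given term spans as [start, end) character offsets."""
    tags, pos = [], 0
    for tok in tokens:
        start, end = pos, pos + len(tok)
        tag = "O"
        for s, e in term_spans:
            if s <= start and end <= e:  # token lies fully inside a term
                tag = "B-TERM" if start == s else "I-TERM"
                break
        tags.append(tag)
        pos = end
    return tags


def extract_spans(tags):
    """Collect entity spans as (start, end) token indices from a BIO tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                spans.add((start, i))
                start = None
            if tag.startswith("B-"):
                start = i
    return spans


def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: a predicted term counts only if its boundaries match exactly."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence: "this paper adopts the social network analysis method".
sentence = "本文采用社会网络分析方法"
term_spans = [(4, 12)]  # the 8-character term 社会网络分析方法

char_tokens = list(sentence)                                 # character segmentation
word_tokens = ["本文", "采用", "社会", "网络", "分析", "方法"]  # assumed word segmentation

char_tags = bio_tags(char_tokens, term_spans)
word_tags = bio_tags(word_tokens, term_spans)

# A prediction that truncates the term by one word scores zero under exact matching.
truncated_pred = ["O", "O", "B-TERM", "I-TERM", "I-TERM", "O"]
f1_truncated = entity_f1(word_tags, truncated_pred)
```

In this example the same term spans eight character-level tags but only four word-level tags, so a character-based model has more boundary decisions to get right for one entity. That is one plausible reading of why long terms are harder to recognize, offered here as an illustration rather than as the paper's explanation.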
Authors: Wang Hao (王昊); Deng Sanhong (邓三鸿); Su Xinning (苏新宁); Guan Qin (官琴) (School of Information Management, Nanjing University, Nanjing 210023; Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210093)
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI, CSCD, and the Peking University Core Journals list), 2020, No. 8, pp. 817-828 (12 pages)
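Funding: Major Bidding Project of the National Social Science Fund of China, "Research on the Disciplinary Construction of Information Science and the Future Development Path of Information Work" (17ZDA291); the "Jiangsu Young Social Science Talents" talent development program; the "Nanjing University Zhongying Young Scholars" talent development program.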
Keywords: information science; terminology recognition; deep learning; Bi-LSTM-CRFs model