Extracting Fine-grained Knowledge Units from Texts with Deep Learning (cited by: 35)
Abstract [Objective] This paper improves the bootstrapping method and builds a deep learning model to extract multiple types of fine-grained knowledge units from texts. [Methods] First, we built a lexicon for each type of knowledge unit using a search engine and Elsevier keywords. Second, we automatically constructed a large annotated corpus with the bootstrapping technique, using a knowledge unit scoring model and a pattern scoring model to control annotation quality. Finally, we trained an LSTM-CRF model on the corpus annotated with multiple types of knowledge units and used it to extract new knowledge units from texts. [Results] From the abstracts of 17,756 ACL papers, we extracted four types of knowledge units: research scope, research method, experimental data, and evaluation metrics with their values. The average precision under manual evaluation was 91%. [Limitations] Presetting and tuning the model parameters requires human involvement, and the method's applicability to texts from other domains has not been verified. [Conclusions] Introducing scoring models for knowledge units and patterns effectively mitigates the "semantic drift" problem. Extracting knowledge units with a deep learning model is fast and precise, providing an efficient and reliable means of data acquisition for intelligent analysis of big data in intelligence work.
Authors Yu Li (余丽); Qian Li (钱力); Fu Changlei (付常雷); Zhao Huaming (赵华茗) (National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China; State Key Laboratory of Resources and Environmental Information System, Beijing 100101, China)
Source Data Analysis and Knowledge Discovery (《数据分析与知识发现》), indexed in CSSCI, CSCD, and the Peking University Core Journals list, 2019, Issue 1, pp. 38-45 (8 pages)
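Funding This work is one of the research outcomes of the National Natural Science Foundation of China project "Annotation and Evaluation of Semantic Relations between Geographic Entities in Chinese Web Texts" (No. 41801320), the National Social Science Fund of China project "Deep Integration and Disclosure of Resources Based on Open Access Academic Journals" (No. 16BTQ025), and the Youth Innovation Team project of the National Science Library, Chinese Academy of Sciences, "Research Fingerprint Identification Methods Based on Machine Learning" (No. 馆1724).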
Keywords Knowledge Unit Extraction; Named Entity Recognition; Deep Learning; Bootstrapping; LSTM-CRF
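
The abstract describes bootstrapping with pattern and knowledge unit scoring to limit semantic drift. The loop below is a minimal sketch of that general idea, not the authors' implementation: the (left word, right word) pattern form, both scoring formulas, the thresholds `min_pattern` and `min_unit`, and the toy sentences are all assumptions made for illustration, and the LSTM-CRF training step is omitted.

```python
from collections import defaultdict


def find_patterns(sentences, lexicon):
    """Return the (left word, right word) contexts that surround known units."""
    patterns = set()
    for tokens in sentences:
        for i in range(1, len(tokens)):
            for unit in lexicon:
                j = i + len(unit)
                if j < len(tokens) and tuple(tokens[i:j]) == unit:
                    patterns.add((tokens[i - 1], tokens[j]))
    return patterns


def apply_pattern(pattern, sentences, max_len=3):
    """Return every span of 1..max_len tokens enclosed by the pattern."""
    left, right = pattern
    found = set()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != left:
                continue
            for n in range(1, max_len + 1):
                j = i + 1 + n
                if j < len(tokens) and tokens[j] == right:
                    found.add(tuple(tokens[i + 1:j]))
    return found


def bootstrap(sentences, seeds, rounds=5, min_pattern=0.5, min_unit=0.5):
    """Grow the lexicon, keeping only well-scored patterns and units."""
    lexicon = set(seeds)
    for _ in range(rounds):
        # Assumed pattern score: share of a pattern's extractions already known.
        kept = []
        for pattern in find_patterns(sentences, lexicon):
            hits = apply_pattern(pattern, sentences)
            if not hits:
                continue
            score = len(hits & lexicon) / len(hits)
            if score >= min_pattern:
                kept.append((score, hits))
        # Assumed unit score: mean score of the kept patterns that extracted it.
        votes = defaultdict(list)
        for score, hits in kept:
            for unit in hits - lexicon:
                votes[unit].append(score)
        new_units = {u for u, s in votes.items() if sum(s) / len(s) >= min_unit}
        if not new_units:
            break
        lexicon |= new_units
    return lexicon


# Toy data (hypothetical): "results on ... are" is a noisy pattern.
sentences = [s.split() for s in [
    "we use conditional random fields for tagging",
    "we use support vector machines for parsing",
    "we use maximum entropy for classification",
    "results on conditional random fields are strong",
    "results on monday are strong",
    "results on tuesday are strong",
]]
seeds = {("conditional", "random", "fields"), ("support", "vector", "machines")}
result = bootstrap(sentences, seeds)
print(sorted(" ".join(unit) for unit in result))
```

In this toy run the "use ... for" context scores above the threshold and contributes a new unit, while the "results on ... are" context mostly extracts unknown spans, scores below the threshold, and is discarded. That is the drift-control effect the scoring step is meant to provide.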