Abstract
[Objective] This paper aims to extract citation content accurately, replacing manual or simplistic citation extraction methods and improving the results of citation content analysis. [Methods] We divided the citation content extraction task into three parts: citation sentences, citation context, and citation metadata. Drawing on coreference resolution theory, we used machine learning and a hierarchical filtering method to extract the citation context. [Results] The experimental data consisted of Chinese journal articles that use the sequential numbering citation system. The method extracted citation sentences and parsed the references correctly, and the F1 value for identifying the citation context ranged from 0.780 to 0.849. [Limitations] Because Chinese scientific citation corpora are scarce, the experiments used a small, manually annotated dataset; the method's cross-domain ability is therefore limited, and it depends on the text domain. [Conclusions] This study optimizes the steps and broadens the scope of citation content analysis, and provides a reference for researchers who use this approach.
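To make the task division concrete, the following is a minimal Python sketch of two of the three subtasks under a sequential numbering citation style: finding citation sentences and linking their markers to reference entries. It is not the authors' implementation. The marker pattern, the function names `expand_marker` and `extract_citation_sentences`, and the sample data are all assumptions made for illustration, and the coreference-based identification of citation context is not covered.

```python
import re

# Assumed marker format: numeric markers such as [1], [2,3], or [4-6].
MARKER = re.compile(r"\[(\d+(?:\s*[-,]\s*\d+)*)\]")

# Assumed sentence boundaries: after Chinese terminators, or after
# Western terminators followed by whitespace (so "0.780" is not split).
SENTENCE_SPLIT = re.compile(r"(?<=[。!?])|(?<=[.!?])\s+")


def expand_marker(body):
    """Expand a marker body such as '2,4-6' into [2, 4, 5, 6]."""
    numbers = []
    for part in body.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


def extract_citation_sentences(text, references):
    """Return (sentence, cited numbers, reference entries) for each
    sentence that contains at least one numeric citation marker."""
    results = []
    for sentence in SENTENCE_SPLIT.split(text):
        sentence = sentence.strip()
        if not sentence:
            continue
        numbers = []
        for match in MARKER.finditer(sentence):
            numbers.extend(expand_marker(match.group(1)))
        if numbers:
            # Numbers missing from the reference list are skipped.
            entries = [references[n] for n in numbers if n in references]
            results.append((sentence, numbers, entries))
    return results


if __name__ == "__main__":
    # Hypothetical sample text and reference list.
    sample = ("Deep models are widely used [1]. "
              "Earlier work relied on rules [2,4-5]. "
              "This sentence has no citation.")
    refs = {1: "Ref A", 2: "Ref B", 4: "Ref C", 5: "Ref D"}
    for sentence, numbers, entries in extract_citation_sentences(sample, refs):
        print(numbers, entries, sentence)
```

Sentences without a marker are dropped here; in the approach the abstract describes, such neighbouring sentences are the candidates that citation context identification would examine.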
Authors
Tan Ying (谭荧)
Tang Yifei (唐亦非)
Tan Ying; Tang Yifei (School of Public Administration, Hubei University, Wuhan 430062, China; School of Information Management, Central China Normal University, Wuhan 430079, China)
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
CSSCI
CSCD
Peking University Core Journal (北大核心)
2021, Issue 8, pp. 25-33 (9 pages)
Funding
This work is one of the research outputs of a Major Program of the National Social Science Fund of China (Grant No. 19ZDA345).
Keywords
Information Extraction (信息抽取)
Coreference Resolution (指代消解)
Citation Content (引文内容)
Citation Context (引文上下文)