The Construction of Move Recognition Corpus for Project Application Abstract (项目申请书摘要文本的语步识别语料构建)

Cited by: 1

Abstract: [Purpose/Significance] Automatically recognizing the scientific elements in project application abstracts is important for revealing the scientific knowledge contained in science and technology projects. Recognizing these elements relies on structured project abstract text, but structured project abstract corpora are currently scarce, which seriously restricts further research. This paper therefore constructs a move corpus of project application abstracts to provide data support for related research. [Method/Process] First, the content of project abstracts was grouped into four move types: background and problem, objective and task, methodological content, and value and significance. The characteristic markers that appear in each move were summarized, and a move annotation specification was drawn up. Second, rule-based and then deep learning-based methods were used in succession to assist manual annotation of the move structure of project abstracts, and the quality of the corpus was evaluated after each annotation round. [Result/Conclusion] The two methods together annotated nearly 25,000 sentences, and the annotation consistency coefficient reached 0.9839, indicating that the corpus can largely distinguish the different move structures within project abstracts and preliminarily meets the basic requirements of corpus construction.
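The rule-assisted annotation and the consistency check described in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the English cue patterns, the names `CUE_PATTERNS`, `tag_move`, and `cohen_kappa`, and the choice of Cohen's kappa as the agreement measure. The record does not state the paper's actual rules or which consistency coefficient produced 0.9839.

```python
import re
from collections import Counter

# Hypothetical cue patterns for the four move types named in the abstract.
# They are illustrative assumptions, not the paper's annotation rules.
CUE_PATTERNS = {
    "background_problem": re.compile(
        r"\b(however|lack of|remains unclear|in recent years)\b", re.I),
    "objective_task": re.compile(
        r"\b(aims? to|intends? to|the goal of)\b", re.I),
    "method_content": re.compile(
        r"\b(we will|by using|based on|is used to)\b", re.I),
    "value_significance": re.compile(
        r"\b(significance|provides? support|lays? a foundation)\b", re.I),
}


def tag_move(sentence):
    """Return the first move whose cue pattern matches, or None if no rule fires."""
    for move, pattern in CUE_PATTERNS.items():
        if pattern.search(sentence):
            return move
    return None  # left for a human annotator to decide


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (assumed metric)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

In a workflow of this kind, `tag_move` would only propose a label for each sentence, annotators would confirm or correct it, and an agreement score such as `cohen_kappa` would be computed on the sentences that two annotators labeled independently.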

Authors: Zhao Yang (赵旸); Zhang Zhixiong (张智雄); Li Jie (李婕) (National Science Library, Chinese Academy of Sciences, Beijing 100190; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190)

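Affiliations: National Science Library, Chinese Academy of Sciences; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences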

Source: Library and Information Service (《图书情报工作》), indexed in CSSCI and the Peking University Core Journals list, 2022, No. 21, pp. 97-106 (10 pages)

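Funding: This work is one of the research outcomes of the subproject "Construction of an Artificial Intelligence (AI) Engine Based on Scientific and Technical Literature Knowledge" (project no. E0290906) under the Chinese Academy of Sciences special program for literature and information capacity building.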

Keywords: move recognition; project application abstract; move corpus construction; iterative annotation

Classification numbers: G202 [Cultural Science: Communication Studies]; G203 [Cultural Science: Communication Studies]