Abstract
High-quality annotated data are crucial for Natural Language Processing (NLP) tasks in the field of Chinese scientific literature. To address the lack of high-quality annotated corpora and the inconsistency and inefficiency of manual annotation in this field, an annotation method based on a Large Language Model (LLM) was proposed. First, a fine-grained annotation specification suitable for multi-domain Chinese scientific literature was established, clarifying the entity types and the annotation granularity. Second, a structured text annotation prompt template and a generation parser were designed: the annotation task was framed as a single-stage, single-round question-answering process in which the annotation specification and the text to be annotated were filled into the corresponding slots of the template to construct the task prompt. The prompt was then injected into the LLM, which generated output text containing the annotation information, and the parser converted this output into structured annotation data. Finally, using LLM-based prompt learning, the Annotated Chinese Scientific Literature (ACSL) entity dataset was generated, containing 10,000 annotated documents and 72,536 annotated entities distributed across 48 disciplines, and three baseline models based on RoBERTa-wwm-ext, a configuration of the Robustly optimized Bidirectional Encoder Representations from Transformers (RoBERTa) approach, were proposed for ACSL. The experimental results demonstrate that the BERT+Span model performs best on long-span entity recognition in Chinese scientific literature, achieving an F1 value of 0.335. These results serve as benchmarks for future research.
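The single-stage, single-round pipeline described above (fill the annotation specification and the target text into template slots, send the prompt to an LLM, then parse the generated text into structured records) can be sketched as follows. This is a minimal illustrative sketch only: the template wording, the JSON output format, and the names `build_prompt` and `parse_annotations` are assumptions, not the authors' actual implementation, and the LLM call itself is omitted.

```python
# Hypothetical sketch of prompt construction and output parsing for
# LLM-based entity annotation; all names and formats are illustrative
# assumptions, not the method's actual code.
import json

PROMPT_TEMPLATE = (
    "You are an annotator for Chinese scientific literature.\n"
    "Annotation specification:\n{spec}\n"
    "Text to annotate:\n{text}\n"
    'Return a JSON list of objects with keys "entity", "type", "start", "end".'
)

def build_prompt(spec: str, text: str) -> str:
    """Fill the annotation specification and the text to be annotated
    into the template slots, forming the single-round task prompt."""
    return PROMPT_TEMPLATE.format(spec=spec, text=text)

def parse_annotations(llm_output: str):
    """Parse the model's output text into structured annotation tuples.
    Assumes the model returned a JSON list as instructed."""
    records = json.loads(llm_output)
    return [(r["entity"], r["type"], r["start"], r["end"]) for r in records]
```

In practice the parser would also need to handle malformed model output (e.g. retrying or discarding unparseable responses) before the records are aggregated into a dataset.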
Authors
YANG Dongju; HUANG Juntao (School of Information, North China University of Technology, Beijing 100144, China; Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China)
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journal List
2024, Issue 9, pp. 113-120 (8 pages)
Funding
Key Program of the National Natural Science Foundation of China (61832004)
Guangzhou Science and Technology Plan Project - Key R&D Program (202206030009)
Keywords
text annotation method
Chinese scientific literature
Large Language Model (LLM)
prompt learning
information extraction