Detection and Comparative Study of Differences Between AI-Generated and Scholar-Written Chinese Abstracts

Original title: AI生成与学者撰写中文论文摘要的检测与差异性比较研究

Cited by: 9

Abstract [Research purpose] This study empirically investigates methods for detecting whether a Chinese paper abstract was generated by AI or written by a scholar, and analyzes the differences in textual features between the two kinds of abstract, providing a reference for the automatic detection of AI-generated text and related research.

[Research method] First, taking 100 highly cited papers in library science as an example, corresponding abstracts are generated from the paper titles with the GPT-4 large model to construct the analysis dataset. Next, supervised machine learning models and deep pre-trained models are used to classify GPT-4-generated and scholar-written abstracts, and plagiarism-detection software is used to measure content duplication rates. Finally, the similarities and differences between AI-generated and scholar-written abstracts are examined along dimensions such as abstract length, sentence count, lexical features, and common collocations.

[Research conclusion] Classifiers built on the training corpus can effectively identify whether a Chinese paper abstract was generated by AI: logistic regression, the ensemble learning models (RF, LightGBM), and the BERT model all achieve F1-scores above 90%. AI-generated abstracts are highly homogeneous, show strong logical structure in their writing, and habitually use academic discourse patterns such as induction and summarization. Scholar-written abstracts, by contrast, show significant individual differences, use more collocations that convey concrete meaning, and frequently use words closely related to national policies.
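Two of the quantities mentioned in the abstract, the surface features (abstract length, sentence count) and the F1-score used to evaluate the classifiers, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `surface_features` and `f1_score`, the choice of sentence-ending punctuation, and the use of character count as "length" are all assumptions made for illustration, since the abstract does not say how these were measured.

```python
import re

# Assumed sentence terminators (Chinese and Western). The abstract does not
# specify how sentences were segmented, so this is an illustrative choice.
SENTENCE_END = re.compile(r"[。！？!?]+")


def surface_features(abstract: str) -> dict:
    """Return character length and sentence count for one abstract."""
    text = abstract.strip()
    # Split on terminators and discard empty fragments.
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    return {"length": len(text), "sentences": len(sentences)}


def f1_score(y_true, y_pred, positive=1) -> float:
    """F1 = 2PR / (P + R) for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the study.
    print(surface_features("本文研究检测方法。结果表明分类器有效！"))
    # Label 1 = AI-generated, 0 = scholar-written (an assumed encoding).
    print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

In practice a library implementation such as scikit-learn's `sklearn.metrics.f1_score` would normally be used for evaluation; the hand-written version above only makes the definition explicit.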

Authors: Wang Yibo (王一博); Guo Xin (郭鑫); Liu Zhifeng (刘智锋); Wang Jimin (王继民) (Department of Information Management, Peking University, Beijing 100871; Peking University Library, Beijing 100871)

Affiliations: Department of Information Management, Peking University; Peking University Library

Source: Journal of Intelligence (《情报杂志》), PKU Core Journal, 2023, No. 9, pp. 127-134 (8 pages)

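Funding: This work is an outcome of the National Social Science Fund of China Key Project "Research on Key Issues and Platform Construction for Unified Discovery of Open Scientific Datasets" (No. 20ATQ007).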

Keywords: library science; AIGC; GPT-4; paper abstract; abstract detection; text classification

Classification code: G353 [Culture and Science: Information Science]