Research on Automatic Classification of Literature Resources Based on Generative Large Language Model (基于生成式大语言模型的文献资源自动分类研究)

Abstract [Purpose/significance] This study explores effective methods for improving automatic hierarchical classification and cross-language classification of literature resources. [Method/process] Literature resource classification is treated as a classification code generation task. Training and test sets are constructed from library cataloging data, large language models such as ChatGLM 3 and Llama 2 are fine-tuned on the training set with parameter-efficient methods, and their classification performance is analyzed on Chinese and English test sets. [Result/conclusion] Among the output formats compared, fine-tuning the large language model to output the classification code directly yields the best classification performance. Classification performance improves steadily as the number of training samples increases. A large language model fine-tuned on 22,000 samples reaches accuracies of 0.8848 on first-level categories and 0.5076 on complete classification codes of the Chinese Library Classification, outperforming general-purpose large language models. Large language models trained on Chinese literature can effectively classify English literature, with performance only slightly lower than on Chinese literature. A small number of the classification codes generated by the large language models are not valid Chinese Library Classification codes.
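The abstract reports accuracy at two levels (first-level category and complete classification code) and notes that some generated codes are invalid. A minimal Python sketch of how such figures could be computed is given here. It is not the authors' evaluation code. It assumes that the first-level category is the leading letter of the code, that a complete code counts as correct only on exact string match, and that validity is checked against a lookup set. The function names, the toy predictions, and the tiny `valid_codes` set are all hypothetical; only `TP391.1` and `G254.1` are taken from this record.

```python
from typing import Iterable, Sequence


def first_level(code: str) -> str:
    """Return the first-level category, assumed to be the leading letter."""
    return code.strip()[:1].upper()


def accuracy(preds: Sequence[str], golds: Sequence[str], level: str = "full") -> float:
    """Share of predictions that match the gold code at the given level.

    level="first" compares only the leading letter;
    level="full" requires an exact match of the whole code (an assumption).
    """
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and of equal length")
    if level == "first":
        hits = sum(first_level(p) == first_level(g) for p, g in zip(preds, golds))
    elif level == "full":
        hits = sum(p.strip() == g.strip() for p, g in zip(preds, golds))
    else:
        raise ValueError("level must be 'first' or 'full'")
    return hits / len(golds)


def invalid_rate(preds: Sequence[str], valid_codes: Iterable[str]) -> float:
    """Share of generated codes that do not appear in the set of valid codes."""
    if not preds:
        raise ValueError("preds must be non-empty")
    valid = set(valid_codes)
    return sum(p.strip() not in valid for p in preds) / len(preds)


# Toy data for illustration only; "TP391.1X" is a deliberately malformed string.
golds = ["TP391.1", "G254.1", "TP391.1", "G254.1"]
preds = ["TP391.1", "G254", "G254.1", "TP391.1X"]
valid_codes = {"TP391.1", "G254.1", "G254"}  # hypothetical stand-in for the full scheme

print("first-level accuracy:", accuracy(preds, golds, level="first"))
print("full-code accuracy:", accuracy(preds, golds, level="full"))
print("invalid-code rate:", invalid_rate(preds, valid_codes))
```

Under these assumptions, a prediction such as `G254` for a gold code `G254.1` counts as correct at the first level but wrong as a complete code, which is one way first-level accuracy can be much higher than complete-code accuracy.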

Authors: Luo Pengcheng (罗鹏程); Wang Jimin (王继民); Nie Lei (聂磊) (Peking University Library, Beijing 100871; Department of Information Management, Peking University, Beijing 100871; Academy of Regional and Global Governance, Beijing Foreign Studies University, Beijing 100089)

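Affiliations: Peking University Library; Department of Information Management, Peking University; Academy of Regional and Global Governance, Beijing Foreign Studies University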

Source: Information Studies: Theory & Application (《情报理论与实践》), indexed in CSSCI and the Peking University Core Journals list, 2024, No. 12, pp. 174-182 (9 pages)

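Funding: An outcome of the National Social Science Fund of China project "Research on Clue Discovery Methods for Multilingual Social Science Data" (面向多语种社会科学数据的线索发现方法研究), project No. 22CTQ025.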

Keywords: large language model; automatic classification; literature resources; hierarchical classification; cross-language classification

Classification codes: TP391.1 [Automation and Computer Technology / Computer Application Technology]; G254.1 [Cultural Sciences / Library Science]
