
Research on Automatic Classification of Literature Resources Based on Generative Large Language Models (基于生成式大语言模型的文献资源自动分类研究)
Abstract [Purpose/Significance] To explore effective methods for improving automatic hierarchical classification and cross-language classification of literature resources. [Method/Process] Literature classification is treated as a classification-code generation task. Training and test sets are constructed from library cataloging data; large language models such as ChatGLM 3 and Llama 2 are adapted to the training set with parameter-efficient fine-tuning, and their classification performance is analyzed on Chinese and English test sets. [Result/Conclusion] Among the output formats compared, fine-tuning the model to output the classification code directly gives the best classification performance. Performance improves steadily as the number of training samples increases. A model fine-tuned on 22,000 samples reaches an accuracy of 0.8848 on first-level categories of the Chinese Library Classification (CLC) and 0.5076 on complete classification codes, outperforming general-purpose large language models. Models trained on Chinese literature classify English literature effectively, with performance only slightly lower than on Chinese literature. A small number of the classification codes generated by the models are not valid CLC codes.
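The abstract reports three quantities: accuracy on the first-level CLC category, accuracy on the complete classification code, and the share of generated codes that are not valid CLC codes. The following is a minimal sketch of how such figures could be computed. It is not the authors' evaluation code. The function names, the sample codes, and the `valid_codes` lookup set are all hypothetical, and the sketch assumes that the first-level category is the leading letter of a code.

```python
from typing import Collection, Sequence

# Assumption: CLC has 22 single-letter main classes (L, M, W and Y are unused).
CLC_TOP_LEVEL = set("ABCDEFGHIJKNOPQRSTUVXZ")


def normalize(code: str) -> str:
    """Strip whitespace and upper-case a classification code."""
    return code.strip().upper()


def first_level(code: str) -> str:
    """Return the leading letter of a code, taken here as its first-level category."""
    return normalize(code)[:1]


def evaluate(predicted: Sequence[str], gold: Sequence[str]) -> dict:
    """Compute first-level and complete-code accuracy over paired lists."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal in length")
    n = len(gold)
    pairs = list(zip(predicted, gold))
    top_hits = sum(first_level(p) == first_level(g) for p, g in pairs)
    full_hits = sum(normalize(p) == normalize(g) for p, g in pairs)
    return {"first_level_acc": top_hits / n, "full_code_acc": full_hits / n}


def invalid_share(predicted: Sequence[str], valid_codes: Collection[str]) -> float:
    """Share of generated codes absent from a (hypothetical) table of valid codes."""
    if not predicted:
        raise ValueError("predicted must be non-empty")
    valid = {normalize(c) for c in valid_codes}
    return sum(normalize(p) not in valid for p in predicted) / len(predicted)


if __name__ == "__main__":
    # Illustrative codes only; they are not taken from the paper's data.
    gold = ["TP391.1", "G254.1", "F270", "I247.5"]
    predicted = ["TP391.4", "G254.1", "F272", "K825"]
    print(evaluate(predicted, gold))
    print(invalid_share(predicted, {"TP391.1", "TP391.4", "G254.1", "F270", "F272"}))
```

Under these assumptions, a prediction can count as correct at the first level while missing the complete code, which is consistent with the gap between the two accuracies reported in the abstract (0.8848 versus 0.5076).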
Authors Luo Pengcheng (罗鹏程); Wang Jimin (王继民); Nie Lei (聂磊). Affiliations: Peking University Library, Beijing 100871; Department of Information Management, Peking University, Beijing 100871; Academy of Regional and Global Governance, Beijing Foreign Studies University, Beijing 100089
Source Information Studies: Theory & Application (《情报理论与实践》), CSSCI, Peking University Core Journal, 2024, No. 12, pp. 174-182 (9 pages)
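Funding An outcome of the National Social Science Fund of China project "Research on Clue Discovery Methods for Multilingual Social Science Data" (面向多语种社会科学数据的线索发现方法研究), project No. 22CTQ025.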
Keywords large language model; automatic classification; literature resources; hierarchical classification; cross-language classification
