Abstract
[Purpose/significance] To explore effective methods for improving automatic hierarchical classification and cross-language classification of literature resources. [Method/process] Literature resource classification is treated as a classification code generation task. Training and test datasets are constructed from library cataloging data, large language models such as ChatGLM 3 and Llama 2 are adapted on the training dataset with parameter-efficient fine-tuning, and their classification performance is analyzed on Chinese and English test datasets. [Result/conclusion] Among the output formats compared, fine-tuning the large language model to output the classification code directly yields the best classification performance. Classification performance keeps improving as the number of training samples increases. A large language model fine-tuned on 22,000 samples reaches an accuracy of 0.8848 on first-level categories of the Chinese Library Classification and 0.5076 on complete classification codes, outperforming general-purpose large language models. Large language models trained on Chinese literature resources can effectively classify English literature resources, with performance only slightly lower than on Chinese literature resources. A small number of the classification codes generated by the large language models are not valid Chinese Library Classification codes.
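The abstract reports accuracy at two granularities (first-level category and complete classification code) and notes that some generated codes are not valid. The Python sketch below shows one plausible way to compute such figures. It is not the authors' evaluation code: the function names, the toy code strings, and the `valid` set are hypothetical, and it assumes that the first-level category of a Chinese Library Classification code is its leading letter and that "accuracy" means exact match after normalization.

```python
from typing import Callable, Iterable, Sequence


def normalize(code: str) -> str:
    """Normalize a classification code for exact-match comparison (assumed rule)."""
    return code.strip().upper()


def first_level(code: str) -> str:
    """Return the leading letter, taken here as the first-level category."""
    return normalize(code)[:1]


def accuracy(predicted: Sequence[str], gold: Sequence[str],
             key: Callable[[str], str] = normalize) -> float:
    """Share of samples whose predicted code matches the gold code under `key`."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    hits = sum(key(p) == key(g) for p, g in zip(predicted, gold))
    return hits / len(gold)


def invalid_rate(predicted: Sequence[str], valid_codes: Iterable[str]) -> float:
    """Share of generated codes that are absent from a reference set of valid codes."""
    if not predicted:
        raise ValueError("cannot compute a rate on an empty prediction list")
    valid = {normalize(c) for c in valid_codes}
    return sum(normalize(p) not in valid for p in predicted) / len(predicted)


# Toy data for illustration only; these strings are not taken from the paper.
gold = ["TP391.1", "G254.1", "F830.9", "I247.5"]
predicted = ["TP391.1", "G250.7", "F830.9", "K825.6"]
valid = {"TP391.1", "G254.1", "G250.7", "F830.9", "I247.5"}  # hypothetical reference set

full_acc = accuracy(predicted, gold)                     # complete-code accuracy
top_acc = accuracy(predicted, gold, key=first_level)     # first-level accuracy
bad_rate = invalid_rate(predicted, valid)                # share of invalid codes
```

Passing a different `key` function is one way to score intermediate levels of the hierarchy with the same routine, provided a rule for truncating a code to that level is available.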
Authors
罗鹏程
王继民
聂磊
Luo Pengcheng; Wang Jimin; Nie Lei (Peking University Library, Beijing 100871; Department of Information Management, Peking University, Beijing 100871; Academy of Regional and Global Governance, Beijing Foreign Studies University, Beijing 100089)
Source
《情报理论与实践》
CSSCI
Peking University Core Journals (北大核心)
2024, No. 12, pp. 174-182 (9 pages)
Information Studies: Theory & Application
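Funding
Supported by the National Social Science Fund of China project "Research on Clue Discovery Methods for Multilingual Social Science Data" (面向多语种社会科学数据的线索发现方法研究), project No. 22CTQ025.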
Keywords
large language model
automatic classification
literature resources
hierarchical classification
cross-language classification