Abstract
The frequent release of large language models (LLMs) presents both opportunities and challenges for LLM evaluation research. While evaluation systems for general-domain LLMs are maturing, the evaluation of LLMs in vertical domains is still at an early stage. Taking classical Chinese as its entry point, this study constructs a set of evaluation tasks for the ancient-book domain along two dimensions, language and knowledge, and evaluates 13 general-domain LLMs that rank among the stronger performers on current major leaderboards. The results show that ERNIE-Bot leads the other models by a wide margin in domain knowledge of ancient books, while GPT-4 achieves the best performance in language ability; among open-source models, the ChatGLM series performs best. By constructing evaluation tasks and datasets, the study establishes a set of evaluation standards for LLMs in the ancient-book domain, offering a reference for evaluating LLM performance in this domain and a basis for selecting base models when training ancient-book LLMs in the future.
Authors
朱丹浩
赵志枭
张一平
孙光耀
刘畅
胡蝶
王东波
Zhu Danhao; Zhao Zhixiao; Zhang Yiping; Sun Guangyao; Liu Chang; Hu Die; Wang Dongbo (Department of Criminal Science and Technology, Jiangsu Police Institute, Nanjing 210031; School of Information Management, Nanjing Agricultural University, Nanjing 210095)
Source
《信息资源管理学报》
CSSCI
2024, No. 5, pp. 45-58 (14 pages)
Journal of Information Resources Management
Funding
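This work is one of the research outputs of the following projects:
Major Project of the National Social Science Fund of China, "Research on the Construction and Application of a Cross-Language Knowledge Base of Ancient Chinese Classics" (21&ZD331)
Jiangsu Provincial College Students' Practice, Innovation and Entrepreneurship Training Program, "Research on a Vertical Search Engine for Literature Resources on the Public Security Intranet" (202210329046Y)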
Keywords
Large language model
Generative tasks
Large model evaluation
Ancient books
Domain knowledge