Abstract
Objective: To evaluate the performance of domestic (Chinese) and international large language models (LLMs) on a clinical laboratory knowledge question bank.
Methods: A set of 330 past questions from the intermediate-level clinical medical laboratory technology examination was used to assess six domestic and international LLMs. Differences in accuracy and consistency among the LLMs were evaluated using chi-square tests, Fisher's exact tests and logistic regression.
Results: The accuracy rates of the four English LLMs, with 95% confidence intervals (95% CI), were 0.56 (95% CI: 0.527-0.601) for ChatGPT, 0.61 (95% CI: 0.572-0.644) for BingAI, 0.64 (95% CI: 0.607-0.678) for Claude and 0.80 (95% CI: 0.767-0.833) for GPT-4. The accuracy rates of Xinghuo and Tiangong were 0.52 (95% CI: 0.479-0.561) and 0.45 (95% CI: 0.408-0.482), respectively. With ChatGPT as the reference model, the odds ratios (OR) of a correct answer for BingAI, Claude and GPT-4 were 1.272 (95% CI: 1.020-1.588), 1.397 (95% CI: 1.119-1.743) and 3.270 (95% CI: 1.904-2.729), respectively, and the differences in performance were statistically significant (P<0.05 for all three models). In terms of consistency, Tiangong and BingAI were relatively poor, while GPT-4 was relatively good.
Conclusion: Among the six LLMs, GPT-4 showed the highest accuracy and consistency, both overall and for each question type.
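The abstract reports accuracy rates with 95% confidence intervals and odds ratios relative to a reference model. As a minimal sketch of how such figures can be computed, the Python below uses a Wald interval for a proportion and a Woolf (log-scale) interval for an odds ratio from a 2x2 table. This is an illustration only, not the authors' procedure: the abstract does not give per-model answer counts or the interval method, so the function names and the counts in the usage lines are hypothetical. For a single categorical predictor, the odds ratio from logistic regression against a reference category equals the cross-product ratio computed here.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% normal quantile


def wald_ci(correct, total, z=Z95):
    """Accuracy and its Wald confidence interval, clipped to [0, 1]."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)


def odds_ratio_ci(a, b, c, d, z=Z95):
    """Odds ratio and Woolf confidence interval from a 2x2 table.

    a, b: correct / incorrect answers of the comparison model
    c, d: correct / incorrect answers of the reference model
    All four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to exercise the functions.
acc, acc_lo, acc_hi = wald_ci(50, 100)
or_, or_lo, or_hi = odds_ratio_ci(80, 20, 50, 50)
print(f"accuracy {acc:.2f} (95% CI: {acc_lo:.3f}-{acc_hi:.3f})")
print(f"OR {or_:.3f} (95% CI: {or_lo:.3f}-{or_hi:.3f})")
```

By construction, an interval computed this way always contains its point estimate, which gives a quick sanity check when reading reported ORs and intervals.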
Authors
刘月嫦
陈紫茹
杨敏
付琛
曾涛
LIU Yuechang; CHEN Ziru; YANG Ming; FU Chen; ZENG Tao (Department of Clinical Laboratory, The Sixth Affiliated Hospital, Sun Yat-sen University, Zhongliu Biomedical Innovation Center, Huangpu District, Guangzhou 510655, Guangdong, China)
Source
《临床检验杂志》
CAS
2023, No. 12, pp. 941-944 (4 pages)
Chinese Journal of Clinical Laboratory Science
Funding
Guangdong Provincial Clinical Medical Research Center for Digestive System Diseases project (2020B1111170004).
Keywords
clinical laboratory medicine
large language models
AI large model
artificial intelligence