Abstract
Objective: To analyse the relationship between the occurrence of type 2 diabetes mellitus (T2DM) and exposure to polybrominated diphenyl ethers (PBDEs), and to construct and evaluate machine learning models for predicting the occurrence of T2DM.
Methods: A total of 1425 subjects were selected from the NHANES database, including 1132 participants without T2DM and 293 T2DM patients. Clinical data were compared between the two groups, and variables with statistically significant differences were passed to Boruta feature selection to clarify the relationship between T2DM occurrence and PBDEs and to identify influencing factors. The selected factors were entered into R, and the data were randomly partitioned into an 80% training set and a 20% validation set using the createDataPartition function. Seven algorithms were used to build machine learning models: logistic regression, extreme gradient boosting (XGBoost), light gradient boosting (LightGBM), adaptive boosting (AdaBoost), K-nearest neighbours (KNN), naive Bayes (CNB), and support vector machine (SVM). The models were trained on the training set and internally validated on the validation set using ten-fold cross-validation. Models were evaluated using ROC curves and the area under the curve (AUC). The best-performing model was then externally validated in 71 adult T2DM patients seen at the Department of Endocrinology of the First Affiliated Hospital of Xinjiang Medical University and 100 healthy check-up subjects. The SHAP tool was used to analyse the interpretability of the best-performing model and to assess the importance of each feature in its decisions.
Results: BMI, waist circumference, education level, the proportion with a family history of diabetes, serum high-density lipoprotein (HDL), and serum BDE-28, BDE-47, BDE-99, BDE-183 and BDE-209 concentrations were all higher in T2DM patients than in participants without T2DM (all P<0.05). Boruta feature selection retained waist circumference, BMI, family history of diabetes, and serum BDE-47, BDE-99, BDE-28, BDE-209 and BDE-183 as influencing factors for the occurrence of T2DM, and these were used to build the prediction models. In internal validation on both the training set and the validation set, the XGBoost model had the highest AUC and ranked near the top for accuracy, Kappa value, sensitivity and specificity, so it was selected as the best-performing model. In external validation, the XGBoost model had an accuracy of 0.702, a sensitivity of 0.549, a specificity of 0.787, and an AUC (95% CI) of 0.674 (0.575-0.773). SHAP analysis of the XGBoost predictions showed that waist circumference and serum BDE-47 were the most important predictive features; serum BDE-99, serum BDE-209, BMI and family history of diabetes also had high importance, whereas serum BDE-28 and BDE-183 had relatively low importance.
Conclusions: Serum BDE-47, BDE-99, BDE-28, BDE-209 and BDE-183 are independent influencing factors for the occurrence of T2DM. The XGBoost model based on serum PBDEs, waist circumference, BMI and family history of diabetes showed relatively high predictive performance for the occurrence of T2DM and is of some value in predicting it.
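The partitioning and AUC evaluation described in the Methods can be illustrated with a minimal sketch. It is written in Python with only the standard library, whereas the study used R, so it is not the authors' code and does not reproduce the exact behaviour of createDataPartition. The class sizes (1132 and 293) come from the abstract; the function names `stratified_split` and `auc`, the random seeds, and the synthetic scores are assumptions made purely for illustration.

```python
import random


def stratified_split(labels, train_frac=0.8, seed=0):
    """Return (train_idx, valid_idx), sampling train_frac of each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, valid = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train.extend(idx[:cut])
        valid.extend(idx[cut:])
    return sorted(train), sorted(valid)


def auc(labels, scores):
    """AUC as the probability that a random positive case outscores
    a random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Class sizes taken from the abstract; the scores are synthetic (assumption)
# and stand in for the predicted probabilities of some fitted model.
labels = [0] * 1132 + [1] * 293
rng = random.Random(42)
scores = [rng.gauss(0.6 if y else 0.4, 0.2) for y in labels]

train_idx, valid_idx = stratified_split(labels)
valid_auc = auc([labels[i] for i in valid_idx],
                [scores[i] for i in valid_idx])
print(len(train_idx), len(valid_idx), round(valid_auc, 3))
```

Splitting within each class keeps the proportion of T2DM cases roughly the same in the training and validation sets, which matters here because only about one in five subjects has T2DM.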
Authors
马英杰
陈楠
阿尔娜·恰依马尔旦
刘早玲
MA Yingjie; CHEN Nan; Aerna Chaimardan; LIU Zaoling (School of Public Health, Xinjiang Medical University, Urumqi 830054, China)
Source
《山东医药》
CAS
2024, Issue 17, pp. 1-6 (6 pages)
Shandong Medical Journal
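Funding
Open Project of the State Key Laboratory of Pathogenesis, Prevention and Treatment of High Incidence Diseases in Central Asia (SKL-HIDCA-2022-19)
National Natural Science Foundation of China (82160605).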
Keywords
type 2 diabetes mellitus
polybrominated diphenyl ethers
polybrominated diphenyl ether congeners
machine learning
prediction model