Abstract
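[Objective] To perform combined cytokine testing on patients with suspected tuberculosis and to analyze the cytokine levels with several machine learning algorithms that incorporate feature selection, in order to support the auxiliary diagnosis of active tuberculosis. [Methods] Based on serum cytokine level data from 42 patients with active tuberculosis and 38 patients with inactive tuberculosis, four feature selection methods, namely an improved multi-population genetic algorithm (IMPGA), the multi-population genetic algorithm (MPGA), particle swarm optimization (PSO), and Pearson correlation coefficient (PCC) screening, were combined with three classifiers, namely logistic regression (LR), support vector machine (SVM), and extreme gradient boosting (XGBoost), to examine classification performance for active tuberculosis and to identify key features. [Results] Machine learning methods combined with feature selection clearly outperformed the same methods applied directly without feature selection. Of all methods, IMPGA-SVM classified best, with an average of 4.4 selected features and an area under the receiver operating characteristic curve of 0.880. Analysis of the features selected by the best algorithm showed that monokine induced by γ-interferon measured after stimulation with the tuberculosis antigen ESAT6/CFP10 fusion protein (MIG-T) appeared more frequently than the other features. [Conclusion] In summary, machine learning methods combined with feature selection can assist in the diagnosis of active tuberculosis.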
[Objective] Tuberculosis (TB) is a widespread infectious disease. Clinical data show that, after stimulation with tuberculosis-specific antigen, cytokine secretion by peripheral blood T lymphocytes differs markedly between patients with active tuberculosis and those with latent infection. Such datasets record cytokine levels before and after antigen stimulation, which makes them suitable for machine learning. This study therefore combines a cytokine assay with machine learning algorithms that incorporate various feature selection strategies to analyze cytokine levels in patients with suspected tuberculosis and thereby support the auxiliary diagnosis of active tuberculosis.
[Methods] Serum cytokine levels were measured in 42 patients with active tuberculosis and 38 patients with inactive tuberculosis. Because the traditional multi-population genetic algorithm (MPGA) relies on a single-criterion fitness function, an improved MPGA (IMPGA) is proposed. Four feature selection methods, namely IMPGA, MPGA, particle swarm optimization (PSO), and Pearson correlation coefficient (PCC) selection, were combined with three classifiers, namely logistic regression (LR), support vector machine (SVM), and extreme gradient boosting (XGBoost), to examine classification performance for active tuberculosis and to select key features.
[Results] Regarding the feature selection results, MPGA-SVM, MPGA-XGBoost, IMPGA-SVM, and IMPGA-XGBoost retained significantly fewer features than the other methods. Grouped by classifier, the number of selected features increased in the order SVM, XGBoost, LR; grouped by feature selection method, no obvious pattern was observed. IFN-g-T and MIG-T appeared most frequently in the selection results across methods. Grouped by classifier, the most frequently selected features were IFN-g-T, GBP5-N, and IL-15-N for the XGBoost group, MIG-T for the SVM group, and IFN-g-T and Eotaxin-T for the LR group; again, no clear pattern emerged when the results were grouped by feature selection method. Regarding classification performance, the area under the curve (AUC) in the LR group ranged from 0.630 to 0.784, with PCC-LR performing best, a 0.037 improvement over LR without feature selection. The SVM group generally outperformed the LR group, with AUC values between 0.776 and 0.880; the best algorithm in this group was IMPGA-SVM, with an AUC of 0.880, a 0.052 increase over SVM without feature selection. In the XGBoost group, the AUC ranged from 0.722 to 0.832; the best algorithm was IMPGA-XGBoost, with an AUC of 0.832, a 0.078 increase over XGBoost without feature selection. Of the 15 methods evaluated, IMPGA-SVM achieved the best AUC, 0.880.
[Conclusion] In the selection results of IMPGA-SVM, the method with the best classification performance, monokine induced by γ-interferon measured after antigen stimulation (MIG-T) was selected markedly more often than any other feature. This underscores the pivotal role of MIG in predicting active tuberculosis and is consistent with findings in the related literature. The study also addresses certain deficiencies of the conventional MPGA, and the improved algorithm yielded the best model in this research. Across different selected feature sets, IMPGA improved average fitness over the traditional MPGA by 0.018, 0.008, and 0.010 for the LR, SVM, and XGBoost groups, respectively, enhancing the prediction of active tuberculosis in a relatively stable manner. In summary, using machine learning to assist in the diagnosis of active tuberculosis, together with feature selection to reduce feature dimensionality, achieves two objectives: it improves classification accuracy and it identifies key features, which makes the machine learning results more interpretable.
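The abstract names Pearson correlation coefficient (PCC) screening as one of the feature selection methods and AUC as the evaluation metric. The following is a minimal standard-library sketch of those two ideas only, not the authors' implementation. The toy data, the function names, the number of features, and the choice of feature index 2 as the informative one are all assumptions made for illustration; only the group sizes (42 and 38) come from the abstract. AUC is computed here by pairwise comparison of positive and negative scores, with ties counted as 0.5.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_by_pcc(X, y, k):
    """Return the indices of the k features with the largest |PCC| to the label."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_toy_data(seed=0):
    """Hypothetical data: 42 positives, 38 negatives, 6 Gaussian features.

    Feature 2 is shifted for positives so that it carries signal; this is an
    assumption for the demo, not a property of the study's cytokine data.
    """
    rng = random.Random(seed)
    y = [1] * 42 + [0] * 38
    X = []
    for label in y:
        row = [rng.gauss(0, 1) for _ in range(6)]
        row[2] += 2.0 * label
        X.append(row)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    chosen = select_by_pcc(X, y, k=2)
    print("selected feature indices:", chosen)
    # Score each selected feature on its own. There is no train/test split
    # here, so these AUC values are optimistic.
    for j in chosen:
        print(j, round(auc([row[j] for row in X], y), 3))
```

A filter such as PCC ranks each feature independently of the classifier, whereas the genetic and particle swarm methods named in the abstract search over feature subsets; the sketch covers only the filter case.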
Authors
肖敬达
黄玉麟
刘博闻
刘伟
黄辉彬
张东旭
夏宁邵
XIAO Jingda; HUANG Yulin; LIU Bowen; LIU Wei; HUANG Huibin; ZHANG Dongxu; XIA Ningshao (School of Public Health, Xiamen University, Xiamen 361102, China)
Source
《厦门大学学报(自然科学版)》
CAS
CSCD
PKU Core (Peking University core journal list)
2024, No. 1, pp. 134-141 (8 pages)
Journal of Xiamen University:Natural Science
Funding
National Natural Science Foundation of China (62003684).
Keywords
cytokine
active tuberculosis
feature selection
machine learning