Abstract
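[Objective] Using the SEER database, this study builds a 5-year survival prediction model for gastric cancer, aims to improve the model's discriminative performance (especially its ability to identify surviving patients), and analyzes the factors that affect 5-year survival, in order to support prognosis evaluation for gastric cancer. [Methods] Drawing on ensemble learning and the idea of EasyEnsemble, data imbalance is handled by combining data-level and model-level techniques: multiple Gradient Boosting classifiers are integrated by Bagging to build a prediction model on imbalanced gastric cancer survival data, and SHAP values are used to interpret the factors affecting 5-year survival. [Results] The model achieves an accuracy of 0.808 and an AUC of 0.883, with a prediction accuracy of 0.835 for surviving patients, who form the minority class. Compared with other models, it predicts the 5-year survival status of gastric cancer patients better. In addition, the number of positive lymph nodes, tumor stage and grade, and age have high SHAP values. [Limitations] The prognostic factors recorded in the SEER database are limited, which to some extent restricts the model's performance and affects the prediction results. [Conclusions] The model performs well and also discriminates well for surviving patients in the minority class. The number of positive lymph nodes, tumor stage and grade, and age have an important influence on the 5-year survival probability of gastric cancer patients, which is consistent with clinical experience.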
[Objective] This paper constructs a model to predict 5-year survival for gastric cancer based on the SEER database, aiming to support prognosis evaluation for gastric cancer and to analyze the factors affecting patients’ 5-year survival. [Methods] Drawing on ensemble learning, and in particular the idea of EasyEnsemble, we handled the data imbalance issue by combining data-level and model-level techniques. We integrated multiple Gradient Boosting classifiers with Bagging to build a prediction model on imbalanced gastric cancer survival data. Finally, we used SHAP values to identify the factors affecting the 5-year survival of gastric cancer patients. [Results] The model’s prediction accuracy reached 0.808, with an AUC of 0.883. The prediction accuracy for surviving patients, the minority class, was 0.835. Compared with other models, our method yielded better prediction performance. We also found that the number of positive regional lymph nodes, tumor stage and grade, and age had higher SHAP values. [Limitations] The prognostic factors available in the SEER database were limited, which constrained our model’s performance. [Conclusions] The model predicts 5-year survival for gastric cancer effectively, including for the minority class of surviving patients, and identifies factors influencing patients’ 5-year survival probability.
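The Methods summary above (balanced resampling in the spirit of EasyEnsemble, with Gradient Boosting members combined by Bagging) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fit_balanced_gb_ensemble` and `predict_proba_ensemble`, the synthetic data, the number of members, and all hyperparameters are assumptions chosen only for illustration, and no SEER data is used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def fit_balanced_gb_ensemble(X, y, n_members=5, seed=0):
    """Train one Gradient Boosting model per balanced subset (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    members = []
    for m in range(n_members):
        # Data level: undersample the majority class to the minority class size.
        sampled = rng.choice(maj_idx, size=min_idx.size, replace=False)
        idx = np.concatenate([min_idx, sampled])
        # Model level: each balanced subset gets its own boosted classifier.
        clf = GradientBoostingClassifier(n_estimators=50, random_state=seed + m)
        clf.fit(X[idx], y[idx])
        members.append(clf)
    return members


def predict_proba_ensemble(members, X):
    """Average the members' predicted probabilities for class 1."""
    return np.mean([clf.predict_proba(X)[:, 1] for clf in members], axis=0)


# Synthetic imbalanced data: label 1 plays the role of the minority class
# (surviving patients in the abstract); roughly 20% of samples.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

members = fit_balanced_gb_ensemble(X_train, y_train, n_members=5, seed=0)
proba = predict_proba_ensemble(members, X_test)
auc = roc_auc_score(y_test, proba)
print(f"AUC on synthetic data: {auc:.3f}")
```

Every member sees all minority-class samples, while the majority class is covered by different random subsets, so little majority information is discarded across the ensemble. The printed AUC describes the synthetic data only and is unrelated to the figures reported in the abstract.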
Authors
徐良辰
郭崇慧
Xu Liangchen; Guo Chonghui (Institute of Systems Engineering, Dalian University of Technology, Dalian 116024, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
PKU Core Journal
2021, Issue 8, pp. 86-99 (14 pages)
Data Analysis and Knowledge Discovery
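Funding
This work is one of the research outputs of the National Natural Science Foundation of China (Grant No. 71771034)
and the Fundamental Research Funds for the Central Universities (Grant No. DUT21YG108).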
Keywords
Survival Prediction
Ensemble Learning
Data Imbalance
Gastric Cancer
Interpretability