Abstract: This study analyzes and predicts the survival time of lung cancer patients using random survival forests and the Cox proportional hazards model. The data come from the cancer dataset in the survival package for R. First, a random forest was used for variable selection, which identified sex and performance status as the key variables affecting survival time. A Cox proportional hazards model was then fitted to examine the effects of these variables further. Higher performance status scores were associated with a higher risk of death, while female patients tended to survive longer, although this effect was less statistically significant. The Cox model discriminated well between survival times and was significant overall. To visualize the difference in survival probability between risk groups, survival curves were plotted; they showed that survival probability in the high-risk group was significantly lower than in the low-risk group. The ROC curve and its AUC value indicated that the model has moderate ability to distinguish high-risk from low-risk patients. In addition, bootstrap resampling supported the stability of the model: the coefficient estimates for sex and performance status were fairly consistent across resamples. Shapley values, used to explain variable contributions, further confirmed that sex and performance status are important predictors of survival time. In summary, through systematic variable selection, model analysis, and multiple evaluation methods, this study reveals the significant effects of sex and performance status on the survival time of lung cancer patients and verifies the robustness and validity of the model, providing an important reference for predicting patient prognosis in clinical practice.
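As an illustration of the risk-group survival curves described in the abstract, the following is a minimal sketch of the Kaplan-Meier estimator in plain Python. The abstract does not state which estimator or software routine produced the curves, so the choice of Kaplan-Meier is an assumption; the function name `kaplan_meier` and the toy follow-up times are hypothetical and are not taken from the cancer dataset.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed death time.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if the patient was censored
    """
    # Count observed deaths at each distinct time; censored records add no deaths.
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Multiply by the conditional probability of surviving past t.
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data for two risk groups (illustration only).
low_risk = kaplan_meier([5, 8, 12, 12, 15, 20], [1, 0, 1, 0, 1, 0])
high_risk = kaplan_meier([2, 3, 3, 6, 7, 9], [1, 1, 1, 0, 1, 1])

for label, curve in (("low risk", low_risk), ("high risk", high_risk)):
    for t, s in curve:
        print(f"{label}: S({t}) = {s:.4f}")
```

In this toy example the high-risk curve falls faster than the low-risk curve, which is the qualitative pattern the abstract reports. A formal comparison between groups, such as a log-rank test, is outside this sketch.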
Funding: Supported by the National Natural Science Foundation of China (11126238, 11201006) and the Humanities and Social Science Projects of the Ministry of Education of China (2012YJC2748).