Survival Prediction Analysis of Cancer Patients Oriented to Unbalanced Data (cited by: 4)
Abstract: Cancer data sets often contain unbalanced classes and noisy samples. To address this problem, this paper proposes RENN-SMOTE-SVM, a survival prediction algorithm for cancer patients based on the RENN and SMOTE algorithms. Using the nearest neighbor rule, the RENN algorithm reduces the number of noisy samples in the majority class, and the SMOTE algorithm increases the number of minority class samples by linear interpolation between them, which yields a balanced data set. The algorithm is evaluated by predicting patient survival on an unbalanced breast cancer patient data set from the American Cancer Database. The experimental results show that, compared with five commonly used algorithms, including SVM and Tomeklinks-SVM, the proposed algorithm achieves better classification and prediction performance, with an accuracy of 0.883, an F1-score of 0.904, and a G-means value of 0.779.
Authors: MIAO Lizhi (苗立志); BAI Ruisimeng (白瑞思蒙); LIU Chengliang (刘成良); ZHAI Yuehao (翟月昊). Affiliations: College of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; Smart Health Big Data Analysis and Location Services Engineering Laboratory of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China.
Source: Computer Engineering (《计算机工程》), 2021, No. 12, pp. 316-320 (5 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: Jiangsu Province "Double Innovation Doctor" (双创博士) Program (CZ032SC20025).
Keywords: disease prediction; machine learning; data analysis; unbalanced data; SMOTE algorithm
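
The two resampling steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`knn_indices`, `renn`, `smote`, `g_mean`), the majority-vote editing rule, the neighbour counts, and the toy data are all assumptions made for illustration. The SVM step is omitted; in practice a classifier such as scikit-learn's `SVC` would presumably be trained on the balanced output.

```python
import math
import random


def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean), excluding i."""
    dists = sorted((math.dist(points[i], points[j]), j)
                   for j in range(len(points)) if j != i)
    return [j for _, j in dists[:k]]


def renn(X, y, majority, k=3, max_iter=100):
    """Repeated edited nearest neighbours, applied to the majority class only.

    Assumed rule: a majority sample is dropped when most of its k nearest
    neighbours belong to another class; passes repeat until nothing is dropped.
    """
    X, y = list(X), list(y)
    for _ in range(max_iter):
        keep = []
        for i in range(len(X)):
            votes = [y[j] for j in knn_indices(X, i, k)]
            # Minority samples are always kept.
            if y[i] != majority or 2 * votes.count(majority) > len(votes):
                keep.append(i)
        if len(keep) == len(X):  # nothing removed: editing has converged
            break
        X, y = [X[i] for i in keep], [y[i] for i in keep]
    return X, y


def smote(X, y, minority, n_new, k=5, seed=0):
    """Add n_new synthetic minority samples by linear interpolation."""
    rng = random.Random(seed)
    pts = [x for x, label in zip(X, y) if label == minority]
    if len(pts) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(pts) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(pts))
        j = rng.choice(knn_indices(pts, i, k))  # a random minority neighbour
        gap = rng.random()                      # position along the segment
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(pts[i], pts[j])))
    return list(X) + synthetic, list(y) + [minority] * len(synthetic)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))


# Toy data (hypothetical): class 0 is the majority, class 1 the minority,
# and (5.1, 5.1) is a noisy majority sample sitting inside the minority cluster.
majority_pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0, 2), (2, 0), (2, 2)]
noise_pts = [(5.1, 5.1)]
minority_pts = [(5, 5), (5, 6), (6, 5), (6, 6)]
X = majority_pts + noise_pts + minority_pts
y = [0] * 9 + [1] * 4

X_clean, y_clean = renn(X, y, majority=0)
n_new = y_clean.count(0) - y_clean.count(1)
X_bal, y_bal = smote(X_clean, y_clean, minority=1, n_new=n_new)
print("after RENN :", y_clean.count(0), "majority,", y_clean.count(1), "minority")
print("after SMOTE:", y_bal.count(0), "majority,", y_bal.count(1), "minority")
```

Cleaning before oversampling matters in this sketch because SMOTE interpolates only between minority samples: a noisy majority sample left inside the minority region would end up surrounded by synthetic points of the other class.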
