Abstract
Cancer data sets are often imbalanced and contain noisy samples. To address this problem, this paper proposes RENN-SMOTE-SVM, an algorithm for predicting the survival of cancer patients that builds on the RENN and SMOTE algorithms. Following the nearest-neighbor rule, the RENN algorithm reduces the number of noisy samples in the majority class, and the SMOTE algorithm increases the number of minority-class samples by linear interpolation between them, yielding a balanced data set. The algorithm is evaluated by predicting patient survival on an imbalanced data set of breast cancer patients from the American Cancer Database. The experimental results show that RENN-SMOTE-SVM classifies and predicts better than five commonly used algorithms, including SVM and Tomeklinks-SVM, achieving an accuracy of 0.883, an F1-score of 0.904, and a G-means value of 0.779.
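The resampling stage described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: it assumes binary labels and Euclidean distance, and the function names (`renn`, `smote`, `g_mean`), the neighbor count `k=3`, and the toy data are all hypothetical. The SVM training step is omitted, and `g_mean` uses the common definition (square root of sensitivity times specificity), which the paper is assumed to follow.

```python
import math
import random


def knn_indices(X, i, k):
    """Indices of the k nearest neighbours of X[i] (Euclidean), excluding i."""
    others = [j for j in range(len(X)) if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def enn_pass(X, y, majority, k=3):
    """One Edited Nearest Neighbours pass: drop majority samples that are
    outvoted by the other class among their k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        if y[i] == majority:
            votes = [y[j] for j in knn_indices(X, i, k)]
            if votes.count(majority) * 2 < len(votes):
                continue  # treated as a noisy majority sample
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def renn(X, y, majority, k=3):
    """Repeated ENN: apply enn_pass until no further sample is removed."""
    while True:
        X_new, y_new = enn_pass(X, y, majority, k)
        if len(X_new) == len(X):
            return X_new, y_new
        X, y = X_new, y_new


def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority samples by linear interpolation between a
    minority sample and one of its k nearest minority neighbours, until
    both classes have the same size (binary labels assumed)."""
    rng = rng or random.Random(0)
    pts = [x for x, label in zip(X, y) if label == minority]
    X_out, y_out = list(X), list(y)
    if len(pts) < 2:
        return X_out, y_out  # nothing to interpolate between
    n_needed = (len(X) - len(pts)) - len(pts)
    for _ in range(max(n_needed, 0)):
        i = rng.randrange(len(pts))
        j = rng.choice(knn_indices(pts, i, k))
        gap = rng.random()  # position along the segment pts[i] -> pts[j]
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(pts[i], pts[j])))
        y_out.append(minority)
    return X_out, y_out


def g_mean(y_true, y_pred, positive):
    """Geometric mean of sensitivity and specificity (both classes present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    pos = sum(t == positive for t in y_true)
    neg = len(y_true) - pos
    return math.sqrt((tp / pos) * (tn / neg))


# Toy data (hypothetical): class 0 is the majority, class 1 the minority;
# the point (5.0, 5.0) is a majority sample sitting inside the minority cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
     (5.0, 5.0), (5, 4), (4, 5), (6, 5)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

X_clean, y_clean = renn(X, y, majority=0)        # remove noisy majority samples
X_bal, y_bal = smote(X_clean, y_clean, minority=1)  # then oversample the minority
print(len(X_clean), y_bal.count(0), y_bal.count(1))
```

The balanced set `X_bal, y_bal` would then be passed to an SVM classifier, and accuracy, F1-score, and G-mean computed on held-out predictions.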
Authors
MIAO Lizhi
BAI Ruisimeng
LIU Chengliang
ZHAI Yuehao
College of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; Smart Health Big Data Analysis and Location Services Engineering Laboratory of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journals (北大核心)
2021, No. 12, pp. 316-320 (5 pages)
Funding
Jiangsu Province "Shuangchuang Doctor" (Innovation and Entrepreneurship Doctor) Program (CZ032SC20025).