Abstract
To address the problem of incomplete breast cancer data, this study proposes a kmeans-KNN method for handling missing values. The training set is first clustered, and KNN is used to fill in its missing values; a linear regression model trained on the completed training set then fills in the missing values of the test set. Next, the machine learning algorithms XGBoost, RF, KNN, and SVM are trained on the completed training set, and the resulting models are evaluated on the completed test set. The results show that kmeans-KNN outperforms common imputation methods such as EM and MICE for missing-value preprocessing, and that kmeans-KNN+SVM achieves the best accuracy and AUC.
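The imputation steps described in the abstract could be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the abstract does not say how clustering copes with missing entries, so the sketch assumes a mean-filled copy is used for k-means, that KNN imputation is run separately inside each cluster, and that each incomplete test row is filled by a linear regression from its observed features. The function names `kmeans_knn_impute` and `regression_impute`, the parameter defaults, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression


def kmeans_knn_impute(X_train, n_clusters=2, n_neighbors=3, seed=0):
    """Fill NaNs in the training set: cluster first, then KNN-impute per cluster."""
    # Assumption: k-means needs complete rows, so cluster on a mean-filled copy.
    rough = SimpleImputer(strategy="mean").fit_transform(X_train)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(rough)
    filled = rough.copy()  # mean-filled values remain as a fallback
    for c in range(n_clusters):
        rows = labels == c
        block = X_train[rows]
        # Skip clusters where some feature has no observed value at all,
        # since KNN has nothing to borrow from there.
        if np.isnan(block).all(axis=0).any():
            continue
        filled[rows] = KNNImputer(n_neighbors=n_neighbors).fit_transform(block)
    return filled


def regression_impute(X_train_full, X_test):
    """Fill NaNs in test rows using linear regression fitted on the completed training set."""
    X_out = X_test.copy()
    for i, row in enumerate(X_test):
        miss = np.isnan(row)
        if not miss.any():
            continue
        if miss.all():
            # Nothing observed to regress on: fall back to training means.
            X_out[i] = X_train_full.mean(axis=0)
            continue
        # Predict the missing features from the features observed in this row.
        model = LinearRegression().fit(X_train_full[:, ~miss], X_train_full[:, miss])
        X_out[i, miss] = model.predict(row[~miss].reshape(1, -1))[0]
    return X_out


# Synthetic stand-in data (not the breast cancer data used in the study).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=60)
X_train, X_test = X[:40].copy(), X[40:].copy()
mask_train = rng.random(X_train.shape) < 0.1
mask_test = rng.random(X_test.shape) < 0.1
X_train[mask_train] = np.nan
X_test[mask_test] = np.nan

train_full = kmeans_knn_impute(X_train)
test_full = regression_impute(train_full, X_test)
```

The completed arrays `train_full` and `test_full` would then be passed to a classifier such as an SVM, matching the final stage the abstract describes.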
Author
DENG Yufang (邓钰芳), School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
Source
Modern Information Technology (《现代信息科技》)
2021, No. 7, pp. 50-53 (4 pages)
Keywords
incomplete data
breast cancer
diagnosis prediction