Abstract
To evaluate how resampling methods affect class-imbalanced datasets, this study examines the widely used Wisconsin breast cancer diagnostic dataset from the United States. Experiments were carried out with three machine learning algorithms: logistic regression, support vector machine, and random forest. Four resampling methods (random over-sampling, random under-sampling, SMOTE, and ADASYN) were analyzed using F1 score and AUC. The experimental results indicate that all four resampling methods can improve model performance, with random under-sampling proving more effective for handling class-imbalanced datasets.
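The two random resampling schemes and the F1 metric named in the abstract can be illustrated with a minimal standard-library sketch. This is not the author's code: the helper names (`random_under_sample`, `random_over_sample`, `f1_score`), the toy data, and the fixed seed are all assumptions made for illustration, and the study itself may have relied on library implementations. SMOTE and ADASYN, which synthesize new minority samples by interpolation rather than by copying or dropping existing ones, are not shown.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect feature rows per class label (hypothetical helper)."""
    groups = {}
    for features, label in zip(X, y):
        groups.setdefault(label, []).append(features)
    return groups


def random_under_sample(X, y, seed=0):
    """Randomly drop rows so every class matches the smallest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_min = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, n_min))  # sample without replacement
        y_out.extend([label] * n_min)
    return X_out, y_out


def random_over_sample(X, y, seed=0):
    """Randomly duplicate rows so every class matches the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * n_max)
    return X_out, y_out


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy imbalanced data (an assumption, not the Wisconsin dataset):
# 3 minority samples (label 1) and 7 majority samples (label 0).
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

X_under, y_under = random_under_sample(X, y)
X_over, y_over = random_over_sample(X, y)
print(Counter(y_under))
print(Counter(y_over))
print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

In an evaluation of this kind, resampling is normally applied to the training split only, so that F1 and AUC are measured on test data that keeps its original class distribution.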
Author
Ding Haojie (丁浩杰), School of Big Data and Computer Science, Shanxi Institute of Science and Technology, Jincheng 048000, China
Source
Modern Computer (《现代计算机》), 2024, No. 14, pp. 36-40 (5 pages)
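Funding
2022 Science and Technology Innovation Project of Higher Education Institutions, Shanxi Provincial Department of Education (2022L621).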