Abstract
Imbalanced data sets contain unequal numbers of samples per class, which makes classification models difficult to train. To address the poor stability and low accuracy of classification models on imbalanced data, this paper proposes a data classification algorithm based on an improved C4.5 decision tree and, by integrating the SMOTE optimized sampling algorithm, constructs the N_C4.5-IDC imbalanced data classification model. First, the model uses K-Means clustering to analyze the state distribution of the data set and applies SMOTE-based hybrid sampling, adding artificial sample points to increase the number of minority-class samples and balance the data set. Then, the information gain ratio model at the core of the C4.5 decision tree is simplified to make feature selection more efficient, and the tree is post-pruned using a retraction loss comparison method, yielding a single N_C4.5 decision tree model. Finally, multiple N_C4.5 models are combined and weighted to build the N_C4.5-IDC model. Ablation experiments show that stacking the optimization strategies significantly improves the model's performance metrics. Comparative experiments show that, relative to the baseline classification algorithms, the proposed algorithm reaches an accuracy of up to 96.81%, improves recall by 6.15%, and improves overall performance by 5.66%. In summary, the imbalanced data classification model built on the improved C4.5 decision tree balances the data while improving the stability and accuracy of classification.
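Two building blocks named in the abstract can be illustrated with a minimal sketch. It assumes the textbook definitions: the standard C4.5 gain ratio (information gain divided by split information) and SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. It does not reproduce the paper's simplified gain ratio, its K-Means distribution analysis, its pruning step, or its weighted ensemble, none of which are specified in the abstract. The function names `entropy`, `gain_ratio`, and `smote_like` are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(feature_values, labels):
    """Standard C4.5 gain ratio for one categorical feature.

    Assumption: this is the textbook formula, not the simplified
    variant proposed in the paper.
    """
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information penalises features with many distinct values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by linear interpolation.

    Each new point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic
```

In a pipeline of the kind the abstract describes, a function like `smote_like` would be applied to the minority class before tree construction, and a score like `gain_ratio` would rank candidate split features at each node.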
Authors
陈婷
谢志龙
CHEN Ting; XIE Zhi-long (Chengdu College, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China; Southwestern University of Finance and Economics, Chengdu, Sichuan 611130, China)
Source
Computer Simulation (《计算机仿真》)
2024, Issue 8, pp. 497-501 (5 pages)
Funding
Sichuan Provincial Education Information Technology Research Project, 2022 (DSJ2022059).
Keywords
Imbalanced data set
Decision tree
Hybrid sampling