Abstract
Class imbalance and dimensional explosion in big data seriously degrade the prediction efficiency and classification accuracy of algorithms. To address this, ASE-RFXT, a big-data classification method based on interpolation and feature compression, is proposed. First, the interpolation center of ADASYN (adaptive synthetic sampling approach) is improved, which reduces the noise introduced and improves the distribution of minority-class samples. Second, ReliefF (a feature weighting method) is improved and combined with the ensemble algorithm XGDT (extreme gradient dart tree) to weight features in parallel, which makes the weights less sensitive to outliers and the evaluation more accurate. Finally, low-weight redundant features are filtered out using the correlation between features, and the features are compressed by SFS (sequential forward selection) with the classification accuracy of XGDT as the evaluation index. Experimental results show that ASE-RFXT reduces the feature dimensionality, saves training time, and improves classification accuracy on imbalanced small-sample data.
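The final stage described in the abstract (correlation-based filtering of low-weight redundant features, followed by SFS driven by classification accuracy) can be illustrated with a minimal sketch. The abstract does not give the correlation measure, the threshold, or the stopping rule, so everything here is an assumption: Pearson correlation with a cutoff of 0.9, SFS that stops when accuracy no longer strictly improves, and a leave-one-out 1-nearest-neighbour accuracy standing in for the XGDT accuracy used in the paper. All function names, weights, and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def filter_redundant(columns, weights, corr_threshold=0.9):
    """Visit features from highest to lowest weight and keep a feature only if it is
    not strongly correlated with one already kept (assumed rule, not the paper's)."""
    order = sorted(range(len(columns)), key=lambda j: weights[j], reverse=True)
    kept = []
    for j in order:
        if all(abs(pearson(columns[j], columns[k])) < corr_threshold for k in kept):
            kept.append(j)
    return kept


def sequential_forward_selection(candidates, score):
    """Greedy SFS: add the feature that maximises score(subset) at each step and
    stop as soon as no candidate strictly improves the best score so far."""
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        scored = [(score(selected + [j]), j) for j in remaining]
        top_score, top_feature = max(scored, key=lambda t: t[0])
        if top_score <= best:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


def loo_1nn_accuracy(columns, labels, subset):
    """Leave-one-out 1-NN accuracy on the chosen features.
    Stand-in evaluator only; the paper uses XGDT accuracy instead."""
    n = len(labels)
    hits = 0
    for i in range(n):
        nearest = min(
            (t for t in range(n) if t != i),
            key=lambda t: sum((columns[j][i] - columns[j][t]) ** 2 for j in subset),
        )
        hits += labels[nearest] == labels[i]
    return hits / n


# Toy data: feature 0 is informative, feature 1 duplicates it, feature 2 is noise.
f0 = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
f1 = [2 * x for x in f0]
f2 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]
columns = [f0, f1, f2]
labels = [0, 0, 0, 1, 1, 1]
weights = [0.8, 0.6, 0.1]  # hypothetical feature weights from the weighting stage

kept = filter_redundant(columns, weights)
selected, best = sequential_forward_selection(
    kept, lambda subset: loo_1nn_accuracy(columns, labels, subset)
)
print("after correlation filter:", kept)
print("after SFS:", selected, "accuracy:", best)
```

On this toy data the filter drops feature 1 because it duplicates the higher-weighted feature 0, and SFS then keeps only feature 0, since adding the noise feature cannot raise the accuracy further. Swapping in a real classifier only requires replacing the `score` callable.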
Authors
孙永明
杨进
SUN Yongming; YANG Jin (School of Science, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source
《计算机工程与应用》
CSCD
PKU Core Journal (北大核心)
2022, Issue 1, pp. 106-112 (7 pages)
Computer Engineering and Applications
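Funding
Humanities and Social Sciences Planning Fund of the Ministry of Education of China (16YJA630037)
Shanghai First-Class Discipline Construction Project (S1201YLXK)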
Keywords
extreme gradient boosting
feature selection
adaptive sampling
feature weighting