Unbalanced Data Classification Algorithm Based on Resampling and Feature Selection
Cited by: 16
Abstract: The SMOTE algorithm is widely used in research on imbalanced data, but noise in the original data set can blur class boundaries and alter the data distribution. This paper proposes the BSL-FSRF algorithm, which combines balanced sampling with feature selection. First, BSL sampling divides the minority-class samples into safe, noise, and boundary samples, applies SMOTE interpolation only to the boundary samples, and then cleans the data with Tomek links, so that the data set becomes roughly balanced while the number of noise samples is reduced. Second, the "hypothesis margin" idea is used to score each feature dimension; with a suitable threshold, features weakly correlated with the class label are removed, which reduces the dimensionality of the data. Finally, a random forest serves as the classifier, with its parameters tuned by an improved grid search algorithm. Experiments on public data sets show that BSL-FSRF clearly improves both the classification accuracy on minority-class samples and the overall performance of the classifier, while also reducing running time.
Authors: ZHANG Zhong-lin; CAO Ting-ting (College of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2020, No. 6, pp. 1327-1333 (7 pages)
Funding: Supported by the National Natural Science Foundation of China (61662043).
Keywords: imbalanced data; ReliefF feature selection; resampling; random forest; classification
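
The sampling stage described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact rules BSL uses to separate safe, noise, and boundary samples, so the thresholds below are an assumption borrowed from the common Borderline-SMOTE convention: with m majority-class points among a sample's k nearest neighbours, m = k means noise, k/2 <= m < k means boundary, and anything else means safe. The function names (`k_nearest`, `categorize_minority`, `smote_on_boundary`, `tomek_links`) are hypothetical and are not taken from the paper.

```python
import math
import random

def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]

def categorize_minority(X, y, minority=1, k=5):
    """Label each minority sample 'safe', 'boundary', or 'noise'.

    Assumed rule (Borderline-SMOTE convention, not confirmed by the abstract):
    m = majority samples among the k nearest neighbours;
    m == k -> noise, k/2 <= m < k -> boundary, otherwise safe.
    """
    labels = {}
    for i, label in enumerate(y):
        if label != minority:
            continue
        m = sum(1 for j in k_nearest(i, X, k) if y[j] != minority)
        if m == k:
            labels[i] = "noise"
        elif 2 * m >= k:
            labels[i] = "boundary"
        else:
            labels[i] = "safe"
    return labels

def smote_on_boundary(X, y, labels, minority=1, k=5, n_new=1, seed=0):
    """Create n_new synthetic points per boundary sample by SMOTE interpolation."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    min_pts = [X[i] for i in min_idx]
    synthetic = []
    for pos, i in enumerate(min_idx):
        if labels[i] != "boundary":
            continue
        # Neighbours are searched among minority samples only, as in SMOTE.
        neighbours = k_nearest(pos, min_pts, min(k, len(min_pts) - 1))
        if not neighbours:
            continue
        for _ in range(n_new):
            q = min_pts[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment from X[i] to q
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(X[i], q)))
    return synthetic

def tomek_links(X, y):
    """Pairs (i, j), i < j, of opposite-class points that are mutual nearest neighbours."""
    links = []
    for i in range(len(X)):
        j = k_nearest(i, X, 1)[0]
        if y[i] != y[j] and i < j and k_nearest(j, X, 1)[0] == i:
            links.append((i, j))
    return links

# Toy data: five minority points (label 1) followed by five majority points (label 0).
X = [(0, 0), (0, 1), (1, 0), (2.8, 0), (10, 10),
     (4, 0), (5, 0), (10, 11), (11, 10), (6, 0)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
labels = categorize_minority(X, y, k=2)
new_points = smote_on_boundary(X, y, labels, k=2, n_new=3)
links = tomek_links(X, y)
```

In this toy data only the point at (2.8, 0) is oversampled, and the isolated minority point at (10, 10) forms a Tomek link with a majority neighbour, which makes it a candidate for removal during cleaning. Which member of a Tomek link is removed (the majority point, or both) is a design choice the abstract leaves unspecified.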