An improved random forest algorithm based on hybrid sampling and feature selection
Cited by: 13
Abstract: The random forest algorithm is an ensemble of multiple decision trees built with Bagging sampling and a random feature-subset splitting strategy. Compared with other classification algorithms, random forest offers higher classification accuracy, lower generalization error, and fast training, so it has been widely applied in data mining. However, its classification performance is severely limited on data that are both high-dimensional and imbalanced. To handle such data better, this paper proposes an improved random forest algorithm based on hybrid sampling and feature selection (Hybrid Sampling & Feature Selection Random Forest, HF_RF). First, at the data level, the algorithm preprocesses the high-dimensional imbalanced dataset by combining the SMOTE algorithm with random under-sampling, and a clustering algorithm is introduced into SMOTE to improve its handling of negative-class samples. Second, at the algorithm level, the ReliefF algorithm assigns different weights to the features of the balanced high-dimensional data, and irrelevant and redundant features are eliminated to reduce dimensionality. Finally, a weighted voting principle is adopted to further improve classification performance. Experimental results show that the improved algorithm scores higher than the original algorithm on every evaluation metric for high-dimensional imbalanced data, which indicates that HF_RF classifies such data better than the traditional random forest algorithm.
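The hybrid sampling and weighted voting steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' HF_RF implementation: it uses plain SMOTE interpolation without the clustering improvement, omits ReliefF feature selection, and assumes each tree already has a weight (for example, a validation accuracy). The function names `smote`, `hybrid_sample`, and `weighted_vote` and the `target` parameter are hypothetical, chosen only for illustration.

```python
import random

def smote(minority, n_new, k=3, rng=None):
    """Plain SMOTE (no clustering step): build n_new synthetic points by
    interpolating between a minority sample and one of its k nearest
    minority neighbours. Needs at least two minority samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

def hybrid_sample(minority, majority, target, rng=None):
    """Over-sample the minority class up to `target` with SMOTE and randomly
    under-sample the majority class down to `target` (assumed parameter)."""
    if not len(minority) <= target <= len(majority):
        raise ValueError("target must lie between the two class sizes")
    rng = rng or random.Random(0)
    new_minority = minority + smote(minority, target - len(minority), rng=rng)
    new_majority = rng.sample(majority, target)  # random under-sampling
    return new_minority, new_majority

def weighted_vote(predictions, weights):
    """Return the label with the largest summed tree weight.
    Where the weights come from is an assumption of this sketch."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Toy data: 4 minority points and 20 majority points, balanced to 10 each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority = [[5.0 + i, 5.0] for i in range(20)]
new_minority, new_majority = hybrid_sample(minority, majority, target=10)

# One heavily weighted tree outvotes two lightly weighted ones.
winner = weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.4])  # -> "a"
```

In the voting example an unweighted majority vote would return "b", while the weighted vote returns "a" because the single tree voting "a" carries more weight (0.9) than the two trees voting "b" combined (0.7).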
Authors: WANG Lichun; LIU Shuisheng (College of Telecommunications & Information Engineering, Nanjing Institute of Technology, Nanjing 211167, China; Information Center of Jiangsu Tobacco Monopoly Bureau, Nanjing 210018, China)
Source: Journal of Nanjing University of Posts and Telecommunications: Natural Science Edition (PKU Core journal), 2022, No. 1, pp. 81-89 (9 pages)
Funding: Supported by the National Natural Science Foundation of China (61702258).
Keywords: random forest; hybrid sampling; feature selection; high-dimensional unbalanced data; HF_RF algorithm